[SERVER-67644] Certain ServerStatusMetricFields can lead to a data race Created: 29/Jun/22 Updated: 29/Oct/23 Resolved: 14/Aug/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 6.1.0-rc0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Daniel Morilha (Inactive) | Assignee: | Billy Donahue |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | tsan | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Sprint: | Service Arch 2022-08-22 | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
Unless ServerStatusMetricField relies on a storage container which provides thread-safe primitives, the serverStatus command can report corrupted data if one thread is reading the contents of a field while another is writing to it. The current code base contains at least one instance of it where a ServerStatusMetricField is built out of a std::string which doesn't guarantee thread safety. It is worth to notice that this should be carefully designed as it could lead to service degradation due to the introduction of currently inexistent contention and metrics are spread throughout all of the code base. Acceptance Criteria: Either create a mechanism to atomically swap the contents of a ServerStatusMetricField or make sure all stored types are thread-safe by design. |
| Comments |
| Comment by Githook User [ 14/Aug/22 ] |
|
Author: {'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}Message: |
| Comment by Billy Donahue [ 13/Aug/22 ] |
|
We have a `synchronized_value<T>` that we can leverage here to make arbitrary metrics thread-safe. |
| Comment by Alex Neben [ 12/Aug/22 ] |
|
This was picked up by the thread sanitizer and I am not convinced this is a good fix. I think while this API is relatively new we can make it race free (so a situation like this can never happen in the future). I suggest we do something similar to https://openmetrics.io/ (which is the same as Prometheus) where they provide ~5 types of metrics. makeServerStatusMetric could be split into about 5 functions makeServerStatusCounter/makeServerStatusSafeString/makeServerStatusNumber/etc that provide thread safe metrics and prevent an issue like this in the future. |
| Comment by Billy Donahue [ 12/Jul/22 ] |
|
After looking at the ServerStatusMetric uses, I think the "lastStateTransition" std::string metric is the only case of a race. One possible fix would be changing it from std::string to synchronized_value<std::string>. This might work because synchronized_value is assignable convertible from its value_type. but it might not work. Depends on how the metric is used. Maybe a generic synchronized_value metric would be useful. On the other hand maybe that one case where repl is using a string as a metric can be done in a different way. |