[SERVER-67644] Certain ServerStatusMetricFields can lead to a data race Created: 29/Jun/22  Updated: 29/Oct/23  Resolved: 14/Aug/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Daniel Morilha (Inactive) Assignee: Billy Donahue
Resolution: Fixed Votes: 0
Labels: tsan
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-65987 ServerStatusMetric API refresh Closed
is related to SERVER-67333 ServerStatusMetric API refresh follow-up Closed
Backwards Compatibility: Fully Compatible
Sprint: Service Arch 2022-08-22
Participants:

 Description   

Unless ServerStatusMetricField relies on a storage container which provides thread-safe primitives, the serverStatus command can report corrupted data if one thread is reading the contents of a field while another is writing to it.

The current code base contains at least one instance of it where a ServerStatusMetricField is built out of a std::string which doesn't guarantee thread safety.

It is worth to notice that this should be carefully designed as it could lead to service degradation due to the introduction of currently inexistent contention and metrics are spread throughout all of the code base.

Acceptance Criteria: Either create a mechanism to atomically swap the contents of a ServerStatusMetricField or make sure all stored types are thread-safe by design.



 Comments   
Comment by Githook User [ 14/Aug/22 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-67644 fix racy std::string metric
Branch: master
https://github.com/mongodb/mongo/commit/ad7b2d0fa2e8557bcb83170d599992710b883db4

Comment by Billy Donahue [ 13/Aug/22 ]

We have a `synchronized_value<T>` that we can leverage here to make arbitrary metrics thread-safe.
I had to use that anyway for the std::string metric fix so we might as well make it generally available.
Then we would support anything that can be .append'ed to a BSONObjBuilder.

Comment by Alex Neben [ 12/Aug/22 ]

This was picked up by the thread sanitizer and I am not convinced this is a good fix. I think while this API is relatively new we can make it race free (so a situation like this can never happen in the future).

I suggest we do something similar to https://openmetrics.io/ (which is the same as Prometheus) where they provide ~5 types of metrics. makeServerStatusMetric could be split into about 5 functions makeServerStatusCounter/makeServerStatusSafeString/makeServerStatusNumber/etc that provide thread safe metrics and prevent an issue like this in the future.

Comment by Billy Donahue [ 12/Jul/22 ]

After looking at the ServerStatusMetric uses, I think the "lastStateTransition" std::string metric is the only case of a race.

https://github.com/10gen/mongo/blob/35eb6deef0f47505104e8ff066c695369776c070/src/mongo/db/repl/replication_coordinator_impl.cpp#L166-L169

One possible fix would be changing it from std::string to synchronized_value<std::string>. This might work because synchronized_value is assignable convertible from its value_type. but it might not work. Depends on how the metric is used. Maybe a generic synchronized_value metric would be useful.

On the other hand maybe that one case where repl is using a string as a metric can be done in a different way.

Generated at Thu Feb 08 06:08:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.