[SERVER-51055] ServiceContext getters/setters are racing Created: 18/Sep/20  Updated: 12/Dec/23

Status: Open
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Billy Donahue Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 0
Labels: sa-remove-fv-backlog-22
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-84118 Investigate where to reset FastClockS... Closed
Gantt Dependency
has to be done after SERVER-58450 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58451 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58452 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58453 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58454 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58455 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58456 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58457 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58458 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58459 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58461 Avoid resetting `ServiceContext` memb... Closed
has to be done after SERVER-58462 Avoid resetting `ServiceContext` memb... Closed
Problem/Incident
causes SERVER-51165 synchronized ServiceContext getters a... Closed
Related
related to SERVER-59157 Add support for a shared-mutex type Backlog
related to SERVER-58336 Use the fixture's instance of Service... Closed
Assigned Teams:
Service Arch
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

BF-18712

Sprint: Service arch 2020-10-05, Service arch 2020-10-19, Service Arch 2021-03-08, Service Arch 2021-03-22, Service Arch 2021-04-05, Service Arch 2021-04-19, Service Arch 2021-05-17, Service Arch 2021-07-12, Service Arch 2021-08-09
Participants:
Linked BF Score: 5
Story Points: 4

 Description   

BF-18712 shows a TSAN true positive where setFastClockSource races with getFastClockSource due to latch_detail::Mutex::_onContendedLock's invocation of getFastClockSource before the setFastClockSource could occur.

There is no synchronization on the unique_ptr fields of ServiceContext, so these problems are unavoidable. Rather than making new obscure rules about when it's ok to call the getters, we can rather easily add synchronization to these fields, as they are private members. They could be stored in a synchronized unique_ptr wrapper.

 

Acceptance Criteria: 

ServiceContext will call terminate if a setter happens after a given property has been accessed. Create a patch build to reproduce this and identify all the use cases where this was happening to create tickets that require a fix. 



 Comments   
Comment by Billy Donahue [ 30/Sep/20 ]

Exploratory work to see if the setters can be forced to only occur once.

(not really for review) https://mongodbcr.appspot.com/676640003/

Patch queue: https://spruce.mongodb.com/version/5f74d32f2a60ed66c518e2cf/tasks

Already had to make a carve-out for setServiceEntryPoint, which is set twice by mongod_main.cpp, but I don't know why we do that yet.

Lots of tests in the patch queue are triggering an invariant that set() should be called only once. The tests are calling setters for setOpObserver, setTransportLayer, setTickSource, etc... multiple times. Would be some work to fix all of it.

Comment by Billy Donahue [ 28/Sep/20 ]

The atomicity of setters and getters added by commit 9ec023a45d is necessary but isn't enough to fix the ordering problem. We need further work to enforce invariants about ServiceContext setters occurring after getters.

Reopening.

Comment by Billy Donahue [ 22/Sep/20 ]

Should have fixed TSAN failures.

This is a band-aid. All that's been added is synchronized access to these unique_ptrs, so that at least the getters return a coherent value, but we still need higher-level semantics to ensure the validity of any getter-returned raw pointers. A subsequent setter can still immediately invalidate that pointer. Without shared_ptr or another kind of shared locking, we won't be able to do better than that. Frankly the getters might need to return reference-holding objects like shared_ptr's to do it right.

Comment by Githook User [ 22/Sep/20 ]

Author:

{'name': 'Billy Donahue', 'email': 'billy.donahue@mongodb.com', 'username': 'BillyDonahue'}

Message: SERVER-51055 ServiceContext::SyncUnique to fix unique_ptr races.

  • The generic template synchronized_value cannot predict the
    acquisition level of its instantiations or instances. Take a
    policy parameter to allow mutex details to be customized.
  • simplify synchronized_value implementation
  • merge const_update_guard with update_guard as 1 nested class
  • ServiceContext SyncUnique must use a raw mutex policy to avoid
    recursion when Latch Diagnostics accesses ServiceContext
    Branch: master
    https://github.com/mongodb/mongo/commit/9ec023a45dff4f8b3b837bee2c99442f0eec5316
Comment by Billy Donahue [ 18/Sep/20 ]

CR https://mongodbcr.appspot.com/673000001/

Generated at Thu Feb 08 05:24:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.