[SERVER-66303] Ensure ReplicaSetMonitorManager shuts down before ServiceContextTest resets the global context Created: 06/May/22  Updated: 29/Oct/23  Resolved: 29/Jun/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Task Priority: Major - P3
Reporter: Janna Golden Assignee: Randolph Tan
Resolution: Fixed Votes: 0
Labels: neweng, sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-67478 Get rid of getGlobalServiceContext Backlog
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2022-06-27, Sharding 2022-07-11
Participants:
Linked BF Score: 14

 Description   

The issue is that setGlobalServiceContext is not thread safe. getGlobalServiceContext is also written in a way that assumes that setGlobalServiceContext is only called once at the beginning when there is only a single thread of context. Cpp tests that repeatedly calls setGlobalServiceContext can run into race with setGlobalServiceContext setting the global variable to null and before the destructor of the previous service context gets run (which is right after the end of this scope). Since the set and get are not protected by mutex, threads owned by the old service context trying to call getGlobalServiceContext will be able to observe that glolbal context variable has been set to null even before it completely gets destroyed.

One particular workaround for the cpp tests is to shutdown and join the threads that is known to access getGlobalServiceContext before calling setGlobalServiceContext or replace the getGlobalServiceContext calls with references to serviceContext and make sure that ServiceContext destructor joins all threads that can reference it.

Original description:

The RSMM is a decoration on the ServiceContext. For any unit tests that inherit from ServiceContextTest and link in the RSM/RSMM, ServiceContext's destructor is called before the ReplicaSetMonitorManager's (because the RSMM's destructor is called as part of ~DecorationContainer, which is a member of Decorable objects). The RSM uses the RSMM's executor, which is shutdown in RSMM::shutdown(), which is called from ~RSMM. So, its possible for a task still to exist on this executor, and run after the ServiceContext has been destroyed - if the task tries to grab the service context, the test will crash (as happened in the linked BF).

We attempted to fix this by using ServiceContext::ConstructorActionRegisterer to declare a destructor to be called before the ServiceContext is destroyed, but the ServiceContext used in ServiceContextTest isn't constructed using UniqueServiceContext, so it won't get registered.

You can repro the issue in the linked BF by placing a call to getGlobalServiceContext() in RSMM::shutdown() (and running the test that failed in the BF - tenant_migration_donor_service_test).



 Comments   
Comment by Githook User [ 29/Jun/22 ]

Author:

{'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}

Message: SERVER-66303 Ensure ReplicaSetMonitorManager shuts down before ServiceContextTest resets the global context
Branch: master
https://github.com/mongodb/mongo/commit/cef61424315589689c9c764b798494268239340e

Comment by Max Hirschhorn [ 13/Jun/22 ]

We attempted to fix this by using ServiceContext::ConstructorActionRegisterer to declare a destructor to be called before the ServiceContext is destroyed, but the ServiceContext used in ServiceContextTest isn't constructed using UniqueServiceContext, so it won't get registered.

If the ServiceContext used by C++ unit tests isn't respecting its ConstructorActionRegisterer, then it seems like the C++ test environment isn't being faithful to a typical mongod environment. This seems like an issue which would extend beyond the ReplicaSetMonitorManager. We should see if that's the case and engage with the Service Arch team as needed.

Otherwise, we should look into solutions which have perhaps have the TenantMigrationDonorServiceTest test fixture explicitly shut down the ReplicaSetMonitors it ended up starting as part of its tearDown() method (for example, what dbclient_rs_test.cpp does).

Generated at Thu Feb 08 06:05:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.