- Type: Bug
- Resolution: Duplicate
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Sharding
- ALL
- 18
There is a deadlock in the following scenario:
- The secondary nodes of the config server shard are shut down cleanly
- The test runner then attempts to shut down the primary node of the config server shard (mongod, process 28261; see stack trace above), which initiates the shutdown process
- The LogicalSessionCache then begins a refresh, which tries to create config.system.sessions. This sends ShardsvrShardCollection to one of the shards.
- The shard then sends ConfigsvrCreateCollection back to the config server primary. The primary tries to take the database dist lock and hangs in ReplicationCoordinatorImpl::waitUntilOpTimeForRead, because the secondaries are down and the optime never advances
- In the meantime, the shutdown thread advances to shutting down the PeriodicRunner. The LogicalSessionCacheRefresh job is still running inside the PeriodicRunner and is blocked, so the PeriodicRunner can never shut down.
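The last step of the scenario can be reproduced in miniature. The Python sketch below is an illustration only, not MongoDB code: `NaivePeriodicRunner`, `refresh_job`, and `demo_hang` are hypothetical names, and a `threading.Event` that is never set stands in for the optime that never advances. It assumes a runner whose `stop()` signals and joins in a single step, and it uses a join timeout only so that the demonstration itself terminates.

```python
import threading


class NaivePeriodicRunner:
    """Hypothetical runner whose stop() signals and joins in one step."""

    def __init__(self, job):
        self._stop_requested = threading.Event()
        # Daemon thread so the stuck job does not keep the interpreter alive.
        self._thread = threading.Thread(target=self._run, args=(job,), daemon=True)

    def start(self):
        self._thread.start()

    def _run(self, job):
        # The stop flag is only checked between job runs, so a job that
        # blocks forever never observes it.
        while not self._stop_requested.is_set():
            job()

    def stop(self, join_timeout):
        self._stop_requested.set()
        self._thread.join(join_timeout)
        # True means the job thread exited; False means it is still stuck.
        return not self._thread.is_alive()


def demo_hang():
    optime_advanced = threading.Event()  # never set: the secondaries are down
    job_started = threading.Event()

    def refresh_job():
        job_started.set()
        optime_advanced.wait()  # stand-in for the blocked optime wait

    runner = NaivePeriodicRunner(refresh_job)
    runner.start()
    job_started.wait()  # make sure the job is in flight before stopping
    return runner.stop(join_timeout=0.2)
```

In this sketch `demo_hang()` returns `False`: the stop flag is set, but the job never returns to the loop to see it. Without the join timeout, `stop()` would block indefinitely, which mirrors the shutdown thread in the scenario above.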
Several fixes are possible, including:
- Separate PeriodicRunner shutdown from join, so that we can kill all operations after shutting down the runner and then join jobs after that
- Put a timeout on the read for the distlock acquisition
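Both ideas can be sketched in Python under the same simplifying assumptions. Everything here is hypothetical and is not the server's API: `OperationContext`, `Interrupted`, `PeriodicRunner`, and the two `demo_*` functions are illustrative names, and `wait_for_optime` models an optime that never advances, so the only exits are a kill or a timeout.

```python
import threading


class Interrupted(Exception):
    """Raised when a blocked operation is killed (hypothetical)."""


class OperationContext:
    """Hypothetical interruptible context for a blocked wait."""

    def __init__(self):
        self._killed = threading.Event()

    def kill(self):
        self._killed.set()

    def wait_for_optime(self, timeout=None):
        # The optime never advances in this sketch, so the wait ends only
        # through a kill or through the timeout expiring.
        if self._killed.wait(timeout):
            raise Interrupted("operation killed")
        raise TimeoutError("optime did not advance in time")


class PeriodicRunner:
    """Hypothetical runner with shutdown() and join() as separate steps."""

    def __init__(self, job):
        self._job = job
        self._stop_requested = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.opctx = OperationContext()
        self.outcome = None  # name of the exception that ended the last job

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stop_requested.is_set():
            try:
                self._job(self.opctx)
            except (Interrupted, TimeoutError) as exc:
                self.outcome = type(exc).__name__

    def shutdown(self):
        # Only signals; does not wait for the running job.
        self._stop_requested.set()

    def join(self, timeout=None):
        self._thread.join(timeout)
        return not self._thread.is_alive()


def _start_blocked_runner(wait_timeout):
    started = threading.Event()

    def refresh_job(opctx):
        started.set()
        opctx.wait_for_optime(wait_timeout)

    runner = PeriodicRunner(refresh_job)
    runner.start()
    started.wait()  # the job is now in flight
    return runner


def demo_split_shutdown_and_join():
    runner = _start_blocked_runner(wait_timeout=None)
    runner.shutdown()    # 1. stop scheduling further runs
    runner.opctx.kill()  # 2. kill the outstanding operation
    return runner.join(timeout=2.0), runner.outcome  # 3. join


def demo_wait_timeout():
    runner = _start_blocked_runner(wait_timeout=0.1)
    runner.shutdown()
    return runner.join(timeout=2.0), runner.outcome
```

In this sketch `demo_split_shutdown_and_join()` returns `(True, "Interrupted")` and `demo_wait_timeout()` returns `(True, "TimeoutError")`. The first variant unblocks immediately but depends on the shutdown path killing operations between the two phases. The second needs no cooperation from shutdown, but the join can take up to one full timeout period.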
- is duplicated by
  - SERVER-44279 Make LogicalSessionCache synchronize with system shutdown (Closed)
- is related to
  - SERVER-46841 Make PeriodicRunner interrupt blocked operations on stop (Closed)
- related to
  - SERVER-46841 Make PeriodicRunner interrupt blocked operations on stop (Closed)