There is a deadlock in the following scenario:

The secondary nodes of the config server shard are shut down cleanly.
The test runner then attempts to shut down the primary node of the config server shard (mongod, process 28261; see stack trace above), and the shutdown process is initiated.
The LogicalSessionCache then begins a refresh, which tries to create config.system.sessions. This sends ShardsvrShardCollection to one of the shards.
The shard then sends ConfigsvrCreateCollection back to the config server primary, which tries to take the database dist lock and then hangs in ReplicationCoordinatorImpl::waitUntilOpTimeForRead, because the secondaries are down and the optime never advances.
In the meantime, the shutdown thread advances to shutting down the PeriodicRunner. The LogicalSessionCacheRefresh job is running inside the PeriodicRunner and is hung, so the PeriodicRunner can never shut down.
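
The shape of the hang can be illustrated with a small model. This is only a sketch in Python, not the server's C++ code: the names `optime_advanced` and `refresh_job` are hypothetical stand-ins for the optime wait and the LogicalSessionCacheRefresh job, and the timeout on `join` exists only so that the demonstration terminates.

```python
import threading

# Hypothetical stand-in for "the optime advances". It is never set during the
# scenario because the secondaries are down.
optime_advanced = threading.Event()


def refresh_job():
    # Models the refresh blocking in waitUntilOpTimeForRead.
    optime_advanced.wait()


job = threading.Thread(target=refresh_job, daemon=True)
job.start()

# Models the shutdown thread joining the runner's job. A join() with no
# timeout would block forever here; the timeout only lets the demo finish.
job.join(timeout=0.2)
still_blocked = job.is_alive()
print("job still blocked after join attempt:", still_blocked)

# Release the thread so the demonstration exits cleanly.
optime_advanced.set()
job.join()
```

In the model, nothing ever wakes the waiting job, which is why the thread joining it cannot make progress either.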

There are several potential fixes, including:

Separate PeriodicRunner shutdown from join, so that we can kill all operations after shutting down the runner and then join the jobs after that.
Put a timeout on the read for the dist lock acquisition.
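
Both ideas can be sketched in a small Python model. This is an illustration under stated assumptions, not the server's implementation: `TinyRunner`, `OperationInterrupted`, and `wait_for_optime` are hypothetical names, and the kill event stands in for whatever mechanism interrupts operations at shutdown.

```python
import threading
import time


class OperationInterrupted(Exception):
    """Hypothetical signal that an operation was killed during shutdown."""


def wait_for_optime(optime_advanced, killed, timeout=None):
    # Hypothetical interruptible wait: returns when the optime advances,
    # raises if the operation is killed or the optional timeout expires.
    deadline = None if timeout is None else time.monotonic() + timeout
    while not optime_advanced.is_set():
        if killed.is_set():
            raise OperationInterrupted()
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("optime did not advance in time")
        optime_advanced.wait(0.01)


class TinyRunner:
    """Hypothetical runner that separates stop() from join()."""

    def __init__(self, job):
        self._job = job
        self._stopped = threading.Event()
        self.running = threading.Event()
        self.outcome = None
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._stopped.is_set():
            self.running.set()
            try:
                self._job()
                self.outcome = "completed"
            except OperationInterrupted:
                self.outcome = "interrupted"
            except TimeoutError:
                self.outcome = "timed out"

    def stop(self):
        # Only prevents further runs; does not wait for the current one.
        self._stopped.set()

    def join(self, timeout=None):
        self._thread.join(timeout)
        return not self._thread.is_alive()


never_advances = threading.Event()  # the secondaries are down

# Fix 1: stop the runner, kill operations, and only then join.
kill_a = threading.Event()
runner_a = TinyRunner(lambda: wait_for_optime(never_advances, kill_a))
runner_a.start()
runner_a.running.wait(1)
runner_a.stop()
kill_a.set()
joined_a = runner_a.join(timeout=2)
outcome_a = runner_a.outcome

# Fix 2: bound the wait with a timeout, so the job gives up on its own.
kill_b = threading.Event()  # never set in this case
runner_b = TinyRunner(lambda: wait_for_optime(never_advances, kill_b, timeout=0.1))
runner_b.start()
runner_b.running.wait(1)
runner_b.stop()
joined_b = runner_b.join(timeout=2)
outcome_b = runner_b.outcome

print(joined_a, outcome_a, joined_b, outcome_b)
```

In this model, the first approach relies on the blocked wait being interruptible once operations are killed, while the second bounds the wait regardless of whether anything interrupts it.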

is duplicated by

SERVER-44279 Make LogicalSessionCache synchronize with system shutdown

is related to

SERVER-46841 Make PeriodicRunner interrupt blocked operations on stop

related to

SERVER-46841 Make PeriodicRunner interrupt blocked operations on stop