[SERVER-41584] Deadlock shutting down the PeriodicRunner while the LogicalSessionCache is refreshing Created: 07/Jun/19  Updated: 25/Mar/20  Resolved: 25/Mar/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthew Saltz (Inactive) Assignee: Amirsaman Memaripour
Resolution: Duplicate Votes: 0
Labels: sharding-4.4-stabilization, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-44279 Make LogicalSessionCache synchronize ... Closed
Related
is related to SERVER-46841 Make PeriodicRunner interrupt blocked... Closed
Operating System: ALL
Participants:
Linked BF Score: 18

 Description   

There is a deadlock in the following scenario (a sketch of the failing interaction follows this list):

  1. The secondary nodes of the config server shard are shut down cleanly.
  2. The test runner then attempts to shut down the config server primary (mongod, process 28261, see stack trace above), and the shutdown process is initiated.
  3. The LogicalSessionCache then begins a refresh, which tries to create config.system.sessions and sends ShardsvrShardCollection to one of the shards.
  4. The shard sends ConfigsvrCreateCollection back to the config server primary, which tries to take the database dist lock and then hangs in ReplicationCoordinatorImpl::waitUntilOpTimeForRead, because the secondaries are down and the optime never advances.
  5. Meanwhile, the shutdown thread advances to shutting down the PeriodicRunner, but the LogicalSessionCacheRefresh job is running (and hanging) inside the PeriodicRunner, so the PeriodicRunner can never shut down.
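
To make the interaction concrete, here is a minimal, self-contained C++ sketch of the failure mode. It is not the server's actual PeriodicRunner or LogicalSessionCache code; ToyPeriodicRunner and its members are invented for illustration. The scheduled job stands in for the refresh stuck in ReplicationCoordinatorImpl::waitUntilOpTimeForRead, and because teardown signals and joins in a single step, the program deliberately hangs at shutdown(), mirroring step 5.

    // Illustrative only: a toy runner whose teardown both signals and joins.
    #include <atomic>
    #include <chrono>
    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <thread>

    class ToyPeriodicRunner {
    public:
        void schedule(std::function<void()> job) {
            _worker = std::thread([this, job = std::move(job)] {
                while (!_stopRequested.load()) {
                    job();  // if the job never returns, the stop flag is never re-checked
                }
            });
        }

        // Teardown signals the worker and immediately joins it, in one step.
        void shutdown() {
            _stopRequested.store(true);
            if (_worker.joinable())
                _worker.join();  // <-- the shutdown thread blocks here (step 5)
        }

    private:
        std::atomic<bool> _stopRequested{false};
        std::thread _worker;
    };

    int main() {
        std::mutex mtx;
        std::condition_variable cv;

        ToyPeriodicRunner runner;
        runner.schedule([&] {
            // Stands in for the LogicalSessionCache refresh blocked in
            // waitUntilOpTimeForRead: the optime never advances because the
            // secondaries are already down, so this wait never completes.
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, [] { return false; });
        });

        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        std::cout << "initiating shutdown...\n";
        runner.shutdown();  // deadlocks: join() waits on a job that never returns
        std::cout << "never reached\n";
    }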

There are a lot of potential fixes, including:

  1. Separate PeriodicRunner shutdown from join, so that we can kill all operations after shutting down the runner and then join the jobs afterwards (see the sketch after this list).
  2. Put a timeout on the read performed for the dist lock acquisition.
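
Below is a hedged sketch of option (1), reusing the toy runner from the description: teardown is split into a non-blocking stop() and a separate join(), so the shutdown sequence can interrupt whatever the jobs are blocked on in between. The names and the interruption mechanism (a plain flag plus condition variable) are illustrative assumptions, not the server's actual API.

    // Illustrative only: split teardown so blocked operations can be killed
    // between stop() and join().
    #include <atomic>
    #include <chrono>
    #include <condition_variable>
    #include <functional>
    #include <iostream>
    #include <mutex>
    #include <thread>

    class ToyPeriodicRunner {
    public:
        void schedule(std::function<void()> job) {
            _worker = std::thread([this, job = std::move(job)] {
                while (!_stopRequested.load()) {
                    job();
                }
            });
        }

        // Phase 1: request that the worker stop, but do not block waiting for it.
        void stop() {
            _stopRequested.store(true);
        }

        // Phase 2: wait for the worker only after blocked operations were killed.
        void join() {
            if (_worker.joinable())
                _worker.join();
        }

    private:
        std::atomic<bool> _stopRequested{false};
        std::thread _worker;
    };

    int main() {
        std::mutex mtx;
        std::condition_variable cv;
        bool interrupted = false;  // stands in for killing/interrupting operations

        ToyPeriodicRunner runner;
        runner.schedule([&] {
            // Stands in for the refresh blocked in waitUntilOpTimeForRead;
            // it only returns once the operation is interrupted.
            std::unique_lock<std::mutex> lk(mtx);
            cv.wait(lk, [&] { return interrupted; });
        });

        std::this_thread::sleep_for(std::chrono::milliseconds(200));

        runner.stop();  // 1. signal shutdown without blocking
        {
            std::lock_guard<std::mutex> lk(mtx);
            interrupted = true;  // 2. kill all operations, waking blocked waits
        }
        cv.notify_all();
        runner.join();  // 3. now the worker can actually exit

        std::cout << "shutdown completed without deadlocking\n";
    }

Whether that interruption belongs inside the runner itself or in the wider shutdown sequence is the design question; per the comments below, SERVER-46841 (making the PeriodicRunner interrupt blocked jobs) is the change expected to address this.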


 Comments   
Comment by Amirsaman Memaripour [ 25/Mar/20 ]

SERVER-46841 should fix this issue.

Comment by Misha Tyulenev [ 27/Jun/19 ]

This issue is separate from SERVER-41217, since it is a deadlock not related to the ShardRegistry reload. However, it should still be fixed, and I think the fixes may be related.
