[SERVER-47175] Possible shutdown-order deadlock between LogicalSessionsCache and ReplicationCoordinator Created: 30/Mar/20  Updated: 06/Dec/22  Resolved: 10/Apr/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.2.5, 4.0.17
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams:
Sharding
Operating System: ALL
Participants:

 Description   

There is a possible shutdown-order deadlock between the LogicalSessionsCache and the ReplicationCoordinator, which looks like this:

The LogicalSessionsCache's thread calls into the catalog cache in order to fetch routing info for the config.system.sessions collection.

The catalog cache has been performing network operations under a mutex since the beginning of time; on the config server these convert to local storage engine/disk operations. This means that if called at an inopportune moment by the LogicalSessionCache, it could cause its thread to block waiting for the majority snapshot to advance. (The call being under a mutex is not relevant here; what matters is that the operations convert to local reads on the config server due to ShardLocal.)

The LogicalSessionsCache is shut down and joined before the transport layer, and all of this happens before ReplicationCoordinator::shutdown. This means that the replication coordinator depends on the LogicalSessionCache shutting down before it can itself shut down, while the LogicalSessionCache's thread may be blocked waiting on replication to advance the majority snapshot, which is a circular dependency.

The only thing that prevents this deadlock from happening is that the shutdown command happens to step down the replication coordinator first, but this is a coincidental and lucky occurrence that could be inadvertently broken.
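
The ordering hazard can be illustrated with a toy model. This is only a sketch and not MongoDB code: ToyReplCoordinator, wait_for_majority_snapshot, step_down and run_shutdown are hypothetical names invented for this example, and a join timeout stands in for what would be an indefinite hang in the real server. It assumes that stepping down interrupts threads waiting on the majority snapshot.

```python
import threading


class ToyReplCoordinator:
    """Hypothetical stand-in for the replication coordinator (not the real API)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._snapshot_advanced = False
        self._interrupted = False

    def wait_for_majority_snapshot(self):
        """Block until the snapshot advances or waiters are interrupted."""
        with self._cond:
            self._cond.wait_for(
                lambda: self._snapshot_advanced or self._interrupted
            )
            return self._snapshot_advanced

    def step_down(self):
        """Assumed behaviour: stepping down interrupts blocked waiters."""
        with self._cond:
            self._interrupted = True
            self._cond.notify_all()


def run_shutdown(step_down_first, join_timeout=0.5):
    """Return True if the session-cache thread could be joined in time."""
    repl = ToyReplCoordinator()
    # Stand-in for the session cache thread blocked on a local majority read.
    worker = threading.Thread(
        target=repl.wait_for_majority_snapshot, daemon=True
    )
    worker.start()

    if step_down_first:
        repl.step_down()  # what the shutdown command happens to do today

    # Shutdown joins the session cache before the coordinator is shut down.
    worker.join(timeout=join_timeout)
    joined = not worker.is_alive()

    # Clean-up for the demo only: release the worker so no thread is left blocked.
    repl.step_down()
    worker.join()
    return joined
```

In this model, run_shutdown(step_down_first=True) joins the worker promptly, whereas run_shutdown(step_down_first=False) times out on the join, which corresponds to the shutdown hanging if the step down were ever removed or reordered.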



 Comments   
Comment by Ratika Gandhi [ 10/Apr/20 ]

We don't anticipate that this will be a problem because of the project that made step down a part of the replication coordinator. 
