[SERVER-76854] Revisit _configsvrSetAllowMigrations command use of sessions Created: 04/May/23  Updated: 05/May/23  Resolved: 05/May/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-76836 setAllowMigrations is executing remot... Closed
Assigned Teams:
Sharding EMEA
Operating System: ALL
Participants:

 Description   

The command should do one of the following:
(a) Not have any session attached to it.
(b) Not keep the session checked out while running a blocking network call.
(c) Have a timeout, so that it is guaranteed to check the current session back in.
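
As a rough illustration of options (b) and (c), here is a minimal Python sketch. It is not the server's implementation: ToySessionCatalog, run_with_session_released and run_with_timeout are hypothetical names, and a plain lock per session id stands in for session checkout.

```python
import threading


class ToySessionCatalog:
    """Hypothetical stand-in for a session catalog: one lock per session id."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, session_id):
        with self._guard:
            return self._locks.setdefault(session_id, threading.Lock())

    def check_out(self, session_id, timeout=None):
        """Return True if the session was checked out, False on timeout."""
        lock = self._lock_for(session_id)
        if timeout is None:
            return lock.acquire()
        return lock.acquire(timeout=timeout)

    def check_in(self, session_id):
        self._lock_for(session_id).release()


def run_with_session_released(catalog, session_id, blocking_call):
    """Option (b): check the session back in before the blocking call."""
    catalog.check_out(session_id)
    try:
        pass  # local work that needs the session would go here
    finally:
        catalog.check_in(session_id)
    # The session is free while the (assumed) network call blocks.
    return blocking_call()


def run_with_timeout(catalog, session_id, blocking_call, timeout_s):
    """Option (c): bound the blocking call so the session is always released."""
    catalog.check_out(session_id)
    try:
        return blocking_call(timeout_s)
    finally:
        catalog.check_in(session_id)
```

In this sketch, option (b) lets another operation (such as session migration) check out the same session while the network call is in flight, whereas option (c) only guarantees that the session is released eventually, after at most timeout_s.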

In the config shard setup, the following deadlock can occur:

shardA (also config server)
shardB

1. moveChunk from shardB to shardA.
2. shardA: A DDL operation calls sharding_ddl_util::stopMigrations. For example, in renameCollection, session X is attached to the _configsvrSetAllowMigrations command that it sends to the config server.
3. shardA (also config server): Session X is checked out while running _configsvrSetAllowMigrations.
4. shardA: During session migration, the destination encounters a session with id X and tries to check it out, but is blocked because _configsvrSetAllowMigrations still has it checked out.
5. shardA: _configsvrSetAllowMigrations sends _flushRoutingTableCacheUpdatesWithWriteConcern to all shards.
6. shardB: _flushRoutingTableCacheUpdatesWithWriteConcern waits for the migration source to finish (via recoverRefresh -> wait for migration abort future).
7. shardB: As part of the abort, the migration source waits for _recvChunkReleaseCritSec to succeed. Because session migration is still ongoing on the destination, that command always returns an error. Meanwhile, shardA cannot make progress, because session migration is blocked waiting for _configsvrSetAllowMigrations to release the session.
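
The circular wait in the steps above can be made explicit with a small wait-for graph. The Python sketch below is a simplified model, not server code: the node labels and the find_cycle helper are made up for illustration, and each actor is assumed to wait on exactly one other actor.

```python
# Simplified wait-for graph for the scenario above: key waits for value.
WAITS_FOR = {
    "session migration (shardA)": "_configsvrSetAllowMigrations (shardA)",
    "_configsvrSetAllowMigrations (shardA)":
        "_flushRoutingTableCacheUpdatesWithWriteConcern (shardB)",
    "_flushRoutingTableCacheUpdatesWithWriteConcern (shardB)":
        "migration abort (shardB)",
    "migration abort (shardB)": "_recvChunkReleaseCritSec (shardA)",
    "_recvChunkReleaseCritSec (shardA)": "session migration (shardA)",
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    seen = []
    node = start
    while node is not None and node not in seen:
        seen.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended at an actor that is not waiting
    return seen[seen.index(node):]


deadlock = find_cycle(WAITS_FOR, "session migration (shardA)")

# Option (b): the command no longer holds session X during the network call,
# so session migration does not have to wait for it.
without_checkout = dict(WAITS_FOR)
del without_checkout["session migration (shardA)"]
resolved = find_cycle(without_checkout, "_configsvrSetAllowMigrations (shardA)")
```

In this model, deadlock contains all five actors, while resolved is None: once the session checkout edge is removed, the chain ends at session migration, which can then finish and let _recvChunkReleaseCritSec succeed.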



 Comments   
Comment by Marcos José Grillo Ramirez [ 05/May/23 ]

Closing this ticket because option (b) is already underway in SERVER-76836.

Generated at Thu Feb 08 06:33:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.