[SERVER-76836] setAllowMigrations is executing remote calls with a session checked out Created: 04/May/23  Updated: 29/Oct/23  Resolved: 11/May/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Marcos José Grillo Ramirez Assignee: Marcos José Grillo Ramirez
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-76854 Revisit _configsvrSetAllowMigrations ... Closed
Problem/Incident
is caused by SERVER-73539 stopMigrations/resumeMigrations don't... Closed
Related
related to SERVER-76720 Chunk Migration migrates the session ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding EMEA 2023-05-15
Participants:
Linked BF Score: 110

 Description   

SERVER-73539 added replay protection to the setAllowMigrations command. However, this implies keeping a session checked out while a refresh on all shards is happening. In a config shard setup, the following deadlock can occur:

shardA (also config server)
shardB

1. moveChunk from shardB to shardA.
2. shardA: Some DDL operation calls sharding_ddl_util::stopMigrations. For example, in renameCollection, a session X is attached to the _configsvrSetAllowMigrations command it sends to the config server.
3. shardA (also config server): session X is checked out while running _configsvrSetAllowMigrations.
4. shardA: during session migration, the destination encounters a session with id X and tries to check it out, but is blocked because _configsvrSetAllowMigrations has it checked out.
5. shardA: _configsvrSetAllowMigrations sends _flushRoutingTableCacheUpdatesWithWriteConcern to all shards.
6. shardB: _flushRoutingTableCacheUpdatesWithWriteConcern waits for the migration source to finish (via recoverRefresh -> wait for migration abort future).
7. shardB: as part of the abort, it waits for _recvChunkReleaseCritSec to succeed. Since session migration is still ongoing on the destination, that command will always return an error. But shardA is stuck because session migration is blocked waiting for _configsvrSetAllowMigrations to release the session.
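
The circular wait in steps 3-7 can be made explicit with a small wait-for graph. The following is only an illustrative sketch: the node labels are made up for this example and are not server identifiers, and each operation is assumed to wait on exactly one other operation.

```python
# Hypothetical model of the wait-for relationships in steps 3-7.
# Keys wait on values; the labels are invented for this sketch.
WAITS_FOR = {
    "configsvrSetAllowMigrations@shardA": "flushRoutingTable@shardB",
    "flushRoutingTable@shardB": "migrationAbort@shardB",
    "migrationAbort@shardB": "recvChunkReleaseCritSec@shardA",
    "recvChunkReleaseCritSec@shardA": "sessionMigration@shardA",
    # Session migration needs session X, which the command still holds.
    "sessionMigration@shardA": "configsvrSetAllowMigrations@shardA",
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    seen = set()
    node = start
    while node is not None and node not in seen:
        seen.add(node)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None  # the chain ended, so nothing waits forever
    return path[path.index(node):]


cycle = find_cycle(WAITS_FOR, "configsvrSetAllowMigrations@shardA")

# If the command did not hold session X during the remote call,
# the last edge would not exist and the chain would terminate.
without_session_edge = dict(WAITS_FOR)
del without_session_edge["sessionMigration@shardA"]
no_cycle = find_cycle(without_session_edge, "configsvrSetAllowMigrations@shardA")
```

In this model every edge except the last one is a legitimate dependency, which is why the proposal below targets the session check-out rather than the refresh or the migration abort.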

We should do something similar to the transaction yielder, that is, yield the session while doing remote (or possibly blocking) work.
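
A minimal sketch of the yield-around-remote-work idea, written in Python purely for illustration. It is not the server's implementation: `Session`, `yielded`, and `set_allow_migrations` are hypothetical names, and a plain lock stands in for a checked-out session.

```python
import threading
from contextlib import contextmanager


class Session:
    """Toy stand-in for a logical session that can be checked out (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()

    def check_out(self):
        self._lock.acquire()

    def check_in(self):
        self._lock.release()

    def try_check_out(self):
        # Non-blocking attempt; returns True if the session was free.
        return self._lock.acquire(blocking=False)


@contextmanager
def yielded(session):
    """Check the session in for the duration of the block, then check it out again."""
    session.check_in()
    try:
        yield
    finally:
        session.check_out()


def set_allow_migrations(session, remote_call):
    """Hold the session for local work, but yield it around the remote call."""
    session.check_out()
    try:
        # Local, non-blocking work would run here with the session held.
        with yielded(session):
            # Other users of the session (e.g. session migration) can proceed.
            result = remote_call()
        return result
    finally:
        session.check_in()
```

With this shape, anything that needs the same session during the remote call can check it out and make progress, which removes the last edge of the circular wait described above. The cost is that the command must tolerate the session state having changed by the time it checks the session out again.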



 Comments   
Comment by Githook User [ 10/May/23 ]

Author: Marcos José Grillo Ramirez (m4nti5) <marcos.grillo@mongodb.com>

Message: SERVER-76836 Yield session checked out in setAllowMigrations command before doing network request
Branch: master
https://github.com/mongodb/mongo/commit/2d19cbb4e585885e0581e89d68df1e52040a80f8
