[SERVER-66825] Deadlock on migration recipient stepdown Created: 27/May/22  Updated: 29/Oct/23  Resolved: 01/Jun/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 6.0.0-rc6
Fix Version/s: 6.0.0-rc9, 6.1.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Jordi Serra Torrens
Assignee: Jordi Serra Torrens
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0
Sprint: Sharding EMEA 2022-05-30, Sharding EMEA 2022-06-13
Participants:
Linked BF Score: 144

 Description   

On stepdown, MigrationDestinationManager::onStepDown() is called while the replication coordinator is holding the RSTL in exclusive mode. This method takes MigrationDestinationManager::_mutex.

On the other side, MigrationDestinationManager::exitCriticalSection() first takes the same mutex and only later takes the lock hierarchy (at either of two call sites), which includes the RSTL.

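This inverted lock acquisition order can cause a deadlock on stepdown when a migration is interrupted: if the two paths interleave so that the stepdown thread holds the RSTL and waits for the mutex while the migration thread holds the mutex and waits for the RSTL, neither can make progress.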
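
The inversion described above can be illustrated with a minimal, self-contained sketch. It does not reproduce the server code or the actual fix in the linked commits: `Recipient`, `rstl`, `managerMutex`, and `runInterleaving` are hypothetical stand-ins, and plain `std::mutex` objects replace the real lock manager. It shows one common remedy, assuming the state guarded by the mutex can be re-checked later: release the mutex before acquiring the top-level lock, so both paths acquire the locks in the same order (RSTL first, then the mutex).

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the migration recipient; not the real MongoDB types.
struct Recipient {
    std::mutex rstl;          // stands in for the RSTL (top of the lock hierarchy)
    std::mutex managerMutex;  // stands in for MigrationDestinationManager::_mutex
    bool aborted = false;
    bool inCriticalSection = true;

    // Stepdown path: the caller already holds rstl, then this takes managerMutex.
    void onStepDown() {
        std::lock_guard<std::mutex> lk(managerMutex);
        aborted = true;
    }

    // Migration path, reordered. The deadlock-prone shape would keep managerMutex
    // held while acquiring rstl; here the mutex is dropped first.
    void exitCriticalSection() {
        bool wasInCriticalSection;
        {
            std::lock_guard<std::mutex> lk(managerMutex);
            wasInCriticalSection = inCriticalSection;
        }  // managerMutex released before touching the lock hierarchy
        if (!wasInCriticalSection) {
            return;
        }
        std::lock_guard<std::mutex> rstlLk(rstl);      // same order as stepdown: rstl first
        std::lock_guard<std::mutex> lk(managerMutex);  // then the manager mutex
        inCriticalSection = false;
    }
};

// Runs the stepdown and migration paths concurrently, many times over.
// Returns true if every round completes with both state changes applied.
bool runInterleaving(int iterations) {
    assert(iterations > 0);
    for (int i = 0; i < iterations; ++i) {
        Recipient r;
        std::thread stepdown([&r] {
            std::lock_guard<std::mutex> rstlLk(r.rstl);  // "exclusive" RSTL held
            r.onStepDown();
        });
        std::thread migration([&r] { r.exitCriticalSection(); });
        stepdown.join();
        migration.join();
        if (!r.aborted || r.inCriticalSection) {
            return false;
        }
    }
    return true;
}
```

Calling `runInterleaving(1000)` exercises both paths concurrently. With the inverted order (mutex held while waiting for `rstl`), some rounds could hang instead of returning.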



 Comments   
Comment by Githook User [ 02/Jun/22 ]

Author: Jordi Serra Torrens <jordi.serra-torrens@mongodb.com> (jordist)

Message: SERVER-66825 Fix deadlock on MigrationDestinationManager::onStepDown

(cherry picked from commit 589d65d55b880aa28803e4b4b9f109dd9e5b60bc)
Branch: v6.0
https://github.com/mongodb/mongo/commit/60f659abea5b129f538106b1fe3f6dfc8b621ffa

Comment by Githook User [ 01/Jun/22 ]

Author: Jordi Serra Torrens <jordi.serra-torrens@mongodb.com> (jordist)

Message: SERVER-66825 Fix deadlock on MigrationDestinationManager::onStepDown
Branch: master
https://github.com/mongodb/mongo/commit/589d65d55b880aa28803e4b4b9f109dd9e5b60bc

Comment by Suganthi Mani [ 27/May/22 ]

jordi.serra-torrens@mongodb.com Just an FYI: we had similar deadlock issues in the tenant migration code due to a lock order violation between the RSTL and the instance mutex. We are planning to address that as part of PM-2660 (Primary Only Service Improvements). Currently the Service Arch team is collecting the POS pain points. I have already added the interruption logic pain point leading to deadlocks here. I am going to link this ticket in the doc as well to prioritize the interruption pain point.
