[SERVER-66825] Deadlock on migration recipient stepdown Created: 27/May/22 Updated: 29/Oct/23 Resolved: 01/Jun/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 6.0.0-rc6 |
| Fix Version/s: | 6.0.0-rc9, 6.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jordi Serra Torrens | Assignee: | Jordi Serra Torrens |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v6.0 |
| Sprint: | Sharding EMEA 2022-05-30, Sharding EMEA 2022-06-13 |
| Participants: | |
| Linked BF Score: | 144 |
| Description |
|
On stepdown, while the replication coordinator holds the RSTL (Replication State Transition Lock) in exclusive mode, MigrationDestinationManager::onStepDown() is called. This method acquires MigrationDestinationManager::_mutex. On the other side, MigrationDestinationManager::exitCriticalSection() first acquires the same mutex and later, in one of two places, acquires the lock hierarchy, which includes the RSTL. This inverted lock acquisition order can deadlock on stepdown when a migration is interrupted and a particular interleaving occurs: the stepdown thread holds the RSTL and waits for _mutex, while the migration thread holds _mutex and waits for the RSTL. |
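The inversion described above can be illustrated without actually hanging a process. The following is a minimal sketch, not MongoDB code: `LockOrderGraph`, `reportedOrder()` and `consistentOrder()` are hypothetical names introduced here, and the lock names are plain strings. It records "held while acquiring" edges for each code path and checks whether they form a cycle. The second scenario assumes that exitCriticalSection() releases the mutex before taking the lock hierarchy; this is only one possible way to restore a consistent order and is not necessarily what the committed fix does.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: tracks which locks were held when another lock was
// acquired, and reports whether those orderings contain a cycle.
class LockOrderGraph {
public:
    // Record that `next` was acquired while every lock in `held` was held.
    void recordAcquire(const std::vector<std::string>& held, const std::string& next) {
        for (const auto& h : held) {
            edges_[h].insert(next);
        }
        edges_[next];  // make sure the node exists even without outgoing edges
    }

    // A cycle means two paths take the same locks in opposite orders,
    // which is the precondition for a deadlock.
    bool hasCycle() const {
        std::set<std::string> done;
        std::set<std::string> onStack;
        for (const auto& entry : edges_) {
            if (dfs(entry.first, done, onStack)) return true;
        }
        return false;
    }

private:
    bool dfs(const std::string& node,
             std::set<std::string>& done,
             std::set<std::string>& onStack) const {
        if (onStack.count(node)) return true;  // back edge: cycle found
        if (done.count(node)) return false;    // already fully explored
        onStack.insert(node);
        auto it = edges_.find(node);
        if (it != edges_.end()) {
            for (const auto& next : it->second) {
                if (dfs(next, done, onStack)) return true;
            }
        }
        onStack.erase(node);
        done.insert(node);
        return false;
    }

    std::map<std::string, std::set<std::string>> edges_;
};

// Acquisition order as reported in the ticket.
inline LockOrderGraph reportedOrder() {
    LockOrderGraph g;
    // Stepdown: RSTL held exclusively, then onStepDown() takes _mutex.
    g.recordAcquire({"RSTL"}, "_mutex");
    // exitCriticalSection(): _mutex held, then the lock hierarchy (incl. RSTL).
    g.recordAcquire({"_mutex"}, "RSTL");
    return g;
}

// Assumed alternative: exitCriticalSection() drops _mutex before taking RSTL.
inline LockOrderGraph consistentOrder() {
    LockOrderGraph g;
    g.recordAcquire({"RSTL"}, "_mutex");  // stepdown path is unchanged
    g.recordAcquire({}, "_mutex");        // take _mutex, read state, release it
    g.recordAcquire({}, "RSTL");          // then take RSTL with nothing held
    return g;
}
```

Calling `reportedOrder().hasCycle()` flags the inversion, while `consistentOrder().hasCycle()` does not, because no path acquires the RSTL while holding the mutex.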
| Comments |
| Comment by Githook User [ 02/Jun/22 ] |
|
Author: {'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'} Message: (cherry picked from commit 589d65d55b880aa28803e4b4b9f109dd9e5b60bc) |
| Comment by Githook User [ 01/Jun/22 ] |
|
Author: {'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'} Message: |
| Comment by Suganthi Mani [ 27/May/22 ] |
|
jordi.serra-torrens@mongodb.com Just an FYI: we had similar deadlock issues in the tenant migration code due to a lock order violation between the RSTL and the instance mutex. We are planning to address that as part of PM-2660 (Primary Only Service Improvements). Currently the Service Arch team is collecting the POS pain points. I have already added the interruption logic pain point, which leads to deadlocks, to the doc, and I am going to link this ticket there as well to help prioritize it. |