[SERVER-59965] Distributed deadlock between renameCollection and multi-shard transaction Created: 15/Sep/21 Updated: 29/Oct/23 Resolved: 25/Oct/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 5.2.0, 5.0.4, 5.1.0-rc3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jordi Serra Torrens | Assignee: | Jordi Serra Torrens |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v5.1, v5.0 |
| Steps To Reproduce: | |
| Sprint: | Sharding EMEA 2021-09-20, Sharding EMEA 2021-10-04, Sharding EMEA 2021-10-18, Sharding EMEA 2021-11-01 |
| Participants: | |
| Description |
|
As part of a sharded renameCollection, the DDLCoordinator instructs all participant shards to enter their critical sections. Once all shards have entered, the coordinator does some work on the configsvr and finally tells the shards to leave their critical sections.
When renameCollection runs concurrently with multi-shard transactions that affect the same collection, a particular interleaving can lead to a distributed deadlock:
At this point we are deadlocked:
More generally, I believe this situation can occur in any DDL operation that needs to acquire the critical section on several nodes at the same time. I believe that resharding may also be affected by this. |
| Comments |
| Comment by Githook User [ 28/Oct/21 ] |
|
Author: {'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'} Message: (cherry picked from commit 02add56a2100bef135281938a0cadaf374279f03) |
| Comment by Githook User [ 28/Oct/21 ] |
|
Author: {'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'} Message: (cherry picked from commit 02add56a2100bef135281938a0cadaf374279f03) |
| Comment by Githook User [ 25/Oct/21 ] |
|
Author: {'name': 'Jordi Serra Torrens', 'email': 'jordi.serra-torrens@mongodb.com', 'username': 'jordist'} Message: |
| Comment by Jordi Serra Torrens [ 16/Sep/21 ] |
|
The proposal is to resolve the deadlock by skipping this refresh (which blocks behind the critical section) when we are in a transaction and the critical section is taken. The StaleConfig error will then be propagated to the client with a TransientTransactionError label, so the transaction will be retried. |
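
The proposal above can be sketched as a toy model. This is only an illustration of the decision described in the comment, not the server implementation: `ShardModel`, `handle_stale_config` and `run_transaction_with_retry` are hypothetical names, the `StaleConfig` class is a stand-in for the server error, and the sketch assumes the DDL operation releases the critical section before the client retries.

```python
from dataclasses import dataclass


class StaleConfig(Exception):
    """Hypothetical stand-in for the server's StaleConfig error."""

    def __init__(self, message, error_labels=()):
        super().__init__(message)
        self.error_labels = tuple(error_labels)


@dataclass
class ShardModel:
    """Toy model of one shard; not real server state."""

    critical_section_held: bool = False
    refreshes: int = 0

    def refresh_metadata(self):
        # In the server this refresh would block behind the critical section.
        # The toy model raises instead of blocking so the sketch stays runnable.
        if self.critical_section_held:
            raise RuntimeError("refresh would block behind the critical section")
        self.refreshes += 1


def handle_stale_config(shard, in_transaction):
    """Skip the refresh when in a transaction and the critical section is taken."""
    if in_transaction and shard.critical_section_held:
        # Fail fast with a retryable label instead of waiting on the refresh.
        raise StaleConfig(
            "critical section held; refresh skipped",
            error_labels=["TransientTransactionError"],
        )
    shard.refresh_metadata()


def run_transaction_with_retry(shard, max_attempts=3):
    """Client-side retry loop; returns the attempt number that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            handle_stale_config(shard, in_transaction=True)
            return attempt
        except StaleConfig as exc:
            if "TransientTransactionError" not in exc.error_labels:
                raise
            # Assumption: the DDL operation finishes and releases the
            # critical section before the next attempt.
            shard.critical_section_held = False
    raise RuntimeError("transaction did not succeed within max_attempts")
```

In this model the transaction never waits on the refresh while the critical section is held, so it cannot take part in a wait cycle with the DDL coordinator; the cost is that the client sees a transient error and has to retry.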