[SERVER-48641] Deadlock due to the MigrationDestinationManager waiting for write concern with the session checked-out Created: 08/Jun/20 Updated: 29/Oct/23 Resolved: 16/Jul/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.4.0-rc8 |
| Fix Version/s: | 4.4.1, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Jack Mulrow |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | KP44 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | Sharding 2020-07-13, Sharding 2020-07-27 |
| Participants: | |
| Linked BF Score: | 40 |
| Description |
|
The MigrationDestinationManager checks out a session and then executes the recipient logic while that session remains checked out. At some point the execution logic reaches a call to waitForWriteConcern, which runs with the session still checked out. Because the JournalFlusher wait is non-interruptible (and also because
|
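The hazard described above, blocking on a wait while still holding a checked-out session, can be sketched with a toy model. This is not MongoDB code: `Session`, `run_recipient`, and `other_party` are hypothetical names, and the second party to the deadlock is an assumption, since the description is cut off before naming it. The sketch assumes some other thread must check out the same session before the write-concern wait can be satisfied, and it uses timeouts only so the example terminates; the wait in the ticket is described as non-interruptible.

```python
import threading


class Session:
    """Hypothetical stand-in for a session that only one thread may check out."""

    def __init__(self):
        self._lock = threading.Lock()

    def check_out(self, timeout=-1):
        # Returns True once the session is checked out, False if the timeout expires.
        return self._lock.acquire(timeout=timeout)

    def check_in(self):
        self._lock.release()


def run_recipient(release_before_wait, timeout=0.5):
    """Return True if the simulated write-concern wait was satisfied."""
    session = Session()
    write_concern_satisfied = threading.Event()

    def other_party():
        # Assumption: this thread needs the session before it can do the work
        # that lets the write-concern wait complete.
        if session.check_out(timeout=timeout):
            session.check_in()
            write_concern_satisfied.set()

    session.check_out()  # recipient logic starts with the session checked out
    worker = threading.Thread(target=other_party)
    worker.start()

    if release_before_wait:
        session.check_in()  # check the session in before blocking

    # Stand-in for waitForWriteConcern; the timeout exists only to end the demo.
    satisfied = write_concern_satisfied.wait(timeout=timeout)

    if not release_before_wait:
        session.check_in()  # too late: the wait has already given up
    worker.join()
    return satisfied


if __name__ == "__main__":
    print("wait with session held:    ", run_recipient(release_before_wait=False))
    print("wait after checking it in: ", run_recipient(release_before_wait=True))
```

With the session held across the wait, neither side can make progress until a timeout fires; checking the session in first lets the other thread proceed. Whether the actual fix takes this shape is not stated in the text above.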
| Comments |
| Comment by Githook User [ 12/Aug/20 ] |
|
Author: {'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'} Message: (cherry picked from commit 21b083c7352704fc8c3d8a4f33c54040259ff766) |
| Comment by Tess Avitabile (Inactive) [ 27/Jul/20 ] |
|
Great, thank you! |
| Comment by Esha Maharishi (Inactive) [ 27/Jul/20 ] |
|
kaloian.manassiev yes, they are two different deadlocks that had the same root cause. Both deadlocks should only have existed on 4.4, since they were due to code introduced to the MigrationDestinationManager in 4.4. |
| Comment by Kaloian Manassiev [ 27/Jul/20 ] |
|
This specific bug is entirely new for 4.4, so it has no effect on 4.2. I don't understand the difference between it and esha.maharishi, I think Jack is on vacation - can you confirm my understanding? |
| Comment by Tess Avitabile (Inactive) [ 27/Jul/20 ] |
|
Does this affect 4.2? We need to backport |
| Comment by Githook User [ 16/Jul/20 ] |
|
Author: {'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'} Message: |
| Comment by Jack Mulrow [ 08/Jul/20 ] |
|
alex.taskov, matthew.saltz, what do you think of the following proposed fix? (Tagging you both since you were on the resumable range deleter project. Also CC esha.maharishi for when she's back from vacation.) As far as I can tell, we check out the session for the entire recipient logic for two reasons. First, to detect when a _recvChunkStart begins after a migration has already finished and been cleaned up (due to a split brain). Second, so that when the transaction number on the recipient is advanced as part of recovering a migration, it can only be advanced before or after all of the recipient logic; the recovery can then safely delete the range deletion document on the recipient and trigger a range deletion (otherwise orphans from an active cloning phase might be inserted after the deletion). Am I missing any reasons? If that's true, then I think we can fix this problem (and
What do you guys think? I slightly prefer approach 1), since I expect it would be easier to implement, although it might be trickier to test. |