[SERVER-55573] Deadlock between stepdown and chunk migration Created: 26/Mar/21 Updated: 29/Oct/23 Resolved: 25/May/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.7, 5.0.0-rc1, 5.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Russotto | Assignee: | Pierlauro Sciarelli |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-wfbf-sprint | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v5.0, v4.4 |
| Participants: | |
| Linked BF Score: | 124 |
| Description |
|
When this deadlock occurs, the MigrationDestinationManager is holding the session checked out in what it calls "outerOpCtx". It then dispatches other threads, with their own opCtxs, to do work on its behalf (in _migrateDriver()). Those opCtxs will not be killed by killSessions, because they do not have the session checked out. So "outerOpCtx" holds the session but is not otherwise in use; in fact, it is not even on a thread, because an AlternativeClientRegion has been used.

Stepdown kills all user operations, all system operations marked killable on stepdown, and all sessions. Meanwhile, _migrateDriver() (in either cloneDocuments or _applyMigrateOp) has created a new operation. Now we are stuck: "outerOpCtx" does not receive the kill because it has been swapped off its thread, and the new operation does not receive the kill because it is not associated with the session. The new operation gets stuck waiting for the RSTL, the stepdown thread gets stuck waiting for the session to be checked in, and we have a deadlock.

I can see a few ways to fix this:

- Officially allow opCtxs to do work on behalf of a session they did not check out; they would then get kills delivered to them (and assigning an opCtx to an already-killed session would auto-kill it). The accounting might get ugly.
- Do what PrimaryOnlyService does, which is essentially the same thing done "manually": register each opCtx created during migration somewhere. Then "outerOpCtx", instead of being swapped out, waits for a kill; when it receives one, it kills all registered opCtxs.
- Have the kill loop in stepdown time out if a session is not killed in time, and loop back to kill the operations again. This is inelegant, though, and runs the risk of livelock.
|
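The three-party wait cycle described above can be modeled with plain locks. Below is a minimal, hypothetical Python sketch (the names `session`, `rstl`, `stepdown`, and `inner_op` are stand-ins for the server primitives, not real MongoDB APIs), with timeouts added so the demo terminates instead of hanging the way the real deadlock does:

```python
import threading

def demo_deadlock(timeout=0.2):
    """Sketch of the wait cycle; returns which parties got stuck."""
    session = threading.Lock()  # models the migration session "outerOpCtx" holds
    rstl = threading.Lock()     # models the Replication State Transition Lock
    results = {}

    # Migration outer context: checks out the session and is then swapped
    # off its thread (AlternativeClientRegion), so no kill can reach it.
    session.acquire()

    def stepdown():
        # Stepdown: takes the RSTL, kills sessions, then waits for the
        # session to be checked back in. In the real bug this wait never
        # ends; here it times out so the demo terminates.
        rstl.acquire()
        results["stepdown_got_session"] = session.acquire(timeout=2 * timeout)
        if results["stepdown_got_session"]:
            session.release()
        rstl.release()

    def inner_op():
        # Inner migration opCtx (e.g. _applyMigrateOp): blocks on the RSTL
        # that stepdown holds; the kill misses it because it does not have
        # the session checked out.
        results["inner_got_rstl"] = rstl.acquire(timeout=timeout)
        if results["inner_got_rstl"]:
            rstl.release()

    t1 = threading.Thread(target=stepdown)
    t1.start()
    threading.Event().wait(timeout / 4)  # let stepdown take the RSTL first
    t2 = threading.Thread(target=inner_op)
    t2.start()
    t1.join()
    t2.join()
    session.release()
    return results
```

In the real server neither wait has a timeout, so all three parties block forever: both acquisitions above fail, which is the deadlock.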
| Comments |
| Comment by Githook User [ 23/Jun/21 ] |
|
Author: {'name': 'Cheahuychou Mao', 'email': 'mao.cheahuychou@gmail.com', 'username': 'cheahuychou'} Message: |
| Comment by Githook User [ 01/Jun/21 ] |
|
Author: {'name': 'Pierlauro Sciarelli', 'email': 'pierlauro.sciarelli@mongodb.com', 'username': 'pierlauro'} Message: |
| Comment by Githook User [ 25/May/21 ] |
|
Author: {'name': 'Pierlauro Sciarelli', 'email': 'pierlauro.sciarelli@mongodb.com', 'username': 'pierlauro'} Message: |
| Comment by Esha Maharishi (Inactive) [ 05/Apr/21 ] |
|
This was a good find, thank you matthew.russotto. I wanted to mention that shortly after releasing 4.4, we fixed two similar bugs, but we missed the place where _migrateDriver waits for the inserterThread to join.

The same fix (having the _migrateDriver thread check in the session on "outerOpCtx" while blocking) will probably work, but it is a brittle solution because every such hole needs to be plugged individually. It may be worth finding a better solution.

CC jack.mulrow, since we worked on the current solution together. |
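The "check in the session while blocking" mitigation described above can be sketched as a scope guard. This is a hypothetical Python illustration (the helper names are invented, not the server's actual API): the blocking thread releases the session for the duration of its wait, so a concurrent stepdown's killSessions can claim it, and checks it back out afterwards.

```python
import threading
from contextlib import contextmanager

@contextmanager
def session_checked_in(session_lock):
    """Check the session in for the duration of a blocking wait."""
    session_lock.release()      # check in: stepdown can now reach the session
    try:
        yield
    finally:
        session_lock.acquire()  # check back out once the wait is over

def wait_with_session_checked_in(session_lock, worker_thread):
    # The pattern Esha describes: instead of holding the session across the
    # join (which stalls stepdown), give it up while blocked.
    with session_checked_in(session_lock):
        worker_thread.join()
```

The brittleness she points out is visible in the sketch: every blocking wait performed while the session is checked out needs to be individually wrapped this way, and any missed wait reintroduces the deadlock.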