[SERVER-55573] Deadlock between stepdown and chunk migration Created: 26/Mar/21  Updated: 29/Oct/23  Resolved: 25/May/21

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: 4.4.7, 5.0.0-rc1, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Matthew Russotto Assignee: Pierlauro Sciarelli
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-sprint
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-57709 Make MigrationDestinationManager's in... Closed
related to SERVER-57756 Race between concurrent stepdowns and... Closed
related to SERVER-60161 Deadlock between config server stepdo... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0, v4.4
Participants:
Linked BF Score: 124

 Description   

When this deadlock occurs, the MigrationDestinationManager is holding the session checked out in what it calls "outerOpCtx". It then dispatches other threads with other opCtxs to do work on its behalf (in _migrateDriver()). Those opCtxs will not be killed by killSessions, because they do not have the session checked out. The sequence of events is:

1. outerOpCtx holds the session, but is not being used otherwise. In fact, it's not on a thread, because an AlternativeClientRegion has been used.

2. Stepdown kills all user operations and all system operations marked to be killable on stepdown.

3. _migrateDriver() (either cloneDocuments or _applyMigrateOp) creates a new operation.

4. Stepdown kills all sessions.

But now we're stuck – the outerOpCtx doesn't receive the kill because it's swapped out of its thread. The new operation doesn't receive the kill because it's not associated with the session. The new operation gets stuck waiting for the RSTL, the stepdown thread gets stuck waiting for the session to be checked in, and we've got deadlock.
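The ordering above can be illustrated with a toy model. This is only a sketch in Python, not MongoDB code: the OpCtx class, its fields, and the two stepdown_* functions are hypothetical stand-ins that track who is asked to die and who is able to notice.

```python
from dataclasses import dataclass


@dataclass
class OpCtx:
    """Hypothetical stand-in for an operation context (not the server class)."""
    name: str
    holds_session: bool = False   # has the session checked out
    on_thread: bool = True        # False once swapped out via AlternativeClientRegion
    kill_requested: bool = False

    def observes_kill(self) -> bool:
        # A kill only has an effect if some thread is running this opCtx.
        return self.kill_requested and self.on_thread


def stepdown_kill_operations(ops):
    """First pass: mark every operation that exists right now."""
    for op in ops:
        op.kill_requested = True


def stepdown_kill_sessions(ops):
    """Second pass: only operations that have a session checked out are marked."""
    for op in ops:
        if op.holds_session:
            op.kill_requested = True


def simulate():
    # Step 1: outerOpCtx holds the session but is swapped out of its thread.
    outer = OpCtx("outerOpCtx", holds_session=True, on_thread=False)
    ops = [outer]
    # Step 2: stepdown kills the operations that currently exist.
    stepdown_kill_operations(ops)
    # Step 3: _migrateDriver() creates a new operation after that pass.
    worker = OpCtx("cloneDocuments")
    ops.append(worker)
    # Step 4: stepdown kills all sessions.
    stepdown_kill_sessions(ops)
    session_released = outer.observes_kill()
    worker_interrupted = worker.observes_kill()
    return {
        "session_released": session_released,
        "worker_interrupted": worker_interrupted,
        "deadlock": not session_released and not worker_interrupted,
    }
```

In this model neither party notices the kill: the session holder is off-thread and the late worker was never marked, which corresponds to the stuck state described above.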

I can see a few ways to fix this. One way would be to officially allow opCtxs to do work on behalf of a session they didn't have checked out; they would then get kills delivered to them (and assigning an opCtx to an already-killed session would auto-kill it). The accounting might get ugly. We could also do something like PrimaryOnlyService does, which is basically the same only "manually" – register each opCtx created during migration somewhere. Then the outerOpCtx, instead of being swapped out, is waiting for a kill. When it gets it, it kills all registered opCtxs.
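The "manual" registration idea could look roughly like the following. This is a minimal Python sketch under assumptions, not the PrimaryOnlyService implementation: MigrationOpRegistry, OpCtx, and all method names are hypothetical. It also borrows the auto-kill behaviour from the first option, so an opCtx registered after the kill is killed immediately.

```python
import threading


class OpCtx:
    """Hypothetical stand-in for an operation context."""

    def __init__(self, name):
        self.name = name
        self.killed = threading.Event()

    def kill(self):
        self.killed.set()


class MigrationOpRegistry:
    """Tracks opCtxs created on behalf of a migration (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ops = set()
        self._killed = False

    def register(self, op):
        with self._lock:
            if self._killed:
                # Registered too late: kill right away so no operation
                # can slip in between the kill and its own creation.
                op.kill()
            else:
                self._ops.add(op)
        return op

    def unregister(self, op):
        with self._lock:
            self._ops.discard(op)

    def kill_all(self):
        """Called by the outer opCtx when it receives its own kill."""
        with self._lock:
            self._killed = True
            ops = list(self._ops)
            self._ops.clear()
        for op in ops:
            op.kill()
```

The point of the `_killed` flag is to close the window from step 3 of the description: an operation created after the kill pass still ends up killed.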

Or we could have the kill loop in shutdown time out if a session isn't killed in time, and loop back and kill the operations again. This is inelegant and runs the risk of livelock, though.
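A rough Python sketch of that retry loop follows. The function name, its parameters, and the max_rounds bound are all assumptions made for illustration; the bound is one possible way to turn a potential livelock into a visible failure and is not part of the proposal itself.

```python
import threading


def kill_until_checked_in(kill_operations, session_checked_in,
                          timeout_s=0.05, max_rounds=10):
    """Repeat the kill pass until the session is checked in.

    kill_operations: callable that kills the operations existing right now.
    session_checked_in: threading.Event set when the session is checked in.
    Returns the number of kill rounds that were needed.
    """
    for round_no in range(1, max_rounds + 1):
        kill_operations()
        # Wait a bounded time rather than forever, then loop back and
        # kill again to catch operations created since the last pass.
        if session_checked_in.wait(timeout_s):
            return round_no
    raise TimeoutError("session still checked out after %d kill rounds" % max_rounds)
```

Each extra round catches operations that were created after the previous pass, but nothing stops new ones from appearing again, which is the livelock risk mentioned above.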



 Comments   
Comment by Githook User [ 23/Jun/21 ]

Author: Cheahuychou Mao <mao.cheahuychou@gmail.com> (cheahuychou)

Message: SERVER-55573 Deadlock between stepdown and chunk migration
Branch: v4.4
https://github.com/mongodb/mongo/commit/fbdfaa2530248b18b4327527f08d83eb283f67a2

Comment by Githook User [ 01/Jun/21 ]

Author: Pierlauro Sciarelli <pierlauro.sciarelli@mongodb.com> (pierlauro)

Message: SERVER-55573 Deadlock between stepdown and chunk migration (BACKPORT-9251)
Branch: v5.0
https://github.com/mongodb/mongo/commit/6b50fd7d4bbdf4bb4c9a8dde055aa531f0780191

Comment by Githook User [ 25/May/21 ]

Author: Pierlauro Sciarelli <pierlauro.sciarelli@mongodb.com> (pierlauro)

Message: SERVER-55573 Deadlock between stepdown and chunk migration
Branch: master
https://github.com/mongodb/mongo/commit/284ecabb7ec2d82cfc0f4b31090df4cfeb4c99b6

Comment by Esha Maharishi (Inactive) [ 05/Apr/21 ]

This was a good find, thank you matthew.russotto. I wanted to mention that shortly after releasing 4.4, we fixed two similar bugs (SERVER-48689, SERVER-48641) by making the _migrateDriver thread check in the session on "outerOpCtx" in the two places we had found that the _migrateDriver thread blocks on other threads.

We missed the place where _migrateDriver waits for the inserterThread to join. The same fix (having the _migrateDriver thread check in the session on "outerOpCtx" while blocking) will probably work, but it's a brittle solution because every such hole needs to be plugged individually. It may be worth finding a better solution.
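The "check in while blocking" approach can be sketched as a scoped guard. This is a Python illustration only: Session, checked_in_while_blocking, and join_inserter are hypothetical names, and checking the session back out afterwards is an assumption about how such a guard would behave.

```python
import threading
from contextlib import contextmanager


class Session:
    """Hypothetical stand-in for the session held via outerOpCtx."""

    def __init__(self):
        self.checked_out = True

    def check_in(self):
        self.checked_out = False

    def check_out(self):
        self.checked_out = True


@contextmanager
def checked_in_while_blocking(session):
    # Release the session before blocking so a concurrent kill of all
    # sessions does not have to wait on this thread.
    session.check_in()
    try:
        yield
    finally:
        # Assumption: the session is checked out again once unblocked.
        session.check_out()


def join_inserter(session, inserter_thread):
    """One blocking site: waiting for the inserter thread to join."""
    with checked_in_while_blocking(session):
        inserter_thread.join()
```

The brittleness is visible in the shape of the code: every blocking call has to be wrapped by hand, and any unwrapped one reopens the deadlock.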

CC jack.mulrow, since we worked on the current solution together.

Generated at Thu Feb 08 05:36:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.