[SERVER-57756] Race between concurrent stepdowns and applying transaction oplog entry Created: 16/Jun/21  Updated: 29/Oct/23  Resolved: 07/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.0.2, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Samyukta Lanka Assignee: Wenbin Zhu
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-58143 shardsvrDropCollectionParticipant sho... Closed
Duplicate
is duplicated by SERVER-57545 Stepping down while stepping up with ... Closed
Problem/Incident
is caused by SERVER-52564 Deadlock between step down and MongoD... Closed
Related
related to SERVER-59226 Deadlock when stepping down with a pr... Closed
is related to SERVER-55007 Deadlock between step down and MongoD... Closed
is related to SERVER-55573 Deadlock between stepdown and chunk m... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0, v4.4, v4.2
Sprint: Repl 2021-06-28, Repl 2021-07-12
Participants:
Case:
Linked BF Score: 120

 Description   

If a node is stepping down in multiple threads at once, one thread can lag behind the other stepdown thread and still be running when the node starts applying oplog entries as a secondary. In this case, if the remaining stepdown thread still has ScopedBlockSessionCheckouts in scope, it will block checking out sessions, which could cause us to fail to apply a commitTransaction oplog entry, ultimately triggering this fassert.



 Comments   
Comment by Githook User [ 21/Jul/21 ]

Author:

{'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}

Message: SERVER-57756 Remove ScopedBlockSessionCheckouts which is now unnecessary and causing race with concurrent stepdowns.

(cherry picked from commit 8588a5a3a52f17026b1e5a21c00e815fbb702a7c)
Branch: v5.0
https://github.com/mongodb/mongo/commit/cfa42ae2c4da873b42569921f09f4cee3a3cc12e

Comment by Githook User [ 07/Jul/21 ]

Author:

{'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}

Message: SERVER-57756 Remove ScopedBlockSessionCheckouts which is now unnecessary and causing race with concurrent stepdowns.
Branch: master
https://github.com/mongodb/mongo/commit/8588a5a3a52f17026b1e5a21c00e815fbb702a7c

Comment by Wenbin Zhu [ 01/Jul/21 ]

After talking to the sharding team, I think we can revert SERVER-52564 once SERVER-58143 is done.

Also, as mentioned in the previous comment, I think the 4.2/4.4 workaround is only best effort because we can construct sequences that still produce the deadlock. The workaround uses inserterOpCtx->setAlwaysInterruptAtStepDownOrUp(), but inserterOpCtx is not the one that checked out the session that stepdown is waiting on (recall this is a 3-way deadlock). Only the OpCtx that checked out a session will be marked killed by the stepdown thread. So if stepdown is run after this check, the 3-way deadlock can still happen. This is regardless of whether we backport SERVER-52564 to 4.2/4.4. So I think we can only do the backport after we introduce CancelableOperationContext to 4.2/4.4; it currently exists only in 5.0+.

Comment by Wenbin Zhu [ 28/Jun/21 ]

renctan I think with CancelableOpCtx, the deadlock cannot happen because before the stepdown thread waits for session checkout, it marks the OpCtx as killed, and a CancelableOpCtx will also be killed/interrupted if it was created with a cancellation token from the OpCtx that is marked killed. So when the migration thread wants to acquire the RSTL while holding the session, it will fail because its (cancelable) OpCtx is marked killed with error code `InterruptedDueToReplStateChange`, breaking the deadlock. But drop collection is not interruptible, which brings the deadlock back.

Comment by Randolph Tan [ 28/Jun/21 ]

Hm... Wouldn't CancelableOpCtx also run into the same issue you mentioned about the stepdown thread having to wait for checked-out sessions? Since the stepdown thread can't check out the session, it won't be able to kill it.

Comment by Wenbin Zhu [ 28/Jun/21 ]

Because of this issue, we started re-investigating BF-19260, which led to SERVER-52564, the change that caused this issue. It turns out to be a 3-way deadlock: the stepdown thread takes the RSTL and waits for a session checkout in order to kill it, but the session is checked out by the migration thread, which spawns another OpCtx/thread to do chunk insertion and waits for the inserter thread to finish. However, the inserter thread needs to take the RSTL, so a 3-way deadlock forms: even though the stepdown thread interrupted/killed the OpCtx it is waiting on (the migration thread), the inserter thread is not interrupted/killed because it was newly spawned.

This deadlock was also discovered in SERVER-55573 due to another BF and fixed by using a CancelableOperationContext, so that the inserter thread is also interrupted/killed when the migration thread is interrupted/killed, thus breaking the deadlock.
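
To illustrate the idea, here is a minimal standalone sketch of cancellation propagating from a parent operation to a worker it spawned. It is not the MongoDB CancelableOperationContext API: CancelSource, CancelToken, acquireOrInterrupt and runScenario are hypothetical names, and a std::timed_mutex stands in for the RSTL.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical token: a read-only view of a shared cancellation flag.
class CancelToken {
public:
    explicit CancelToken(std::shared_ptr<std::atomic<bool>> flag) : _flag(std::move(flag)) {}
    bool isCanceled() const { return _flag->load(); }

private:
    std::shared_ptr<std::atomic<bool>> _flag;
};

// Hypothetical source owned by the parent (migration) operation.
class CancelSource {
public:
    CancelSource() : _flag(std::make_shared<std::atomic<bool>>(false)) {}
    void cancel() { _flag->store(true); }
    CancelToken token() const { return CancelToken(_flag); }

private:
    std::shared_ptr<std::atomic<bool>> _flag;
};

enum class LockResult { Acquired, Interrupted };

// The worker retries the lock in short slices and gives up once its token is canceled.
LockResult acquireOrInterrupt(std::timed_mutex& lock, const CancelToken& token) {
    while (!token.isCanceled()) {
        if (lock.try_lock_for(std::chrono::milliseconds(1))) {
            return LockResult::Acquired;
        }
    }
    return LockResult::Interrupted;
}

// Replays the scenario: "stepdown" holds the lock and kills the parent operation;
// the spawned inserter shares the parent's token, so it stops waiting.
LockResult runScenario() {
    std::timed_mutex rstl;   // stand-in for the RSTL
    CancelSource migration;  // parent operation that spawns the inserter
    rstl.lock();             // stepdown holds the lock for the whole scenario

    LockResult inserterResult = LockResult::Acquired;
    std::thread inserter([&] {
        inserterResult = acquireOrInterrupt(rstl, migration.token());
    });

    migration.cancel();  // killing the parent also cancels the child's token
    inserter.join();     // returns because the inserter was interrupted
    rstl.unlock();
    return inserterResult;
}
```

Without the shared token, the inserter in this sketch would wait on the lock for as long as the stepdown side holds it, which mirrors the third leg of the deadlock described above.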

That fix made us rethink whether SERVER-52564 is still needed. I think that SERVER-52564 only reduces the likelihood of the initial problem, because another thread could check out the session before session checkout is disallowed, and when the stepdown thread is killing the operations after it acquires the RSTL, it can still be blocked waiting for a session checkout in order to kill it. So we think SERVER-52564 only reduces the likelihood of the issue and can be reverted thanks to the CancelableOperationContext fix.

I ran some patch builds after reverting SERVER-52564, and due to the increased likelihood, another deadlock showed up. This time the stepdown thread is waiting for the session checked out by ShardsvrDropCollectionParticipantCommand, which is taking the RSTL to finish drop collection. However, drop collection is not interruptible, so even the CancelableOperationContext fix cannot break the deadlock. So my question is: does drop collection need to be uninterruptible? If it does, then we have to think of other solutions to break the deadlock.

renctan pierlauro.sciarelli any thoughts on this?

Comment by Wenbin Zhu [ 22/Jun/21 ]

I think there is another problem with ScopedBlockSessionCheckouts in the case of concurrent stepdowns.

  1. Thread 1 enters the ScopedBlockSessionCheckouts block and sets _checkoutAllowed to false.
  2. Thread 2 also enters this block and sets _checkoutAllowed to false again.
  3. Thread 1 exits this block and sets _checkoutAllowed to true.

Now, even though thread 2 is still inside this block, we allow checking out sessions again because step 3 resets the flag, which defeats the purpose of ScopedBlockSessionCheckouts. If this happens, the deadlock problem in SERVER-52564 (which is why ScopedBlockSessionCheckouts was introduced in the first place) might happen again.

This problem seems to have a simple fix by using a counter instead of the boolean value for _checkoutAllowed, but we might need a cleaner solution that can handle both problems.
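
As a rough illustration of the counter idea, the sketch below replays the three steps above against two stand-in gates: one with a boolean flag and one with a counter. These are not the real MongoDB classes; BooleanGate, CountingGate, ScopedBlock and checkoutAllowedAfterStep3 are hypothetical names, and the two "threads" are replayed sequentially to make the interleaving deterministic.

```cpp
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical model of the boolean behavior: the last writer wins.
class BooleanGate {
public:
    void block() { std::lock_guard<std::mutex> lk(_mutex); _checkoutAllowed = false; }
    void unblock() { std::lock_guard<std::mutex> lk(_mutex); _checkoutAllowed = true; }
    bool checkoutAllowed() const { std::lock_guard<std::mutex> lk(_mutex); return _checkoutAllowed; }

private:
    mutable std::mutex _mutex;
    bool _checkoutAllowed = true;
};

// Hypothetical counter variant: checkout is allowed only when no blocker remains.
class CountingGate {
public:
    void block() { std::lock_guard<std::mutex> lk(_mutex); ++_blockers; }
    void unblock() {
        std::lock_guard<std::mutex> lk(_mutex);
        assert(_blockers > 0);
        --_blockers;
    }
    bool checkoutAllowed() const { std::lock_guard<std::mutex> lk(_mutex); return _blockers == 0; }

private:
    mutable std::mutex _mutex;
    int _blockers = 0;
};

// RAII guard in the spirit of ScopedBlockSessionCheckouts (simplified stand-in).
template <typename Gate>
class ScopedBlock {
public:
    explicit ScopedBlock(Gate& gate) : _gate(gate) { _gate.block(); }
    ~ScopedBlock() { _gate.unblock(); }
    ScopedBlock(const ScopedBlock&) = delete;
    ScopedBlock& operator=(const ScopedBlock&) = delete;

private:
    Gate& _gate;
};

// Replays steps 1-3 and reports whether checkout is allowed while "thread 2"
// is still inside its block.
template <typename Gate>
bool checkoutAllowedAfterStep3() {
    Gate gate;
    auto thread1 = std::make_unique<ScopedBlock<Gate>>(gate);  // step 1
    ScopedBlock<Gate> thread2(gate);                           // step 2
    thread1.reset();                                           // step 3
    return gate.checkoutAllowed();
}
```

With the boolean gate, checkoutAllowedAfterStep3 reports that checkout is allowed even though the second guard is still alive; with the counting gate it stays blocked until the last guard exits. This addresses only the flag-reset problem, not the lagging stepdown thread from the description.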

Comment by Steven Vannelli [ 16/Jun/21 ]

samy.lanka I'm going through and requesting 5.0 Backports for all 5.0 Hot BFs in WFBF. Feel free to update / remove post-rc0 label as you see fit.

Generated at Thu Feb 08 05:42:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.