[SERVER-81226] Deadlock between _commitSplitPreparedTxnOnPrimary and stepDown Created: 19/Sep/23  Updated: 29/Oct/23  Resolved: 05/Oct/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.2.0-rc0, 7.0.3

Type: Bug
Priority: Major - P3
Reporter: Lingzhi Deng
Assignee: Wenbin Zhu
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-79950 Fix commitPreparedTransaction to not ... Closed
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0
Sprint: Repl 2023-10-02, Repl 2023-10-16
Participants:
Linked BF Score: 120

 Description   

When we loop through all split sessions, it is possible that some splitOpCtx may successfully acquire the RSTL while others are blocked behind stepDown (e.g. when stepDown happens in the middle of this loop). SERVER-79950 added UninterruptibleLockGuard to all splitOpCtx and so _commitSplitPreparedTxnOnPrimary is not interruptible by stepDown. In the case mentioned above, if some but not all of the splitOpCtx in _commitSplitPreparedTxnOnPrimary already hold RSTL in IX mode, it will block stepDown while not being able to make progress. Thus, we have a deadlock.
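
The scenario above can be illustrated with a toy model. This is a minimal sketch, not MongoDB's lock manager: `FifoLock`, `simulate`, and the owner names `split-0`, `split-1`, and `stepDown` are hypothetical, and the sketch assumes a strictly FIFO queue in which a pending X request blocks later IX requests.

```python
from collections import deque

# Compatibility of a held mode with a wanted mode, for the two modes
# relevant here (IX = intent exclusive, X = exclusive).
COMPATIBLE = {
    ("IX", "IX"): True,
    ("IX", "X"): False,
    ("X", "IX"): False,
    ("X", "X"): False,
}


class FifoLock:
    """Toy lock (hypothetical): a request is granted only if it is compatible
    with every current holder AND no earlier request is still waiting."""

    def __init__(self):
        self.granted = []       # list of (owner, mode)
        self.waiting = deque()  # queue of (owner, mode)

    def request(self, owner, mode):
        compatible = all(COMPATIBLE[(held, mode)] for _, held in self.granted)
        if compatible and not self.waiting:
            self.granted.append((owner, mode))
            return True
        self.waiting.append((owner, mode))
        return False


def simulate():
    rstl = FifoLock()
    results = {}
    # The first split session takes the RSTL in IX mode.
    results["split-0"] = rstl.request("split-0", "IX")
    # stepDown arrives in the middle of the loop and enqueues an X request.
    results["stepDown"] = rstl.request("stepDown", "X")
    # The next split session queues behind stepDown's pending X request.
    results["split-1"] = rstl.request("split-1", "IX")
    return rstl, results
```

In this model `split-0` is granted, while `stepDown` and `split-1` both wait. Because the committing thread is uninterruptible and does not release `split-0`'s lock until `split-1` has also been processed, neither waiter can ever be granted.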



 Comments   
Comment by Githook User [ 11/Oct/23 ]

Author: Wenbin Zhu <wenbin.zhu@mongodb.com> (username: WenbinZhu)

Message: SERVER-81226 Do not acquire RSTL for commit and abort of split transactions.
Branch: v7.0
https://github.com/mongodb/mongo/commit/a463662ab2bf1b43008de7cc2f5bbee77d4edd34

Comment by Githook User [ 05/Oct/23 ]

Author: Wenbin Zhu <wenbin.zhu@mongodb.com> (username: WenbinZhu)

Message: SERVER-81226 Do not acquire RSTL for commit and abort of split transactions.
Branch: master
https://github.com/mongodb/mongo/commit/ada003749da8ec582c7bce1e5d97ca7e1b91362b

Comment by Wenbin Zhu [ 29/Sep/23 ]

So here is what happened: I think I was right that split transactions do not acquire RSTL as part of unstashTransactionResources. However, split transactions manually acquire a GlobalLock after unstashTransactionResources, and acquiring the GlobalLock involves acquiring an RSTL. I think we should either use the skipRSTLLock option to avoid acquiring RSTL in this case, or maybe we don't need to acquire the GlobalLock in the first place.

Abort has the same issue, because the split transactions call newTxnParticipant.abortTransaction() after unstash, which also acquires RSTL. I think ideally we should manually abort the storage txn and clean up (similar to the commit case) instead of using abortTransaction(), but I vaguely remember DanG mentioned he didn't do that for some reason.

Comment by Wenbin Zhu [ 28/Sep/23 ]

After giving this a second thought, I'm not sure why the splitOpCtxs need to acquire RSTL in the first place. When we prepare a transaction, whether on primary or secondary, we always release the RSTL at the end of prepare, so when unstashing the split transaction for commit, I think there should be no RSTL in the stashed lock snapshot? I will take a look at the original BF to see if it's something else.

Comment by Wenbin Zhu [ 20/Sep/23 ]

Since on primary we apply the split transactions serially, I'm wondering if it is possible to reuse the parent OpCtx's locker for the child OpCtx? I'm not sure if the current locker contract allows this kind of pattern though.

Comment by Lingzhi Deng [ 19/Sep/23 ]

Since we know that the thread committing the prepared transaction must already hold RSTL in IX mode and it's uninterruptible, maybe the split sessions can use the compatibleFirst option to acquire their RSTL locks.

Additionally, I think the split sessions technically don't need to acquire the RSTL again given that it's already acquired at the top level. I wonder if we did that only because we needed to satisfy certain invariants down the line.
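
The compatibleFirst suggestion above can be sketched with a toy lock. This is an illustration under stated assumptions, not MongoDB's lock manager: `ToyLock`, `simulate_compatible_first`, and the `compatible_first` parameter are hypothetical, and the flag is modeled simply as "a compatible request may bypass the waiting queue", which may differ from the real option's semantics.

```python
from collections import deque


def conflicts(held, wanted):
    # Of the two modes modeled here (IX and X), only IX with IX is compatible.
    return not (held == "IX" and wanted == "IX")


class ToyLock:
    """Toy lock (hypothetical): FIFO by default, but a request flagged
    compatible_first may bypass waiters if it is compatible with all holders."""

    def __init__(self):
        self.granted = []       # list of (owner, mode)
        self.waiting = deque()  # queue of (owner, mode)

    def request(self, owner, mode, compatible_first=False):
        compatible = not any(conflicts(held, mode) for _, held in self.granted)
        may_skip_queue = compatible_first or not self.waiting
        if compatible and may_skip_queue:
            self.granted.append((owner, mode))
            return True
        self.waiting.append((owner, mode))
        return False

    def release(self, owner):
        self.granted = [(o, m) for o, m in self.granted if o != owner]
        # Grant waiters in queue order for as long as they are compatible.
        while self.waiting:
            _, mode = self.waiting[0]
            if any(conflicts(held, mode) for _, held in self.granted):
                break
            self.granted.append(self.waiting.popleft())


def simulate_compatible_first():
    rstl = ToyLock()
    rstl.request("split-0", "IX", compatible_first=True)
    rstl.request("stepDown", "X")  # waits behind split-0
    # split-1 is compatible with split-0, so it bypasses stepDown's request.
    got_split_1 = rstl.request("split-1", "IX", compatible_first=True)
    # Once the commit finishes, both split sessions release their locks.
    rstl.release("split-0")
    rstl.release("split-1")
    return got_split_1, rstl.granted
```

In this model `split-1` is granted despite the pending X request, the commit completes, and `stepDown` then obtains the lock. The fix recorded in the commit messages took the other route mentioned in the comments: split transactions no longer acquire the RSTL for commit and abort.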
