[SERVER-59226] Deadlock when stepping down with a profile session marked as uninterruptible Created: 10/Aug/21 Updated: 29/Oct/23 Resolved: 04/Sep/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.11, 4.2.18, 5.0.4, 5.1.0-rc0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Vishnu Kaushik | Assignee: | Wenbin Zhu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||||||
| Backport Requested: |
v5.0, v4.4, v4.2, v4.0
|
||||||||||||||||||||||||||||||||
| Sprint: | Repl 2021-08-23, Repl 2021-09-06 | ||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||||||||||||||||||
| Description |
|
When a primary is stepping down, when calling _stepDownFinish it kills all sessions through invalidateSessionsForStepDown -> killSessionsAction -> checkOutSessionForKill. This is after it has grabbed the RSTL via AutoGetRstlForStepUpStepDown. In the process of stepping down, if it is trying to kill an already checked out session, there is a potential for deadlock, as it needs to wait until the session is checked back in. The session will end up getting interrupted when it tries to grab a lock such as the GlobalLock. However, if the session's opCtx is marked as uninterruptible, then it is possible that the checked out session is waiting on the GlobalLock, while the step down thread (which has the RSTL) is waiting on the checked out session, causing a deadlock. This is possible when profiling. In general, it may be possible with other uses of the UninterruptibleLockGuard. |
| Comments |
| Comment by Githook User [ 01/Nov/21 ] |
|
Author: {'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}Message: Revert " This reverts commit d1a9e68d581f8e3162bc790bdb30986b8d030d86. |
| Comment by Githook User [ 01/Nov/21 ] |
|
Author: {'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}Message: (cherry picked from commit 43479818bd01f27ee25b6e992045529d2ac0185a) |
| Comment by Githook User [ 21/Oct/21 ] |
|
Author: {'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}Message: (cherry picked from commit 43479818bd01f27ee25b6e992045529d2ac0185a) |
| Comment by Githook User [ 20/Oct/21 ] |
|
Author: {'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}Message: (cherry picked from commit 43479818bd01f27ee25b6e992045529d2ac0185a) |
| Comment by Githook User [ 14/Oct/21 ] |
|
Author: {'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}Message: |
| Comment by Githook User [ 04/Sep/21 ] |
|
Author: {'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}Message: |
| Comment by Wenbin Zhu [ 30/Aug/21 ] |
|
After discussing with execution team, we decided to push the fix for now, and later investigate the ideal solutions, such as removing RSTL for non-replicated write and profiling in a separate thread/executor. I have created |
| Comment by Andy Schwerin [ 26/Aug/21 ] |
|
OperationContext shouldn't have replication-specific member fields. It's a layer violation, as can be seen by the fact that there are no replication concepts in mongos. |
| Comment by Lingzhi Deng [ 26/Aug/21 ] |
|
I think we used UninterruptibleLockGuard because we want to output the profiler entry to system.profile even if the opCtx is interrupted (e.g. the operation times out). system.profile is a non-replicate collection. And ideally, we don't actually need to acquire RSTL for non-replicate writes and writes on system.profile shouldn't get in the way of repl state transitions. But it might be difficult to skip RSTL in the write paths today. So wenbin.zhu's patch instead is to remove the UninterruptibleLockGuard but add a flag to opCtx to ignore interruptions except InterruptedDueToReplStateChange. |
| Comment by Andy Schwerin [ 26/Aug/21 ] |
|
Why is the profiling code uninterruptible in the first place? |
| Comment by Wenbin Zhu [ 18/Aug/21 ] |
|
And to answer josef.ahmad's question, I believe 4.0 is also impacted, we should be able to have a 4.0 backport. |
| Comment by Wenbin Zhu [ 17/Aug/21 ] |
|
I have successfully reproduced this issue in test. This only happens to retryable writes as mentioned in the previous comment. But note that only retryable updates and deletes are affected, not for inserts. This is because for inserts, the profile operation is called in `service_entry_point_common.cpp`, while for updates and deletes, the profile operation is called in `write_ops_exec.cpp`. The difference is that for inserts, when profile() is called, the session has already been checked in, so it will not block the stepdown thread on checking out session even though the opCtx is marked uninterruptible, while for updates and deletes, the session is still in checked out state when profile() is being called, which would cause deadlocks with stepdown thread. I'm wondering if updates and deletes can move the profile functionality after checking back in session, like insert does, as an optimization, not related to this ticket though. |
| Comment by Wenbin Zhu [ 16/Aug/21 ] |
|
This seems to only happen with retryable writes, not with transactions, because in transactions, the opCtx's locker would have the maxLockTimeout set (the `maxTransactionLockRequestTimeoutMillis` server parameter and by default 5ms). And when profiling, if the maxLockTimeout is set, we don't make the opCtx uninterruptible. Will try to reproduce the issue. The proposed fix could be making the opCtx interruptible when maxLockTimeout is not set and retry if the operation is interrupted until success, but don't retry when maxLockTimeout is set to keep the same behavior as today. |