[SERVER-53612] StepDown hangs until timeout if all nodes are caught up but none is immediately electable Created: 06/Jan/21 Updated: 29/Oct/23 Resolved: 20/Jan/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0, 4.2.13, 4.4.5, 4.0.24 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Lingzhi Deng | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4, v4.2, v4.0 |
| Sprint: | Repl 2021-01-25 |
| Participants: | |
| Linked BF Score: | 21 |
| Description |
|
For non-force stepDown, the TopologyCoordinator::tryToStartStepDown() loop in the stepDown code waits for two things: (1) lastApplied being majority committed, and (2) at least one caught-up node being electable. If either of these conditions is not met, we go into the loop body and wait only for condition 1 (lastApplied being majority committed) using the _replicationWaiterList. We only check waiters in the list if the optime has advanced for at least one member.

I guess the intention of the code might be that the majority wait will unblock again when the optime of at least one member changes, so we don't need to busy loop on TopologyCoordinator::tryToStartStepDown() checking for condition 2. But this is problematic when all members have caught up (i.e. condition 1 is fully satisfied and no member's optime can advance any further) but we still have to wait for condition 2. In that case nothing wakes the waiter, and stepDown hangs until it times out.

We could add a _doneWaitingForReplication_inlock check before adding to the waiter list. This should work because I think it's part of the contract of the _replicationWaiterList that we should always check whether the replication wait is done before adding to the waiter list. Note, though, that this will turn the wait for condition 2 into a busy-wait if condition 1 is satisfied before condition 2. But I think this is probably fine in practice. To make things a little better, before doing continue we can make the stepdown thread sleep for 10 milliseconds on an interruptible opCtx while not holding the mutex lock.

Ideally, we should have a different mechanism to wait for nodes to be electable. But it is probably not worth the complexity. |
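The loop shape proposed in the description can be sketched roughly as follows. This is a simplified, standalone model and not the actual server code: `StepDownHooks`, `waitToStepDown`, and the three callbacks are hypothetical names introduced only for illustration, and a plain `std::this_thread::sleep_for` stands in for the interruptible sleep on the operation context.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

// Hypothetical hooks standing in for the replication coordinator internals.
struct StepDownHooks {
    std::function<bool()> majorityCaughtUp;       // condition 1
    std::function<bool()> electableNodeCaughtUp;  // condition 2
    // Models the waiter-list wait: blocks until some member's optime
    // advances or the deadline passes.
    std::function<void(Clock::time_point)> waitForOptimeAdvance;
};

// Returns true if both conditions were met before the timeout expired.
bool waitToStepDown(const StepDownHooks& hooks,
                    Millis timeout,
                    Millis pollInterval = Millis(10)) {
    const auto deadline = Clock::now() + timeout;
    while (true) {
        const bool caughtUp = hooks.majorityCaughtUp();
        if (caughtUp && hooks.electableNodeCaughtUp()) {
            return true;
        }
        if (Clock::now() >= deadline) {
            return false;
        }
        if (caughtUp) {
            // Condition 1 already holds, so no optime advance is guaranteed
            // to wake a waiter. Sleep briefly and re-check condition 2
            // instead of enqueueing a wait that may never be signalled.
            std::this_thread::sleep_for(pollInterval);
            continue;
        }
        // Condition 1 does not hold yet: wait for replication progress.
        hooks.waitForOptimeAdvance(deadline);
    }
}
```

In this model, a caller whose `majorityCaughtUp` hook already returns true never reaches `waitForOptimeAdvance`; it polls `electableNodeCaughtUp` roughly every `pollInterval` until the node becomes electable or the timeout expires. That is the busy-wait trade-off the description accepts.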
| Comments |
| Comment by Githook User [ 07/Mar/21 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'} Message: (cherry picked from commit 6308db5c83a3e95f4532c63df8b635b8090036ae) |
| Comment by Githook User [ 17/Feb/21 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'} Message: (cherry picked from commit 6308db5c83a3e95f4532c63df8b635b8090036ae) |
| Comment by Githook User [ 17/Feb/21 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'} Message: (cherry picked from commit 6308db5c83a3e95f4532c63df8b635b8090036ae) |
| Comment by Githook User [ 20/Jan/21 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'} Message: |
| Comment by Jonathan Streets (Inactive) [ 07/Jan/21 ] |
|
I've increased the priority to blocker, as we need to know how serious this issue is for 4.2.12-rc0. The BF was found in 4.2; can someone please clarify which other versions are affected? Thanks, Jon
|