[SERVER-53612] StepDown hangs until timeout if all nodes are caught up but none is immediately electable Created: 06/Jan/21  Updated: 29/Oct/23  Resolved: 20/Jan/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.9.0, 4.2.13, 4.4.5, 4.0.24

Type: Bug Priority: Major - P3
Reporter: Lingzhi Deng Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-35058 Don't only rely on heartbeat to signa... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4, v4.2, v4.0
Sprint: Repl 2021-01-25
Participants:
Linked BF Score: 21

 Description   

For a non-force stepDown, the TopologyCoordinator::tryToStartStepDown() loop in the stepDown code waits for two conditions (sketched in the example after this list):
1. the primary's lastApplied optime is majority committed, and
2. at least one of the caught-up nodes is electable.
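
A minimal self-contained sketch of the predicate that tryToStartStepDown() effectively evaluates. This is illustrative standard C++, not the actual server code; aside from the conditions themselves, the Member struct and all names are invented for the example:

    #include <algorithm>
    #include <vector>

    // Toy model of a replica set member as seen by the primary. 'members'
    // below is assumed to include the primary itself, whose 'electable' flag
    // is false because it is the node stepping down.
    struct Member {
        long long lastAppliedOpTime;
        bool electable;  // e.g. priority > 0 and not frozen
    };

    // Condition 1: a majority of members have replicated the primary's
    // lastApplied optime.
    bool majorityCaughtUp(const std::vector<Member>& members, long long primaryLastApplied) {
        auto caughtUp = std::count_if(members.begin(), members.end(), [&](const Member& m) {
            return m.lastAppliedOpTime >= primaryLastApplied;
        });
        return caughtUp > static_cast<long long>(members.size()) / 2;
    }

    // Condition 2: at least one caught-up member is immediately electable.
    bool caughtUpNodeElectable(const std::vector<Member>& members, long long primaryLastApplied) {
        return std::any_of(members.begin(), members.end(), [&](const Member& m) {
            return m.lastAppliedOpTime >= primaryLastApplied && m.electable;
        });
    }

    // StepDown may begin only when both conditions hold.
    bool canStartStepDown(const std::vector<Member>& members, long long primaryLastApplied) {
        return majorityCaughtUp(members, primaryLastApplied) &&
            caughtUpNodeElectable(members, primaryLastApplied);
    }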

If either condition is unmet, we enter the loop body and wait only for (1), lastApplied becoming majority committed, using the _replicationWaiterList. Waiters in that list are checked only when the optime of at least one member has advanced. The intent of the code is presumably that the majority wait will unblock again whenever any member's optime changes, so we don't need to busy-loop on TopologyCoordinator::tryToStartStepDown() checking for condition 2. But this is problematic when all members have caught up (i.e. condition 1 is fully satisfied and no member's optime can advance any further) yet we still have to wait for condition 2: nothing ever signals the waiter, so stepDown hangs until it times out.

We could add a _doneWaitingForReplication_inlock check before adding to the waiter list. This should work because it is part of the contract of _replicationWaiterList that callers always check whether the replication wait is already done before adding to the waiter list. Note, though, that this turns condition 2 into a busy-wait whenever condition 1 is satisfied before condition 2, which is probably fine in practice. To make things a little better, before doing continue we can have the stepdown thread sleep for 10 milliseconds on an interruptible opCtx while not holding the mutex lock. A toy model of this approach follows.
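
A minimal self-contained model of the proposed fix, using only the standard library. The real code uses the server's waiter list and an interruptible opCtx sleep; here a condition variable stands in for _replicationWaiterList, two booleans stand in for the conditions, and deadline/interruption handling is omitted. All names are illustrative:

    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <thread>

    struct StepDownWaiter {
        std::mutex mutex;
        // Stands in for _replicationWaiterList: signaled only when some
        // member's optime advances.
        std::condition_variable optimeAdvanced;

        bool doneWaitingForReplication = false;  // condition 1 (set by replication)
        bool caughtUpNodeElectable = false;      // condition 2 (set by heartbeats)

        // Returns once both conditions hold.
        void waitForStepDown() {
            std::unique_lock<std::mutex> lk(mutex);
            while (!(doneWaitingForReplication && caughtUpNodeElectable)) {
                if (!doneWaitingForReplication) {
                    // Safe to block: a future optime advance will wake us.
                    // The bug was blocking here unconditionally, so once all
                    // optimes stopped advancing, nothing ever signaled us.
                    optimeAdvanced.wait(lk);
                } else {
                    // Condition 1 already holds, so no optime advance is
                    // coming; poll condition 2 with a short sleep, releasing
                    // the mutex while asleep.
                    lk.unlock();
                    std::this_thread::sleep_for(std::chrono::milliseconds(10));
                    lk.lock();
                }
            }
        }
    };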

Ideally, we should have a different mechanism to wait for nodes to be electable. But it is probably not worth the complexity.



 Comments   
Comment by Githook User [ 07/Mar/21 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-53612: Fix StepDown hangs when all nodes are caught up but none is immediately electable

(cherry picked from commit 6308db5c83a3e95f4532c63df8b635b8090036ae)
Branch: v4.0
https://github.com/mongodb/mongo/commit/6e73ad25029ebd3a31017de1190d84de0ea73a15

Comment by Githook User [ 17/Feb/21 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-53612: Fix StepDown hangs when all nodes are caught up but none is immediately electable

(cherry picked from commit 6308db5c83a3e95f4532c63df8b635b8090036ae)
Branch: v4.2
https://github.com/mongodb/mongo/commit/4fb715053b3ad308c85501e9e9d0a1169bc78556

Comment by Githook User [ 17/Feb/21 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-53612: Fix StepDown hangs when all nodes are caught up but none is immediately electable

(cherry picked from commit 6308db5c83a3e95f4532c63df8b635b8090036ae)
Branch: v4.4
https://github.com/mongodb/mongo/commit/70168a9c71de72f9af42893418525fbfc94f3576

Comment by Githook User [ 20/Jan/21 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-53612: Fix StepDown hangs when all nodes are caught up but none is immediately electable
Branch: master
https://github.com/mongodb/mongo/commit/6308db5c83a3e95f4532c63df8b635b8090036ae

Comment by Jonathan Streets (Inactive) [ 07/Jan/21 ]

I've increased the priority to blocker, as we need to know how serious this issue is for 4.2.12-rc0. The BF was found in 4.2; can someone please clarify which other versions are affected? Thanks, Jon

 
