Core Server / SERVER-53612

StepDown hangs until timeout if all nodes are caught up but none is immediately electable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.9.0, 4.2.13, 4.4.5, 4.0.24
    • Component/s: None
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4, v4.2, v4.0
    • Sprint:
      Repl 2021-01-25
    • Linked BF Score:
      21

      Description

      For non-force stepDown, the TopologyCoordinator::tryToStartStepDown() loop in the stepDown code waits for two things:
      1. the primary's lastApplied is majority committed, and
      2. one of the caught-up nodes is electable.
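      To make the hang concrete, here is a minimal C++ model of the pre-fix loop. It is a sketch under stated assumptions, not the server code: ReplSetModel, advanceOptime, markElectable and stepDownPreFix are hypothetical names, a single condition variable stands in for the _replicationWaiterList, and "majority" is computed from member optimes only (no write concern, votes or arbiters). The key property modeled is that waiters are woken only when an optime advances, never when electability changes.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical, simplified model of what the stepdown path observes.
// None of these names are MongoDB's real API.
struct ReplSetModel {
    std::mutex mutex;
    // Notified only when a member's optime advances, mimicking how the
    // replication waiter list is re-checked only on optime progress.
    std::condition_variable optimeAdvanced;
    long long primaryLastApplied = 0;
    std::vector<long long> memberOptimes;  // secondaries only
    std::vector<bool> memberElectable;     // parallel to memberOptimes

    bool caughtUp(std::size_t i) const { return memberOptimes[i] >= primaryLastApplied; }

    // Condition 1: a majority (primary included) has replicated lastApplied.
    bool lastAppliedMajorityCommitted() const {
        std::size_t votes = 1;  // the primary itself
        for (std::size_t i = 0; i < memberOptimes.size(); ++i)
            if (caughtUp(i)) ++votes;
        return votes > (memberOptimes.size() + 1) / 2;
    }

    // Condition 2: at least one caught-up member is electable.
    bool caughtUpMemberElectable() const {
        for (std::size_t i = 0; i < memberOptimes.size(); ++i)
            if (caughtUp(i) && memberElectable[i]) return true;
        return false;
    }
};

// Replication progress: advances a member's optime and wakes the waiters.
void advanceOptime(ReplSetModel& rs, std::size_t i, long long optime) {
    {
        std::lock_guard<std::mutex> lk(rs.mutex);
        rs.memberOptimes[i] = optime;
    }
    rs.optimeAdvanced.notify_all();
}

// Electability change: no optime moves, so no waiter is notified.
void markElectable(ReplSetModel& rs, std::size_t i) {
    std::lock_guard<std::mutex> lk(rs.mutex);
    rs.memberElectable[i] = true;
}

// Pre-fix shape of the loop: whenever either condition fails, block until an
// optime advances or the deadline passes. Returns false on timeout.
bool stepDownPreFix(ReplSetModel& rs, std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    std::unique_lock<std::mutex> lk(rs.mutex);
    while (!(rs.lastAppliedMajorityCommitted() && rs.caughtUpMemberElectable())) {
        if (rs.optimeAdvanced.wait_until(lk, deadline) == std::cv_status::timeout)
            return false;
    }
    return true;
}
```

      With primaryLastApplied = 5, both secondaries at optime 5 and neither electable, a thread that calls markElectable(rs, 0) shortly after stepDownPreFix starts does not unblock it: the call returns false only when the timeout expires. A lagging but electable secondary that catches up through advanceOptime does wake the loop, which is the case the existing design handles.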

      If either of these conditions is not met, we go into the loop body and wait only for (1), lastApplied becoming majority committed, using the _replicationWaiterList. Waiters in that list are only checked when the optime of at least one member advances. I guess the intention of the code might be that the majority wait will unblock again whenever some member's optime changes, so we don't need to busy-loop on TopologyCoordinator::tryToStartStepDown() checking condition 2. But this is problematic when all members have caught up (i.e. condition 1 is fully satisfied and no member's optime can advance any further) but we still have to wait for condition 2: nothing ever wakes the waiter, so stepDown hangs until it times out.

      We could add a _doneWaitingForReplication_inlock check before adding to the waiter list. This should work because I think it's part of the contract of the _replicationWaiterList that we should always check whether the replication wait is already done before adding to the waiter list. Note, though, that this turns condition 2 into a busy-wait if condition 1 is satisfied before condition 2. I think this is probably fine in practice. To make things a little better, before continuing to the next iteration, we can make the stepdown thread sleep for 10 milliseconds on an interruptible opCtx while not holding the mutex.
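      The proposed fix can be sketched with the same kind of model. Again this is a hypothetical illustration, not the actual patch: ReplSetModel, FakeOpCtx and stepDownFixed are made-up names, FakeOpCtx stands in for an interruptible OperationContext, and the "condition 1 already holds" check plays the role of _doneWaitingForReplication_inlock. When condition 1 is already satisfied, the loop skips the waiter list, releases the mutex, sleeps 10 ms interruptibly and re-checks, which is the bounded busy-wait described above.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical, simplified model; names are illustrative, not MongoDB's API.
struct ReplSetModel {
    std::mutex mutex;
    std::condition_variable optimeAdvanced;  // notified only on optime progress
    long long primaryLastApplied = 0;
    std::vector<long long> memberOptimes;    // secondaries only
    std::vector<bool> memberElectable;       // parallel to memberOptimes

    bool caughtUp(std::size_t i) const { return memberOptimes[i] >= primaryLastApplied; }
    bool lastAppliedMajorityCommitted() const {  // condition 1
        std::size_t votes = 1;  // the primary itself
        for (std::size_t i = 0; i < memberOptimes.size(); ++i)
            if (caughtUp(i)) ++votes;
        return votes > (memberOptimes.size() + 1) / 2;
    }
    bool caughtUpMemberElectable() const {  // condition 2
        for (std::size_t i = 0; i < memberOptimes.size(); ++i)
            if (caughtUp(i) && memberElectable[i]) return true;
        return false;
    }
};

// Hypothetical stand-in for an interruptible operation context.
struct FakeOpCtx {
    std::mutex m;
    std::condition_variable cv;
    bool killed = false;
    void interrupt() {
        {
            std::lock_guard<std::mutex> lk(m);
            killed = true;
        }
        cv.notify_all();
    }
    // Sleeps for up to d; returns true if the context was interrupted.
    bool sleepFor(std::chrono::milliseconds d) {
        std::unique_lock<std::mutex> lk(m);
        return cv.wait_for(lk, d, [this] { return killed; });
    }
};

// Fixed loop shape. Returns false on timeout or interruption.
bool stepDownFixed(ReplSetModel& rs, FakeOpCtx& opCtx, std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    std::unique_lock<std::mutex> lk(rs.mutex);
    while (true) {
        const bool majorityDone = rs.lastAppliedMajorityCommitted();
        if (majorityDone && rs.caughtUpMemberElectable()) return true;
        if (std::chrono::steady_clock::now() >= deadline) return false;
        if (majorityDone) {
            // Condition 1 already holds: no optime advance will wake a waiter,
            // so poll instead. Never sleep while holding the replication mutex.
            lk.unlock();
            const bool interrupted = opCtx.sleepFor(std::chrono::milliseconds(10));
            lk.lock();
            if (interrupted) return false;
            continue;
        }
        // Condition 1 not met yet: waiting for optime progress is still correct.
        rs.optimeAdvanced.wait_until(lk, deadline);
    }
}
```

      In the scenario that hangs before the fix (all members caught up, a node becoming electable without any optime advance), stepDownFixed notices the change within roughly one 10 ms polling interval instead of waiting for the deadline, and calling interrupt() on the FakeOpCtx ends the wait early. Releasing the mutex before sleeping matters because other threads need the same mutex to update member state.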

      Ideally, we would have a separate mechanism for waiting on nodes to become electable, but it is probably not worth the complexity.

              People

              Assignee:
              Lingzhi Deng (lingzhi.deng)
              Reporter:
              Lingzhi Deng (lingzhi.deng)
              Votes:
              0
              Watchers:
              6