  1. Core Server
  2. SERVER-36451

ContinuousStepdown with killing nodes can hang due to not being able to start the primary

    Details

    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.0, v3.6
    • Sprint:
      TIG 2018-09-10
    • Linked BF Score:
      19
    • Story Points:
      2

      Description

      The replica_sets_kill_primary_jscore_passthrough tests occasionally time out while waiting for a primary to be elected.

      The tests increase the election timeout to 24 hours so that the test controls which node is the primary. However, this can lead to a situation where the primary has been killed and neither secondary was able to take over because both had stale oplogs. When the killed server is brought back up and attempts to step up, it may not yet have received heartbeat responses from the other nodes in the replica set, so it assumes they are down. The step-up therefore fails, and because no further election is attempted, the test eventually times out.
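The failure mode described above can be illustrated with a deliberately simplified model. This is only a sketch: `sees_majority` and its parameters are hypothetical names, not server code, and it assumes a node will only stand for election when it, plus the peers it has heard heartbeats from, make up a majority of voting members.

```python
def sees_majority(total_voters: int, peers_heard_from: int) -> bool:
    """Toy model: can this node see a majority of the voting members?

    total_voters     -- number of voting members in the replica set
    peers_heard_from -- peers that have answered a heartbeat so far
    """
    visible = 1 + peers_heard_from  # the node always counts itself
    return visible > total_voters // 2


# A freshly restarted node in a three-node set has heard from nobody yet,
# so under this model its step-up attempt fails.
just_restarted = sees_majority(total_voters=3, peers_heard_from=0)

# Once a single heartbeat response arrives, the same attempt could succeed.
after_one_heartbeat = sees_majority(total_voters=3, peers_heard_from=1)
```

Under this model the outcome of the step-up depends only on whether the attempt happens before or after the first heartbeat response arrives, which matches the intermittent nature of the timeouts.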

      A possible solution would be to retry the step-up after some delay whenever it fails. This would give the secondaries more time to respond to the heartbeat requests.
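The retry idea could look roughly like the following. This is a minimal sketch, not the actual test hook: `step_up_with_retry`, `try_step_up`, `attempts`, and `delay_secs` are hypothetical names, and `try_step_up` stands in for whatever the test framework uses to ask a node to step up, returning True on success.

```python
import time
from typing import Callable


def step_up_with_retry(
    try_step_up: Callable[[], bool],
    attempts: int = 5,
    delay_secs: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call try_step_up until it succeeds or the attempts run out.

    Sleeps delay_secs between failed attempts so that heartbeat
    responses from the other nodes have time to arrive. The sleep
    function is injectable so the helper can be tested without waiting.
    """
    for attempt in range(1, attempts + 1):
        if try_step_up():
            return True
        if attempt < attempts:
            # No point sleeping after the final failed attempt.
            sleep(delay_secs)
    return False
```

Bounding the number of attempts keeps a genuinely broken replica set from hanging the test indefinitely; the caller can report a clear failure when the helper returns False.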
