Core Server / SERVER-39846

Replica set deployment stuck in "could not find member to sync from" for over a minute after stepdown

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • None
    • Affects Version/s: 4.1.8
    • Component/s: Replication
    • Labels:
      None
    • ALL
    • Repl 2019-03-11, Repl 2019-03-25

      While writing the stepdown tests, I ran into an issue where the cluster would become unavailable for an extended period of time after a stepdown.

      The stepdown is executed by sending:

          admin_client.database.command(replSetStepDown: nil)
      

      I wrote a script to monitor the cluster, which you can find at https://github.com/p-mongo/tests/blob/master/stepdown-perf/status.rb. It produced the output at https://gist.github.com/p-mongo/01bdbcd59cbbcfff0ca01bace636b8fb. Significant events:

      • Stepdown requested at 2019-02-26 11:52:32 -0500
      • Immediately afterward, the old primary's info message becomes "could not find member to sync from"
      • At 2019-02-26 11:52:37 -0500 all nodes start showing "could not find member to sync from" as their status message
      • At 2019-02-26 11:52:43 -0500 one of the original secondaries has an empty status message. Did it find a member to sync from?
      • In the next 10 seconds, various other nodes find members to sync from
      • At 2019-02-26 11:52:44 -0500, 27741 is syncing from 27743 while 27743 claims it has no member to sync from
      • At 2019-02-26 11:52:49 -0500 all nodes are again unable to find members to sync from
      • This condition persists until 2019-02-26 11:53:38 -0500
      • At 2019-02-26 11:53:41 -0500 27742 is a primary while claiming it is not able to find a node to sync from. Was this a rollback?
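
      For readers who cannot open the linked script, the kind of monitoring behind the timeline above can be sketched roughly as follows. This is not the linked status.rb; it is a minimal illustration that assumes the Ruby driver (mongo gem) is installed and that each node is polled directly with replSetGetStatus. The helper names summarize_members and monitor, and the example address, are hypothetical.

```ruby
# Hypothetical helper: turns a replSetGetStatus reply (a Hash-like document)
# into one line per member: name, state, and info message.
def summarize_members(status)
  status.fetch('members', []).map do |m|
    msg = m['infoMessage'].to_s
    msg = '(no message)' if msg.empty?
    format('%-20s %-10s %s', m['name'], m['stateStr'], msg)
  end
end

# Polls a single node forever and prints a timestamped summary.
# Assumes the mongo gem and a reachable node; `address` is an example
# value such as 'localhost:27741'.
def monitor(address, interval: 1)
  require 'mongo'
  client = Mongo::Client.new([address], connect: :direct, database: 'admin')
  loop do
    begin
      status = client.database.command(replSetGetStatus: 1).documents.first
      puts Time.now
      puts summarize_members(status)
    rescue Mongo::Error => e
      # A node may briefly refuse commands or drop connections around a stepdown.
      puts "#{Time.now} #{address}: #{e.class}: #{e.message}"
    end
    sleep interval
  end
end
```

      Calling monitor('localhost:27741') in one terminal per node while the stepdown is issued gives a per-second view of each member's state and message.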

      Note that all nodes are running on the local machine and remain up throughout the entire process, so as far as I can tell network connectivity is not a factor.

      I also expect no data loss during this stepdown, since all of the nodes are up and should be available and I am requesting the stepdown via an admin command. The primary reporting that it cannot find a node to sync from makes me suspect that this node rolled itself back and then elected itself primary, or something similar.

      This issue is currently blocking stepdown testing because waiting over a minute for a single election is impractical both interactively (i.e. during development of the tests) and programmatically (our hard per-test timeout is set at 45 seconds, and tests normally take under 5 seconds).
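
      To make the timing constraint concrete, a test could wait for a new primary under a deadline along these lines. This is a sketch only, not the actual test suite code: wait_until and primary_name are hypothetical helpers, and the 45-second figure is the per-test timeout mentioned above.

```ruby
# Hypothetical helper: calls the block until it returns a truthy value or
# `timeout` seconds have passed. Returns [value, elapsed_seconds];
# value is nil when the deadline is hit.
def wait_until(timeout:, interval: 0.5)
  clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
  started = clock.call
  loop do
    value = yield
    elapsed = clock.call - started
    return [value, elapsed] if value
    return [nil, elapsed] if elapsed >= timeout
    sleep interval
  end
end

# Returns the name of the member reporting PRIMARY in a replSetGetStatus
# reply, or nil if no member is primary.
def primary_name(status)
  member = status.fetch('members', []).find { |m| m['stateStr'] == 'PRIMARY' }
  member && member['name']
end
```

      In a test this might be used as wait_until(timeout: 45) { primary_name(fetch_status) }, where fetch_status is an assumed method that returns the latest replSetGetStatus reply. With the behavior reported here, that call would hit the deadline and return nil.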

      Server: MongoDB server version: 4.1.8-73-ge2251dbc97

            Assignee:
            vesselina.ratcheva@mongodb.com Vesselina Ratcheva (Inactive)
            Reporter:
            oleg.pudeyev@mongodb.com Oleg Pudeyev (Inactive)