-
Type: Bug
-
Resolution: Works as Designed
-
Priority: Major - P3
-
None
-
Affects Version/s: 4.1.8
-
Component/s: Replication
-
None
-
ALL
-
Repl 2019-03-11, Repl 2019-03-25
While writing the stepdown tests, I ran into an issue where the cluster would become unavailable for an extended period of time after a stepdown.
The stepdown is executed by sending:
admin_client.database.command(replSetStepDown: nil)
I wrote a script to monitor the cluster, which you can find at https://github.com/p-mongo/tests/blob/master/stepdown-perf/status.rb. It produced the output at https://gist.github.com/p-mongo/01bdbcd59cbbcfff0ca01bace636b8fb. Significant events:
- Stepdown requested at 2019-02-26 11:52:32 -0500
- Immediately the old primary's info message becomes "could not find member to sync from"
- At 2019-02-26 11:52:37 -0500 all nodes start showing "could not find member to sync from" as their status message
- At 2019-02-26 11:52:43 -0500 one of the original secondaries has an empty status message, did it find a member to sync from?
- In the next 10 seconds, various other nodes find members to sync from
- At 2019-02-26 11:52:44 -0500, 27741 is syncing from 27743 while 27743 claims it has no member to sync from
- At 2019-02-26 11:52:49 -0500 all nodes are again unable to find members to sync from
- This condition persists until 2019-02-26 11:53:38 -0500
- At 2019-02-26 11:53:41 -0500 27742 is a primary while claiming it is not able to find a node to sync from, was this a rollback then?
Note that all nodes are running on the local machine and are remaining up throughout this entire process, therefore as far as I can tell network connectivity is not a factor.
I also expect no data loss during this stepdown seeing how all of the nodes are up and should be available, and I am requesting stepdown via an admin command. The primary showing it is not able to find a node to sync from makes me suspect that this node may have rolled itself back and self-elected primary or something like that.
This issue is currently blocking stepdown testing because waiting over a minute for a single election is impractical both interactively (i.e. during development of the tests) and programmatically (our hard per-test timeout is set at 45 seconds, and tests normally take under 5 seconds).
Server: MongoDB server version: 4.1.8-73-ge2251dbc97
- is depended on by
-
RUBY-1572 Connections survive primary step down POC
- Closed