Type: Bug
Resolution: Works as Designed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 4.1.8
Component/s: Replication
Labels: None
Operating System: ALL
Sprint: Repl 2019-03-11, Repl 2019-03-25
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

While writing the stepdown tests, I ran into an issue where the cluster became unavailable for over a minute after a stepdown.

The stepdown is executed by sending:

    admin_client.database.command(replSetStepDown: nil)

I wrote a script to monitor the cluster, which you can find at https://github.com/p-mongo/tests/blob/master/stepdown-perf/status.rb. It produced the output at https://gist.github.com/p-mongo/01bdbcd59cbbcfff0ca01bace636b8fb. Significant events:

Stepdown requested at 2019-02-26 11:52:32 -0500
Immediately the old primary's info message becomes "could not find member to sync from"
At 2019-02-26 11:52:37 -0500 all nodes start showing "could not find member to sync from" as their status message
At 2019-02-26 11:52:43 -0500 one of the original secondaries has an empty status message. Did it find a member to sync from?
In the next 10 seconds, various other nodes find members to sync from
At 2019-02-26 11:52:44 -0500, 27741 is syncing from 27743 while 27743 claims it has no member to sync from
At 2019-02-26 11:52:49 -0500 all nodes are again unable to find members to sync from
This condition persists until 2019-02-26 11:53:38 -0500
At 2019-02-26 11:53:41 -0500 27742 is a primary while claiming it is not able to find a node to sync from. Was this a rollback, then?
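
For reference, here is a minimal sketch of the kind of per-member summary a monitor like this could print on each poll. It is not the linked status.rb script. It assumes the `name`, `stateStr`, `syncingTo` and `infoMessage` fields of a `replSetGetStatus` reply as reported by 4.1-era servers. The `summarize_members` helper, the `localhost` host names and the member states in the sample data are made up for illustration (only the ports and the message text come from the events above).

```ruby
# Summarize a replSetGetStatus reply (a Hash) as one line per member.
# Hypothetical helper: field names are assumed from 4.1-era
# replSetGetStatus output and may differ on other server versions.
def summarize_members(status)
  status.fetch('members', []).map do |m|
    source = m['syncingTo'].to_s
    source = '-' if source.empty?
    info = m['infoMessage'].to_s
    info = info.empty? ? '-' : info.inspect
    format('%s %s sync_source=%s info=%s', m['name'], m['stateStr'], source, info)
  end
end

# Illustrative data shaped like the 11:52:44 observation:
# 27741 syncs from 27743 while 27743 reports no sync source.
sample = {
  'members' => [
    { 'name' => 'localhost:27741', 'stateStr' => 'SECONDARY',
      'syncingTo' => 'localhost:27743', 'infoMessage' => '' },
    { 'name' => 'localhost:27743', 'stateStr' => 'SECONDARY',
      'syncingTo' => '', 'infoMessage' => 'could not find member to sync from' },
  ],
}
puts summarize_members(sample)

# Against a live replica set, the reply would come from the driver, e.g.:
#   status = admin_client.database.command(replSetGetStatus: 1).documents.first
#   puts Time.now, summarize_members(status)
```

Printing the sync source and the info message side by side makes contradictions like the one at 11:52:44 easy to spot in the log.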

Note that all nodes are running on the local machine and remain up throughout this entire process, so as far as I can tell network connectivity is not a factor.

I also expect no data loss during this stepdown, since all of the nodes are up and should be available, and I am requesting the stepdown via an admin command. The primary showing that it is not able to find a node to sync from makes me suspect that this node may have rolled itself back and elected itself primary, or something like that.

This issue is currently blocking stepdown testing because waiting over a minute for a single election is impractical both interactively (i.e. during development of the tests) and programmatically (our hard per-test timeout is set at 45 seconds, and tests normally take under 5 seconds).
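
To put a number on the election time relative to that 45-second budget, a test could poll until some member reports PRIMARY and record the elapsed time. The sketch below is an assumption about how that might look, not code from the test suite: `wait_for_primary` and its `probe` argument are hypothetical names, and the probe here is a fake so the example runs without a server.

```ruby
# Poll until some member reports PRIMARY, or give up after `timeout` seconds.
# `probe` is any callable returning a replSetGetStatus-like Hash.
# Returns the elapsed seconds as a Float, or nil on timeout.
def wait_for_primary(probe, timeout: 45, interval: 0.5)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  loop do
    status = begin
      probe.call
    rescue StandardError
      {} # the probed node may be unreachable during a stepdown; retry
    end
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    members = status['members'] || []
    return elapsed if members.any? { |m| m['stateStr'] == 'PRIMARY' }
    return nil if elapsed >= timeout
    sleep interval
  end
end

# Fake probe: reports a primary on the third call.
calls = 0
fake_probe = lambda do
  calls += 1
  { 'members' => [{ 'stateStr' => calls >= 3 ? 'PRIMARY' : 'SECONDARY' }] }
end

elapsed = wait_for_primary(fake_probe, timeout: 5, interval: 0.01)
puts elapsed ? format('primary after %.2fs', elapsed) : 'no primary within timeout'

# With the Ruby driver, the probe could be something like:
#   -> { admin_client.database.command(replSetGetStatus: 1).documents.first }
```

The monotonic clock avoids skew from wall-clock adjustments, and treating a failed probe as "no primary yet" keeps the loop going if the node being polled is briefly unreachable.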

Server: MongoDB server version: 4.1.8-73-ge2251dbc97

is depended on by: RUBY-1572 Connections survive primary step down POC (Closed)

Assignee: Vesselina Ratcheva (Inactive)
Reporter: Oleg Pudeyev (Inactive)
Participants: Kaloian Manassiev, Oleg Pudeyev, Vesselina Ratcheva
Votes: 0
Watchers: 8

Created: Feb 26 2019 05:30:10 PM UTC
Updated: Oct 27 2023 01:53:17 PM UTC
Resolved: Mar 07 2019 10:05:15 PM UTC
