[SERVER-39846] Replica set deployment stuck in "could not find member to sync from" for over a minute after stepdown Created: 26/Feb/19  Updated: 27/Oct/23  Resolved: 07/Mar/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.1.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Oleg Pudeyev (Inactive) Assignee: Vesselina Ratcheva (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends: is depended on by RUBY-1572 Connections survive primary step down... (Closed)
Operating System: ALL
Sprint: Repl 2019-03-11, Repl 2019-03-25
Participants:

 Description   

While writing the stepdown tests, I ran into an issue where the cluster would become unavailable for an extended period of time after a stepdown.

The stepdown is executed by sending:

    admin_client.database.command(replSetStepDown: nil)
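
The nil period above is exactly what the test sends. For reference, a minimal sketch of the same command with explicit server-side parameters (the values are illustrative assumptions, not what the test uses): replSetStepDown takes the number of seconds the old primary refuses to seek re-election, and secondaryCatchUpPeriodSecs bounds how long it waits for a secondary to catch up before stepping down.

    # Sketch only: explicit stepdown period and catch-up window.
    # The driver may see its connection to the old primary drop when the
    # stepdown takes effect, so this call can itself raise an error.
    admin_client.database.command(
      replSetStepDown: 60,             # seconds the old primary avoids re-election
      secondaryCatchUpPeriodSecs: 10   # max wait for a secondary to catch up
    )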

I wrote a script to monitor the cluster, which you can find at https://github.com/p-mongo/tests/blob/master/stepdown-perf/status.rb. It produced the output at https://gist.github.com/p-mongo/01bdbcd59cbbcfff0ca01bace636b8fb. Significant events:

  • Stepdown requested at 2019-02-26 11:52:32 -0500
  • Immediately the old primary's info message becomes "could not find member to sync from"
  • At 2019-02-26 11:52:37 -0500 all nodes start showing "could not find member to sync from" as their status message
  • At 2019-02-26 11:52:43 -0500 one of the original secondaries has an empty status message; did it find a member to sync from?
  • In the next 10 seconds, various other nodes find members to sync from
  • At 2019-02-26 11:52:44 -0500, 27741 is syncing from 27743 while 27743 claims it has no member to sync from
  • At 2019-02-26 11:52:49 -0500 all nodes are again unable to find members to sync from
  • This condition persists until 2019-02-26 11:53:38 -0500
  • At 2019-02-26 11:53:41 -0500 27742 is a primary while claiming it is not able to find a node to sync from; was this a rollback, then?
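
A minimal sketch of the kind of monitoring loop behind the output above (the real script is linked in the description; host, port, and field handling here are assumptions): it polls replSetGetStatus on one node once per second and prints each member's state, sync source, and info message.

    require 'mongo'

    # Direct connection so we always talk to the same node regardless of
    # which member is currently primary.
    client = Mongo::Client.new(['localhost:27741'],
                               connect: :direct, database: 'admin')

    loop do
      begin
        status = client.database.command(replSetGetStatus: 1).documents.first
        line = status['members'].map do |m|
          # 4.1 reports the sync source as 'syncingTo'; newer servers use
          # 'syncSourceHost', so check both.
          sync = m['syncSourceHost'] || m['syncingTo']
          info = m['infoMessage']
          "#{m['name']} #{m['stateStr']} sync=#{sync.to_s.empty? ? '-' : sync} " \
            "info=#{info.to_s.empty? ? '-' : info}"
        end.join(' | ')
        puts "#{Time.now} #{line}"
      rescue Mongo::Error => e
        # The monitored node can be briefly unreachable around the stepdown.
        puts "#{Time.now} error: #{e.class}: #{e.message}"
      end
      sleep 1
    end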

Note that all nodes are running on the local machine and remain up throughout this entire process; therefore, as far as I can tell, network connectivity is not a factor.

I also expect no data loss during this stepdown, given that all of the nodes are up and should be available, and that I am requesting the stepdown via an admin command. The primary reporting that it is unable to find a node to sync from makes me suspect that this node may have rolled itself back and then elected itself primary, or something along those lines.

This issue is currently blocking stepdown testing because waiting for over a minute for a single election is impractical both interactively (i.e., during development of the tests) and programmatically (our hard per-test timeout is set at 45 seconds, and tests normally take under 5 seconds).
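
For context, the kind of wait the tests need to perform looks roughly like the sketch below: poll each node until one reports itself primary, giving up at the per-test limit. The helper itself is hypothetical; the ports and the 45-second limit mirror this report.

    require 'mongo'

    # Hypothetical helper: returns the address of the new primary, or raises
    # if no primary appears before the timeout.
    def wait_for_primary(ports, timeout: 45)
      deadline = Time.now + timeout
      while Time.now < deadline
        ports.each do |port|
          client = Mongo::Client.new(["localhost:#{port}"],
                                     connect: :direct, database: 'admin')
          begin
            reply = client.database.command(ismaster: 1).documents.first
            return "localhost:#{port}" if reply['ismaster']
          rescue Mongo::Error
            # A node can be briefly unreachable while the election is running.
          ensure
            client.close
          end
        end
        sleep 1
      end
      raise "no primary elected within #{timeout} seconds"
    end

    wait_for_primary([27741, 27742, 27743])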

Server: MongoDB server version: 4.1.8-73-ge2251dbc97



 Comments   
Comment by Oleg Pudeyev (Inactive) [ 07/Mar/19 ]

Thanks for the help.

Comment by Oleg Pudeyev (Inactive) [ 26/Feb/19 ]

No shards, this is a 3 node replica set.

Created with: https://github.com/p-mongo/dev/blob/master/script/launch-4.1-replset and https://github.com/p-mongo/dev/blob/master/script/launch.sh.
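
For completeness, a rough sketch of what those scripts do, expressed in Ruby rather than the shell the linked scripts use (ports, paths, and the set name are assumptions): start three mongod processes on one machine and initiate a replica set across them.

    require 'mongo'
    require 'fileutils'

    ports = [27741, 27742, 27743]

    # Start one mongod per port, each with its own data directory.
    ports.each do |port|
      dir = "/tmp/stepdown-rs-#{port}"
      FileUtils.mkdir_p(dir)
      Process.spawn("mongod --replSet stepdown-rs --port #{port} " \
                    "--dbpath #{dir} --logpath #{dir}/mongod.log --fork")
    end
    sleep 5 # crude wait for the nodes to start listening

    # Initiate the replica set through a direct connection to the first node.
    client = Mongo::Client.new(["localhost:#{ports.first}"],
                               connect: :direct, database: 'admin')
    client.database.command(replSetInitiate: {
      _id: 'stepdown-rs',
      members: ports.each_with_index.map { |p, i| { _id: i, host: "localhost:#{p}" } }
    })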

Comment by Kaloian Manassiev [ 26/Feb/19 ]

oleg.pudeyev, the title/description says "cluster", but the text talks about a primary. Is this the primary of a shard replica set in a cluster?
