[SERVER-39846] Replica set deployment stuck in "could not find member to sync from" for over a minute after stepdown Created: 26/Feb/19 Updated: 27/Oct/23 Resolved: 07/Mar/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 4.1.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Oleg Pudeyev (Inactive) | Assignee: | Vesselina Ratcheva (Inactive) |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Repl 2019-03-11, Repl 2019-03-25 | ||||||||
| Participants: | |||||||||
| Description |
|
While writing the stepdown tests, I ran into an issue where the cluster would become unavailable for an extended period of time after a stepdown. The stepdown is executed by sending:
I wrote a script to monitor the cluster, which you can find at https://github.com/p-mongo/tests/blob/master/stepdown-perf/status.rb. It produced the output at https://gist.github.com/p-mongo/01bdbcd59cbbcfff0ca01bace636b8fb. Significant events:
Note that all nodes are running on the local machine and are remaining up throughout this entire process, therefore as far as I can tell network connectivity is not a factor. I also expect no data loss during this stepdown seeing how all of the nodes are up and should be available, and I am requesting stepdown via an admin command. The primary showing it is not able to find a node to sync from makes me suspect that this node may have rolled itself back and self-elected primary or something like that. This issue is currently blocking stepdown testing because waiting over a minute for a single election is impractical both interactively (i.e. during development of the tests) and programmatically (our hard per-test timeout is set at 45 seconds, and tests normally take under 5 seconds). Server: MongoDB server version: 4.1.8-73-ge2251dbc97 |
| Comments |
| Comment by Oleg Pudeyev (Inactive) [ 07/Mar/19 ] |
|
Thanks for the help. |
| Comment by Oleg Pudeyev (Inactive) [ 26/Feb/19 ] |
|
No shards, this is a 3 node replica set. Created with: https://github.com/p-mongo/dev/blob/master/script/launch-4.1-replset and https://github.com/p-mongo/dev/blob/master/script/launch.sh. |
| Comment by Kaloian Manassiev [ 26/Feb/19 ] |
|
oleg.pudeyev, the title/description says "cluster", but the text talks about primary. Is this a primary of a shard replica-set in a cluster? |