[SERVER-58619] Continuous Stepdown's replSetStepDown Is Not Resilient To External Elections Created: 16/Jul/21  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Luis Osta (Inactive) Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 0
Labels: sharding-csrs-stepdown-upkeep
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-59891 Replace the coverage from sharding_co... Backlog
Assigned Teams:
Cluster Scalability
Operating System: ALL
Participants:
Linked BF Score: 0

 Description   

The _continuousPrimaryStepdownFn function runs in a background thread that continually steps down the primary. It keeps an in-memory reference to the latest primary and updates that reference each time it steps down the old primary.

This means that if an election happens after the thread has updated its reference, the reference is stale and no longer points at the current primary. The next attempt to step down the primary then fails with a network error.

To solve this, we should wrap the command execution below in a try/catch. If the exception is a network error, swallow it; otherwise rethrow it so that it is handled by the higher-level try/catch.

When swallowing the exception, make sure to log its occurrence.

                assert.commandWorkedOrFailedWithCode(
                    primary.adminCommand(
                        {replSetStepDown: options.stepdownDurationSecs, force: true}),
                    [ErrorCodes.NotWritablePrimary, ErrorCodes.ConflictingOperationInProgress]);
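A minimal sketch of the proposed wrapping, for illustration only. The helper name stepDownIgnoringNetworkErrors and its isNetworkErrorFn and logFn parameters are hypothetical and not part of the existing fixture; in the real thread the caller would presumably pass the shell's own network-error check and print function, and would still apply assert.commandWorkedOrFailedWithCode to the returned response.

```javascript
// Hypothetical helper: runs replSetStepDown against `primary` and swallows
// only network errors, logging them. Any other exception is rethrown so the
// higher-level try/catch can handle it.
function stepDownIgnoringNetworkErrors(primary, stepdownSecs, isNetworkErrorFn, logFn) {
    try {
        const res = primary.adminCommand({replSetStepDown: stepdownSecs, force: true});
        return {swallowed: false, res: res};
    } catch (e) {
        if (!isNetworkErrorFn(e)) {
            // Not a network error: leave it to the outer handler.
            throw e;
        }
        // Likely a stale primary reference after an external election.
        logFn("Ignoring network error from replSetStepDown: " + e);
        return {swallowed: true, res: null};
    }
}
```

When the result reports swallowed as true, the thread can skip the response assertion for that iteration and rediscover the primary before its next stepdown attempt.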



 Comments   
Comment by Max Hirschhorn [ 11/Sep/21 ]

Hoping to not do this ticket and to do SERVER-59891 instead.

Generated at Thu Feb 08 05:45:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.