[SERVER-60488] Add a stress test for ReplicaSet shutdown/restart sequence that can freeze
Created: 06/Oct/21  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive)
Assignee: Backlog - Cluster Scalability
Resolution: Unresolved
Votes: 0
Labels: sharding-nyc-subteam2, sharding-wfbf-sprint
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Assigned Teams: Cluster Scalability
Operating System: ALL
Participants:
Story Points: 3

Description

As reported:

Issue Description

What is the user seeing?

As part of our automated testing, we try to shut down all members of a CSRS (config server replica set). The final member does not shut down.

Where is it happening?

Always on the final CSRS member in the set, which was a secondary.

(Note: this does not happen every time.)

When is it happening? (timeline of events)

  1. 15:33:02.756 Automation steps down cs1 as primary (runs {replSetStepDown: 120})
  2. 15:33:02.782 Automation force shuts down cs1 (without waiting for a new primary to be elected) (using {shutdown: 1, force: true})
  3. 15:33:02.763 Automation calls {replSetStepDown: 120} on cs2 (note: we always call this, even though cs2 is not a primary)
  4. 15:33:02.795 Automation force shuts down cs2 (using {shutdown: 1, force: true})
  5. 15:33:02.766 Automation calls {replSetStepDown: 120} on cs3 (note: we always call this, even though cs3 is not a primary)
  6. 15:33:02.787 Automation tries to force shut down cs3 (using {shutdown: 1, force: true})
  7. cs3 does not shut down
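
The steps above are listed per member, but their timestamps interleave: sorted by time, all three step-down calls are issued before any shutdown, and the shutdown for cs3 is sent before the one for cs2. A stress test would probably need to reproduce that interleaving. The following is a minimal Python sketch of the replay logic only, not the actual test. The `run_command` callable is hypothetical and stands in for whatever driver call the test uses. The command documents assume that the `{shutdown 1} {force true}` notation in the report means `{shutdown: 1, force: true}`.

```python
from datetime import datetime

# Command documents taken from the report. The shutdown form is an
# assumption about what "{shutdown 1} {force true}" stands for.
STEP_DOWN = {"replSetStepDown": 120}
FORCE_SHUTDOWN = {"shutdown": 1, "force": True}

# (timestamp, member, command) exactly as listed in the reported timeline.
REPORTED_EVENTS = [
    ("15:33:02.756", "cs1", STEP_DOWN),
    ("15:33:02.782", "cs1", FORCE_SHUTDOWN),
    ("15:33:02.763", "cs2", STEP_DOWN),
    ("15:33:02.795", "cs2", FORCE_SHUTDOWN),
    ("15:33:02.766", "cs3", STEP_DOWN),
    ("15:33:02.787", "cs3", FORCE_SHUTDOWN),
]


def chronological(events):
    """Return the events sorted by their wall-clock timestamp."""
    return sorted(events, key=lambda e: datetime.strptime(e[0], "%H:%M:%S.%f"))


def replay(events, run_command):
    """Issue every command in timestamp order.

    run_command(member, command) is a hypothetical hook supplied by the
    caller. Failures are recorded rather than raised, because the report
    says the step-down is sent even to members that are not primary, and a
    forced shutdown can drop the connection before a reply arrives.
    """
    outcomes = []
    for _timestamp, member, command in chronological(events):
        name = next(iter(command))  # first key is the command name
        try:
            run_command(member, command)
            outcomes.append((member, name, "ok"))
        except Exception as exc:
            outcomes.append((member, name, f"error: {exc}"))
    return outcomes
```

With PyMongo, `run_command` could wrap something like `client.admin.command(command)` on a per-member `MongoClient`. Hostnames, ports, and the check that each process has actually exited are not given in the report and would have to be supplied by the test.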

Generated at Thu Feb 08 05:49:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.