[SERVER-43262] Devise general solution for BFs needing a higher stepdown interval in stepdown suites
Created: 11/Sep/19  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Vesselina Ratcheva (Inactive)
Assignee: Backlog - Server Tooling and Methods (STM) (Inactive)
Resolution: Unresolved
Votes: 0
Labels: shell-ipc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams: Server Tooling & Methods
Participants:
Linked BF Score: 15

 Description   

This ticket came out of a replication team discussion about a BF (linked) in which a retryable write runs out of retries because, under a heavy workload, nodes do not manage to catch up in the time available between stepdown cycles. While the fix for that specific BF could be to blacklist or downsize the workload, we hope to come up with a general solution (or policy) for cases where the time between stepdown cycles turns out to be insufficient, so that we do not have to handle them on a BF-by-BF basis.

The purpose of this ticket is to serve as a place to link such BFs and to house a discussion on how we can more broadly address this class of failures.

A few ideas that came up during one of our BF meetings:

  • Increase the stepdown interval across the board.
  • Make the interval configurable from one variant to another (so we can increase it for slow variants).
  • Wait more between retries of operations in our tests and/or increase their deadlines.

Feel free to voice your opinions in the comments.



 Comments   
Comment by Steven Vannelli [ 10/May/22 ]

Moving this ticket to the Backlog and removing the "Backlog" fixVersion as per our latest policy for using fixVersions.

Comment by Max Hirschhorn [ 11/Sep/19 ]

"Wait more between retries of operations in our tests and/or increase their deadlines."

If the operation always needs a certain amount of time to succeed and that time exceeds the stepdown interval, then more retries won't help because each attempt will be interrupted by the next stepdown. This is why the set7.js and max_doc_size.js tests were blacklisted from the retryable_writes_jscore_stepdown_passthrough.yml test suite as part of SERVER-37071.
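
To illustrate the point about retries, here is a minimal Python sketch of a toy timing model. It is not taken from resmoke or the test suites; the function name, its parameters, and the assumptions (stepdowns fire at exact multiples of the interval, every stepdown interrupts the in-flight operation, and the retry starts immediately afterwards) are all hypothetical simplifications.

```python
def attempts_needed(op_duration_s, stepdown_interval_s, max_retries, start_offset_s=0.0):
    """Return the attempt number that succeeds, or None if retries run out.

    Toy model (assumption): a stepdown happens at every multiple of
    stepdown_interval_s and interrupts whatever operation is in flight.
    """
    time_s = float(start_offset_s)
    # One initial attempt plus up to max_retries retries.
    for attempt in range(1, max_retries + 2):
        # Time of the first stepdown strictly after the attempt starts.
        next_stepdown_s = (time_s // stepdown_interval_s + 1) * stepdown_interval_s
        if time_s + op_duration_s <= next_stepdown_s:
            return attempt  # finished before the next stepdown
        time_s = next_stepdown_s  # interrupted; retry right after the stepdown
    return None


# An operation longer than the interval never succeeds, however many retries it gets.
print(attempts_needed(op_duration_s=10, stepdown_interval_s=8, max_retries=100))
# A shorter operation that was merely unlucky succeeds on a retry.
print(attempts_needed(op_duration_s=5, stepdown_interval_s=8, max_retries=3, start_offset_s=6))
```

In this model, raising the retry count only helps operations that fit inside one interval; anything longer needs a longer interval or a stepdown that waits.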

I think the ideal (but more complex to implement) solution would be for the stepdown thread to wait for the maximum of the configured interval and however long it takes the client to succeed in its retry. We currently lack a way for the mongo shell to signal to the stepdown thread that its retry of the interrupted operation has succeeded.
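
A rough Python sketch of the "wait for the maximum of the interval and the retry" idea follows. It assumes the missing piece exists, namely some channel through which the client can report that its retry succeeded; here that channel is modeled by an in-process `threading.Event`. The class, its method names, and the `max_extra_s` cap (added so the stepdown thread cannot block forever) are hypothetical and not part of any existing hook.

```python
import threading
import time


class RetryAwareStepdownGate:
    """Hypothetical helper that tracks interrupted operations still being retried."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = 0
        self._idle = threading.Event()
        self._idle.set()  # nothing is pending initially

    def operation_interrupted(self):
        """Client side: an operation was interrupted by a stepdown."""
        with self._lock:
            self._pending += 1
            self._idle.clear()

    def retry_succeeded(self):
        """Client side: the retry of an interrupted operation succeeded."""
        with self._lock:
            self._pending = max(0, self._pending - 1)
            if self._pending == 0:
                self._idle.set()

    def wait_before_next_stepdown(self, interval_s, max_extra_s):
        """Stepdown side: wait the interval, then until retries succeed.

        The extra wait is capped at max_extra_s. Returns the elapsed seconds.
        """
        start = time.monotonic()
        time.sleep(interval_s)
        self._idle.wait(timeout=max_extra_s)
        return time.monotonic() - start


gate = RetryAwareStepdownGate()
gate.operation_interrupted()
# Simulate a client whose retry succeeds 0.2 seconds later.
threading.Timer(0.2, gate.retry_succeeded).start()
waited = gate.wait_before_next_stepdown(interval_s=0.05, max_extra_s=2.0)
print(f"waited {waited:.2f}s before the next stepdown")
```

The cap is a design choice for the sketch: without it, a client that never reports success would stall the stepdown cycle indefinitely.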
