[SERVER-43262] Devise general solution for BFs needing a higher stepdown interval in stepdown suites Created: 11/Sep/19 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Replication, Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Vesselina Ratcheva (Inactive) | Assignee: | Backlog - Server Tooling and Methods (STM) (Inactive) |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | shell-ipc | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Assigned Teams: |
Server Tooling & Methods
|
||||
| Participants: | |||||
| Linked BF Score: | 15 | ||||
| Description |
|
This ticket came out of a discussion we had on the replication team around a BF (linked) where a retryable write runs out of retries because, under a heavy workload, nodes do not manage to catch up in the time given between stepdown cycles. While the solution to that specific BF could be to blacklist or downsize that workload, we hope to come up with a general solution (or policy) for what happens when the time given by stepdown cycles turns out to be insufficient, so that we do not have to deal with this on a BF-by-BF basis. The purpose of this ticket is to serve as a place to link such BFs and to house a discussion on how we can address this class of failures more broadly. A few ideas that came up during one of our BF meetings:
Feel free to voice your opinions in the comments. |
| Comments |
| Comment by Steven Vannelli [ 10/May/22 ] |
|
Moving this ticket to the Backlog and removing the "Backlog" fixVersion as per our latest policy for using fixVersions. |
| Comment by Max Hirschhorn [ 11/Sep/19 ] |
If the operation is always going to take a certain amount of time to succeed, then more retries won't help, because each attempt will keep getting interrupted by the stepdown. This is why the set7.js and max_doc_size.js tests were blacklisted from the retryable_writes_jscore_stepdown_passthrough.yml test suite as part of
I think the ideal (but more complex to implement) solution would be for the stepdown thread to wait for the maximum of the configured interval and however long it takes for the client to succeed in its retry. We currently lack a way for the mongo shell to signal to the stepdown thread that its retry of the interrupted operation has succeeded. |
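To make the "wait for the maximum of the configured interval and the client's retry" idea concrete, here is a minimal sketch in Python. It is not the existing stepdown hook: the class `RetryAwareStepdownGate`, its methods, and the `max_extra_wait_secs` cap are all hypothetical names introduced for illustration, and the in-process `threading.Event` stands in for the shell-to-stepdown-thread signal that does not exist today.

```python
import threading
import time


class RetryAwareStepdownGate:
    """Hypothetical gate a stepdown thread could consult between stepdowns.

    Assumption: the client can report when a retry starts and when it
    succeeds. No such signal exists between the mongo shell and the
    stepdown thread today; an Event is used here as a stand-in.
    """

    def __init__(self, interval_secs, max_extra_wait_secs):
        self.interval_secs = interval_secs
        # Assumed safety cap so a stuck client cannot stall stepdowns forever.
        self.max_extra_wait_secs = max_extra_wait_secs
        self._retry_done = threading.Event()
        self._retry_done.set()  # No retry in flight initially.

    def client_retry_started(self):
        """Called (hypothetically) when the client begins retrying."""
        self._retry_done.clear()

    def client_retry_succeeded(self):
        """Called (hypothetically) when the client's retry has succeeded."""
        self._retry_done.set()

    def wait_before_next_stepdown(self):
        """Block for max(interval, time until the retry succeeds), capped.

        Returns the number of seconds actually waited.
        """
        start = time.monotonic()
        # Always honor the configured stepdown interval first.
        time.sleep(self.interval_secs)
        # Then wait for any in-flight retry, up to the extra cap.
        self._retry_done.wait(timeout=self.max_extra_wait_secs)
        return time.monotonic() - start
```

With no retry in flight, the gate waits only for the configured interval. With a retry in flight, it extends the wait until the client reports success or the cap expires, which is the behavior the comment above describes as ideal.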