Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Replication, Testing Infrastructure
Labels:
- shell-ipc

Assigned Teams:

Server Tooling & Methods
Linked BF Score:
15
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

This ticket came out of a replication team discussion about a BF (linked) in which a retryable write runs out of retries because, under a heavy workload, nodes do not manage to catch up in the time allowed between stepdown cycles. While the fix for that specific BF could be to blacklist or downsize the workload, we hope to come up with a general solution (or policy) for what happens when the time allowed by the stepdown cycles turns out to be insufficient, so that we do not have to handle this on a BF-by-BF basis.

The purpose of this ticket is to serve as a place to link such BFs and to house a discussion on how we can more broadly address this class of failures.

A few ideas that came up during one of our BF meetings:

- Increase the stepdown interval across the board.
- Make the interval configurable from one variant to another (so we can increase it for slow variants).
- Wait longer between retries of operations in our tests and/or increase their deadlines.
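
To make the last two ideas concrete, here is a minimal Python sketch, not a description of how the test infrastructure actually works. Every name and number in it is a hypothetical placeholder: `stepdown_interval_for`, the variant name `slow-variant`, the interval values, and `retry_with_backoff` are all assumptions made for illustration. It shows a per-variant interval override with a default fallback, and a retry loop that waits progressively longer between attempts while respecting an overall deadline.

```python
import time

# Hypothetical per-variant stepdown intervals in seconds. The variant name
# and the values are placeholders, not real configuration.
DEFAULT_STEPDOWN_INTERVAL_SECS = 8.0
STEPDOWN_INTERVAL_OVERRIDES_SECS = {"slow-variant": 20.0}


def stepdown_interval_for(variant):
    """Return the interval for a variant, falling back to the default."""
    return STEPDOWN_INTERVAL_OVERRIDES_SECS.get(
        variant, DEFAULT_STEPDOWN_INTERVAL_SECS)


class RetriesExhausted(Exception):
    """Raised when the operation keeps failing until attempts or time run out."""


def retry_with_backoff(op, max_attempts=5, base_delay=0.5, max_delay=8.0,
                       deadline_secs=60.0, sleep=time.sleep,
                       clock=time.monotonic):
    """Call op() until it succeeds, doubling the wait after each failure.

    Gives up after max_attempts calls, or earlier if the next wait would
    push the elapsed time past deadline_secs.
    """
    start = clock()
    last_error = None
    attempts_made = 0
    for attempt in range(max_attempts):
        attempts_made += 1
        try:
            return op()
        except Exception as err:
            # Illustration only: a real test would catch just the errors
            # it considers retryable.
            last_error = err
        # Exponential backoff, capped at max_delay.
        delay = min(max_delay, base_delay * (2 ** attempt))
        is_last_attempt = attempt == max_attempts - 1
        if is_last_attempt or clock() - start + delay > deadline_secs:
            break
        sleep(delay)
    raise RetriesExhausted(
        f"gave up after {attempts_made} attempts") from last_error
```

The `sleep` and `clock` parameters are injectable so that the backoff schedule can be checked without real waiting. Whether longer waits, longer deadlines, or a longer stepdown interval is the right lever remains the open question of this ticket.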

Feel free to voice your opinions in the comments.

Assignee: Backlog - Server Tooling and Methods (STM) (Inactive)
Reporter: Vesselina Ratcheva (Inactive)
Participants: Backlog - Server Tooling and Methods (STM), Max Hirschhorn, Steven Vannelli, Vesselina Ratcheva
Votes: 0
Watchers: 7

Created: Sep 11 2019 12:16:26 AM UTC
Updated: Dec 06 2022 02:48:26 AM UTC
