Type: Task
Resolution: Unresolved
Priority: Minor - P4
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: repl-metrics
Assigned Teams: Replication
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The server has a parameter that controls how long a node waits to acquire the RSTL (replication state transition lock) on stepup or stepdown before crashing. The original goal was to take down deadlocked nodes, but HELP tickets and AFs have shown that a node can exceed the timeout for many other reasons.

This ticket is to design and run experiments on the Atlas fleet that can help us tune this parameter. We may want to set it dynamically based on instance size or other factors.

This ticket is a little open-ended, since we have not previously experimented with tuning our parameters.
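
As a rough starting point for the analysis side of such an experiment, here is a minimal sketch of how a candidate timeout per instance size could be derived from an observed RSTL acquisition time distribution. Everything in it is an assumption made for illustration: the record shape `(instance_size, acquisition_ms)`, the function names, the percentile, the safety margin, and the floor are all hypothetical and not taken from the server or from any existing metrics pipeline.

```python
import math
from collections import defaultdict


def percentile(samples, q):
    """Nearest-rank percentile of samples, with q in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Nearest-rank method: the smallest value with at least q% of samples at or below it.
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def candidate_timeouts(records, q=99.9, margin=2.0, floor_ms=1000):
    """Suggest a timeout (ms) per instance size.

    records: iterable of (instance_size, acquisition_ms) pairs (hypothetical shape).
    q, margin, floor_ms: illustrative tuning knobs, not recommended values.
    """
    by_size = defaultdict(list)
    for size, acquisition_ms in records:
        by_size[size].append(acquisition_ms)

    # Take a high percentile of observed acquisition times, add headroom,
    # and never go below the floor.
    return {
        size: max(floor_ms, math.ceil(percentile(times, q) * margin))
        for size, times in by_size.items()
    }


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [("M10", 100), ("M10", 200), ("M10", 4000), ("M40", 50), ("M40", 60)]
    print(candidate_timeouts(sample))
```

A percentile plus margin is only one possible policy. The experiments may show that other factors (for example workload or storage latency) matter more than instance size, in which case the grouping key would change.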

is blocked by
SERVER-112402 Capture RSTL acquisition time distribution (Open)

related to
SERVER-112402 Capture RSTL acquisition time distribution (Open)
SERVER-112332 Create a dashboard tracking RSTL timeout failures (Blocked)

Assignee: Unassigned
Reporter: Ali Mir
Participants: Ali Mir
Votes: 0
Watchers: 6

Created: Oct 15 2025 07:08:52 PM UTC
Updated: Mar 10 2026 08:31:33 PM UTC
