-
Type: Task
-
Resolution: Unresolved
-
Priority: Minor - P4
-
Affects Version/s: None
-
Component/s: None
-
Replication
The server has a parameter that controls how long a node will wait to acquire the RSTL (replication state transition lock) on stepup or stepdown before crashing. The original goal was to take down deadlocked nodes, but HELP tickets and AFs have shown that a node can exceed the timeout for many other reasons.
This ticket is to design and run experiments on the Atlas fleet that can help us tune this parameter. We may want to set the parameter dynamically based on instance size or other factors.
This ticket is somewhat open-ended, since we have not previously run experiments to tune our parameters.
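As a starting point for discussion, here is a minimal sketch of how a per-instance-size timeout could be derived from observed RSTL acquisition times. Everything in it is an assumption for illustration: the function names, the input shape (acquisition times in seconds grouped by instance size), the 99th-percentile target, the 2x headroom factor, and the floor value are all hypothetical and are not taken from the server or from any existing tooling.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples <= it.
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def suggest_timeouts(
    acquire_secs_by_size: Dict[str, List[float]],
    pct: float = 99.0,         # hypothetical target percentile
    headroom: float = 2.0,     # hypothetical safety multiplier
    floor_secs: float = 10.0,  # hypothetical lower bound, not a server default
) -> Dict[str, float]:
    """Suggest a timeout per instance size from observed acquisition times.

    Sizes with no samples are skipped rather than guessed at.
    """
    return {
        size: max(floor_secs, headroom * percentile(samples, pct))
        for size, samples in acquire_secs_by_size.items()
        if samples
    }


if __name__ == "__main__":
    # Made-up numbers purely to exercise the function.
    observed = {
        "small": [float(n) for n in range(1, 101)],
        "large": [0.5, 1.0],
    }
    print(suggest_timeouts(observed))
```

A percentile-plus-headroom rule is only one possible policy. The experiments may show that other inputs, such as workload or the other factors mentioned above, matter more than instance size.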
- related to
  - SERVER-112402 Capture RSTL acquisition time distribution (Open)
  - SERVER-112332 Create a dashboard tracking RSTL timeout failures (Blocked)