Type: Task
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Replication
Currently we have an RSTL acquisition timeout of 1 minute on step up / step down.
It used to be 30 seconds, but we changed it to 1 minute because we suspected that in many cases we were seeing slowness rather than a deadlock, and that the slow acquisitions could not complete within 30 seconds but could complete in under a minute.
That decision was based largely on guesswork, as was the original 30 second timeout value.
The ask of this ticket is to capture the distribution of the time it takes to acquire the RSTL in the successful case. Knowing what the distribution looks like (probably one that starts heavy and drops off with a long tail) would let us choose the timeout value from data rather than guesswork.
The exact implementation of how to capture and report the acquisition time is unclear (whether via serverStatus, replSetGetStatus, etc.).
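To make the ask concrete, here is a minimal Python sketch of one possible approach: a fixed-bucket histogram of successful acquisition times, plus a percentile lookup that could inform the timeout. Everything in it is an assumption for illustration only: the bucket bounds, the class and method names, and the shape returned by `to_document` are hypothetical and do not correspond to any existing server metric or command output.

```python
import bisect
import math

# Hypothetical bucket upper bounds in milliseconds; one overflow bucket is added on top.
BUCKET_BOUNDS_MS = [1, 10, 100, 1_000, 5_000, 10_000, 30_000, 60_000]


class AcquisitionHistogram:
    """Counts successful RSTL acquisition times in fixed buckets (illustrative only)."""

    def __init__(self, bounds_ms=BUCKET_BOUNDS_MS):
        self.bounds_ms = list(bounds_ms)
        # One extra slot for samples above the largest bound.
        self.counts = [0] * (len(self.bounds_ms) + 1)
        self.total = 0

    def record(self, duration_ms):
        # bisect_left returns the first bucket whose upper bound is >= duration_ms.
        self.counts[bisect.bisect_left(self.bounds_ms, duration_ms)] += 1
        self.total += 1

    def percentile_upper_bound_ms(self, p):
        """Upper bound of the bucket that contains the p-th percentile sample."""
        if self.total == 0:
            return None
        target = math.ceil(self.total * p / 100)
        seen = 0
        for i, count in enumerate(self.counts):
            seen += count
            if seen >= target:
                return self.bounds_ms[i] if i < len(self.bounds_ms) else math.inf
        return math.inf

    def to_document(self):
        """Hypothetical shape for a section in a status command's output."""
        labels = [f"<={b}ms" for b in self.bounds_ms] + [f">{self.bounds_ms[-1]}ms"]
        return {"totalAcquisitions": self.total, "buckets": dict(zip(labels, self.counts))}
```

With a histogram like this, a candidate timeout could be read off as the bucket bound for a high percentile (for example `hist.percentile_upper_bound_ms(99.9)`), with the caveat that the answer is only as fine-grained as the chosen buckets.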
- is related to
- SERVER-112332 Create a dashboard tracking RSTL timeout failures
- Blocked