Capture RSTL acquisition time distribution

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Replication

      Currently we have an RSTL acquisition timeout of 1 minute on step up / step down.

      It used to be 30 seconds, but we changed it to 1 minute because we suspected that in many cases we were seeing slowness rather than a deadlock, and that the slowness was too slow to be resolved in 30 seconds but fast enough to be resolved in under a minute.

      We made that decision largely on guesswork, just as we did when we picked the original 30 second timeout value.

      The ask of this ticket is to capture the distribution of the amount of time it takes to acquire the RSTL in the successful case. By knowing what the distribution looks like (probably some distribution that starts heavy and drops off with a long tail) we'll know what a good timeout value is.
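      As a rough illustration of the idea (not the server's implementation), the sketch below records successful acquisition times into fixed latency buckets and reads off the bucket that contains a given percentile. Everything in it is an assumption made for illustration: the bucket bounds, the `AcquisitionHistogram` class, and the `timed_acquire` helper are hypothetical, and a plain `threading.Lock` stands in for the RSTL.

```python
import bisect
import threading
import time
from contextlib import contextmanager

# Hypothetical bucket upper bounds in milliseconds (inclusive).
# Coarser buckets toward the tail, ending at the current 1 minute timeout.
BUCKET_BOUNDS_MS = [1, 10, 100, 1_000, 5_000, 10_000, 30_000, 60_000]


class AcquisitionHistogram:
    """Counts successful lock acquisitions by how long they took."""

    def __init__(self, bounds_ms=BUCKET_BOUNDS_MS):
        self.bounds_ms = list(bounds_ms)
        # One extra overflow bucket for anything above the last bound.
        self.counts = [0] * (len(self.bounds_ms) + 1)

    def record(self, elapsed_ms):
        # bisect_left returns the first bucket whose upper bound is >= elapsed_ms.
        self.counts[bisect.bisect_left(self.bounds_ms, elapsed_ms)] += 1

    def percentile_upper_bound(self, pct):
        """Upper bound (ms) of the bucket holding the pct-th percentile.

        Returns None when nothing has been recorded and float('inf') when
        the percentile falls in the overflow bucket.
        """
        total = sum(self.counts)
        if total == 0:
            return None
        target = total * pct / 100.0
        running = 0
        for i, count in enumerate(self.counts):
            running += count
            if running >= target:
                return self.bounds_ms[i] if i < len(self.bounds_ms) else float("inf")
        return float("inf")


@contextmanager
def timed_acquire(lock, hist):
    """Acquire `lock`, record the wait in `hist`, release on exit.

    Only the successful case is recorded, matching the ask of the ticket.
    """
    start = time.monotonic()
    lock.acquire()
    hist.record((time.monotonic() - start) * 1000.0)
    try:
        yield
    finally:
        lock.release()


if __name__ == "__main__":
    hist = AcquisitionHistogram()
    with timed_acquire(threading.Lock(), hist):
        pass
    for ms in [3, 7, 40, 250, 12_000]:
        hist.record(ms)
    print(hist.counts, hist.percentile_upper_bound(99))
```

      With a histogram like this, a candidate timeout could be chosen as the upper bound of the bucket that covers, say, the 99th percentile of successful acquisitions, with the caveat that the answer is only as precise as the bucket widths.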

      The exact implementation of how to capture and expose the acquisition times is unclear (whether via serverStatus, replSetGetStatus, etc.).

            Assignee:
            Unassigned
            Reporter:
            Vishnu Kaushik
            Votes:
            0
            Watchers:
            5

              Created:
              Updated: