Run experiments with RSTL timeout parameter

    • Type: Task
    • Resolution: Unresolved
    • Priority: Minor - P4
    • Affects Version/s: None
    • Component/s: None
    • Replication

      In the server, we have a parameter that controls how long a node will wait to acquire the RSTL (replication state transition lock) during stepup or stepdown before crashing. The original goal was to take down deadlocked nodes, but HELP tickets and AFs have shown that a node can exceed the timeout for many other reasons.

      This ticket is to design and run experiments on the Atlas fleet that can help us tune this parameter. We may want to set this parameter dynamically based on instance size or other factors.

      This ticket is a little open-ended, as we haven't previously explored experimenting with and tuning our parameters.
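
      As a starting point for discussion, here is a minimal sketch of how experiment results could be turned into a per-instance-size timeout. It assumes the experiments yield samples of observed RSTL acquisition wait times (in seconds) grouped by instance size. All names here (`percentile`, `suggest_timeout_secs`, `suggest_by_instance_size`) and the default percentile, headroom, and floor values are hypothetical placeholders, not server behavior or recommended settings.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    # Nearest-rank method: smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_timeout_secs(wait_samples_secs, pct=99.0, headroom=2.0, floor_secs=15.0):
    """Suggest a timeout from observed wait times.

    pct, headroom, and floor_secs are arbitrary placeholder defaults:
    take a high percentile of observed waits, multiply by a safety
    factor, and never go below a minimum value.
    """
    return max(floor_secs, percentile(wait_samples_secs, pct) * headroom)


def suggest_by_instance_size(samples_by_size, **kwargs):
    """Map each instance size label to a suggested timeout in seconds."""
    return {
        size: suggest_timeout_secs(samples, **kwargs)
        for size, samples in samples_by_size.items()
    }
```

      The floor keeps instance sizes with consistently fast lock acquisition from ending up with a very short timeout, while the headroom factor leaves room for slow but healthy acquisitions. Whether a percentile-based rule is the right approach at all is one of the things the experiments would need to show.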

            Assignee: Unassigned
            Reporter: Ali Mir
            Votes: 0
            Watchers: 6