Investigate timeout failures in 100GB resharding locust workload


    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • Fully Compatible
    • ClusterScalability Dec8-Dec22

      Background

      The 100GB Locust resharding workload has historically failed to enter the critical section under intense write load due to timeouts: the system was unable to meet the entry condition (remaining catch-up time estimate < 500 ms) within the 6-hour limit.

      This issue was largely mitigated in SERVER-110169 by introducing short 10‑second "dips" where the write load was halved every 5 minutes. After this change, performance improved significantly, with successful runs completing the resharding phase in about 4.5 hours.
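
      As a rough illustration of that mitigation, the sketch below models the periodic "dip" pattern with Locust's LoadTestShape API. The user counts, spawn rate, and timing constants are illustrative assumptions, not the actual workload configuration.

```python
# A minimal sketch, assuming Locust's LoadTestShape API, of the periodic "dip"
# pattern described above. All numbers here are illustrative assumptions.
from locust import LoadTestShape


class PeriodicDipShape(LoadTestShape):
    """Run at full write load, but halve the user count for a short window
    at a fixed interval so resharding can catch up."""

    full_users = 200     # assumed steady-state writer count
    spawn_rate = 50      # users started per second when ramping back up
    dip_period = 300     # seconds between dips (5 minutes)
    dip_duration = 10    # seconds spent at reduced load

    def tick(self):
        run_time = self.get_run_time()
        # During the first dip_duration seconds of each period, halve the load.
        if run_time % self.dip_period < self.dip_duration:
            return (self.full_users // 2, self.spawn_rate)
        return (self.full_users, self.spawn_rate)
```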

      However, recent runs have begun failing again. In the latest failure, the last remaining catch‑up time estimate was 826,083 ms (~13.8 minutes), preventing the coordinator from entering the critical section. This estimate is unusually large and never dropped below the 500 ms threshold during the 6‑hour window.
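
      For reference, the catch-up estimate can also be watched from the server side via $currentOp, where resharding operations report remainingOperationTimeEstimatedSecs. The sketch below is a rough illustration assuming pymongo, a mongos at localhost:27017, and a 30-second polling interval; those connection details and the polling loop are assumptions.

```python
# A rough sketch, assuming pymongo and a mongos at localhost:27017, of polling
# the remaining-time estimate that resharding operations report in $currentOp.
import time

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed mongos address


def remaining_estimate_secs():
    """Return the largest remainingOperationTimeEstimatedSecs across running
    resharding operations, or None if none report one."""
    pipeline = [
        {"$currentOp": {"allUsers": True, "localOps": False}},
        {"$match": {"remainingOperationTimeEstimatedSecs": {"$exists": True}}},
    ]
    estimates = [op["remainingOperationTimeEstimatedSecs"]
                 for op in client.admin.aggregate(pipeline)]
    return max(estimates) if estimates else None


while True:
    est = remaining_estimate_secs()
    print(f"remaining estimate: {est} s")
    # The coordinator only enters the critical section once the estimate drops
    # below roughly 0.5 s (the 500 ms entry condition described above).
    if est is not None and est < 0.5:
        break
    time.sleep(30)  # assumed polling interval
```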

      Link to write‑up with additional examples.

       

      Investigation Goal

      Identify the cause of the unusually large remaining time estimates and the recurring timeout failures in the 100GB Locust resharding workload.

       

      Note: Please use the log analysis script to investigate workload failures more effectively. 
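
      The referenced log analysis script is not reproduced here. As a hypothetical illustration of the kind of filtering involved, the snippet below pulls resharding-related entries out of structured (JSON) mongod logs; the matching criteria are assumptions, not the script's actual logic.

```python
# Hypothetical illustration only -- NOT the log analysis script referenced
# above. Filters structured (JSON) mongod logs for resharding-related entries.
import json
import sys


def resharding_entries(path):
    with open(path) as f:
        for line in f:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines
            # Assumed matching criterion: message text mentions resharding.
            if "reshard" in entry.get("msg", "").lower():
                yield entry


if __name__ == "__main__":
    for entry in resharding_entries(sys.argv[1]):
        # Structured mongod logs carry the timestamp in t.$date and a stable
        # numeric id per message.
        print(entry["t"]["$date"], entry.get("id"), entry["msg"])
```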

            Assignee:
            Cheahuychou Mao
            Reporter:
            Kruti Shah
            Votes:
            0
            Watchers:
            4