Reduce contention between sweep server and checkpoint prepare by skipping sweep during active checkpoint


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Storage Engines - Foundations
    • SE Foundations - 2025-11-21
    • 5

      Investigation of the latency issues on the linked HELP ticket, which were caused by prolonged checkpoint prepare times, indicates that checkpoint prepare is being delayed by contention with the sweep server.

      While checkpoint prepare is running, application threads wait on a schema lock held by the checkpoint. This phase normally completes within a reasonable time, but in the observed cases the checkpoint prepare duration extended significantly (up to ~4 minutes).

      Root cause:

      • The sweep server runs concurrently with checkpoint prepare and attempts to close dhandles.
      • This leads to contention between the sweep server and checkpoint prepare, especially when there are many active dhandles.
      • Other checkpoints with similar numbers of active dhandles complete faster, suggesting that the contention (not the number of dhandles alone) is the main contributor to the delay.

      Proposed solution:

      Enhance the sweep server logic to actively check whether a checkpoint is in progress and bail out (skip execution) if so.
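
      The proposed bail-out could be sketched roughly as below. This is a minimal, standalone illustration and not WiredTiger code: the `sweep_state` and `handle` types, the `checkpoint_running` flag, and the functions `checkpoint_set_running` and `sweep_pass` are all hypothetical names chosen for the example. It assumes the checkpoint thread sets a shared flag around prepare and that the sweep re-checks the flag before closing each handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types and names; none of this is WiredTiger's actual API. */
typedef struct {
    atomic_bool checkpoint_running; /* Set by the checkpoint thread around prepare. */
    size_t bailouts;                /* Sweep passes cut short by a checkpoint. */
} sweep_state;

typedef struct {
    bool open; /* Handle is currently open. */
    bool idle; /* Handle is a candidate for closing. */
} handle;

static void
sweep_state_init(sweep_state *s)
{
    assert(s != NULL);
    atomic_init(&s->checkpoint_running, false);
    s->bailouts = 0;
}

/* Called by the checkpoint thread when prepare starts (true) and ends (false). */
static void
checkpoint_set_running(sweep_state *s, bool running)
{
    atomic_store(&s->checkpoint_running, running);
}

/*
 * One sweep pass over n handles. The flag is checked before every close, so a
 * checkpoint that starts mid-pass stops the sweep at the next handle.
 * Returns the number of handles closed in this pass.
 */
static size_t
sweep_pass(sweep_state *s, handle *handles, size_t n)
{
    size_t closed = 0;

    for (size_t i = 0; i < n; i++) {
        if (atomic_load(&s->checkpoint_running)) {
            s->bailouts++; /* Skip the rest; retry on the next sweep interval. */
            break;
        }
        if (handles[i].open && handles[i].idle) {
            handles[i].open = false; /* Stand-in for the real dhandle close. */
            closed++;
        }
    }
    return closed;
}
```

      A flag check like this only narrows the window: a checkpoint can still begin immediately after the check, so a real change would also need to consider how the close path and checkpoint prepare interact on the locks they share.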

      As part of the solution, there should be a workload that reproduces the problem before any changes are made, so that the same workload can be used to verify the fix. The existing benchmark in bench/dhandle could be modified or cloned for this case.

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Sid Mahajan
            Votes:
            1
            Watchers:
            8
