Investigate find_one_and_update regressions due to SERVER-124159


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Checkpoints
    • Storage Engines - Server Integration
    • 297.5
    • SE Persistence backlog

      SERVER-124519 enabled parallel checkpoints for disaggregated storage. By some metrics this was an overall improvement (20 improvements vs. 15 regressions); however, some of the regressions are severe, on the order of 500%.

      While this was discussed when the ticket was merged, the discussion never established why these regressions are acceptable. There was also some speculation that they were just noise, but that has turned out not to be the case: these 500% regressions have proven to be "sticky", persisting in subsequent runs.

      The good news is that all of the large regressions come from the find_one_and_update tasks alone, and only on the 11-node disagg variant. We should investigate what is happening in this task and, at a minimum, arrive at a reasonable explanation; ideally, we should also find a fix.

        1. image-2026-05-27-16-06-49-818.png
          202 kB
          Albert Song
        2. image-2026-05-27-16-08-26-333.png
          160 kB
          Albert Song
        3. image-2026-05-27-16-14-57-590.png
          84 kB
          Albert Song
        4. image-2026-05-27-16-16-16-887.png
          90 kB
          Albert Song
        5. image-2026-05-27-16-32-22-554.png
          419 kB
          Albert Song
        6. image-2026-05-27-16-34-40-975.png
          40 kB
          Albert Song
        7. image-2026-05-29-08-57-35-675.png
          90 kB
          Albert Song
        8. image-2026-05-29-19-08-05-308.png
          99 kB
          Mariam Mojid
        9. image-2026-05-29-19-11-07-055.png
          86 kB
          Mariam Mojid
        10. image-2026-05-29-19-33-30-189.png
          89 kB
          Mariam Mojid
        11. image-2026-05-29-20-01-17-134.png
          33 kB
          Mariam Mojid
        12. image-2026-05-29-20-02-31-544.png
          48 kB
          Mariam Mojid
        13. image-2026-05-29-20-08-03-194.png
          288 kB
          Mariam Mojid
        14. image-2026-05-29-20-08-42-154.png
          300 kB
          Mariam Mojid
        15. image-2026-05-29-20-13-03-298.png
          135 kB
          Mariam Mojid
        16. image-2026-05-29-20-32-22-951.png
          182 kB
          Mariam Mojid
        17. parallel_ckpt_enabled_vs_disabled.t2
          6 kB
          Mariam Mojid
        18. pc_disabled.png
          439 kB
          Mariam Mojid
        19. pc_enabled.png
          452 kB
          Mariam Mojid
        20. Screenshot_20260519_162352.png
          39 kB
          Will Korteland
        21. Screenshot 2026-05-29 at 7.21.09 pm.png
          128 kB
          Mariam Mojid
        22. screenshot-3.png
          113 kB
          Dennis Cheung

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Will Korteland
            Votes:
            0
            Watchers:
            6
