  1. WiredTiger
  2. WT-11171

Add metrics that give better insight into what checkpoint is doing

    • Storage Engines
    • 8
    • 2023-06-27 Lord of the Sprints, 2023-07-11 WiredTractor, 2023-07-25 Absolute unit, StorEng - 2023-08-08, ASeasonTooMany-2023-08-22
    • Needed
    • Triage and Release
      Checkpoint statistics were altered:
      * Checkpoint-related stats were moved out of the large "transaction" category into their own "checkpoint" category.
      * The descriptions were changed to drop the redundant "transaction checkpoint" text, because being in the checkpoint category already adds "checkpoint" to the description automatically.
      * "checkpoint running" became "checkpoint state". It works largely the same way, except that any non-zero value is equivalent to the old value of "1".
      * "checkpoint currently running for history store file" was folded into the checkpoint state and is no longer a separate statistic.
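
      For anyone updating monitoring that read the old "checkpoint running" flag, the change in meaning can be sketched as below. This is only an illustration: the statistics are modelled as a plain dictionary, and both key strings are assumptions, since the note above gives only the short descriptions and not the full statistic names.

      ```python
      # Hypothetical keys: the real statistic names are not given in this ticket.
      OLD_RUNNING = "transaction: transaction checkpoint currently running"
      NEW_STATE = "checkpoint: state"


      def checkpoint_is_running(stats):
          """Return True if a statistics snapshot shows a checkpoint in progress.

          `stats` maps a statistic description to its integer value.
          """
          if NEW_STATE in stats:
              # New behaviour: any non-zero state is equivalent to the old "1".
              return stats[NEW_STATE] != 0
          # Old behaviour: the flag was exactly 1 while a checkpoint was running.
          return stats.get(OLD_RUNNING, 0) == 1


      if __name__ == "__main__":
          print(checkpoint_is_running({NEW_STATE: 3}))    # new layout, running
          print(checkpoint_is_running({NEW_STATE: 0}))    # new layout, idle
          print(checkpoint_is_running({OLD_RUNNING: 1}))  # old layout, running
      ```

      The point is that a check of the form "value == 1" written against the old statistic would miss the other non-zero states, so a "!= 0" comparison is the safer translation.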

      Checkpointing is one of the most user-visibly disruptive maintenance operations WiredTiger performs. Checkpoints sometimes take a long time, and it isn't clear what the checkpoint is doing with that time.

      We should add metrics and/or log lines that give better insight.

      An example from a customer shows checkpoints taking between 15 and 20 minutes to complete, with 4 minutes of that time spent writing content (2 at the start, 2 at the end) and no meaningful indication of what happens during the remaining 11 to 16 minutes. No meaningful time is spent flushing checkpoint content to disk.
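
      The arithmetic behind the customer example can be written out as a small sketch. The phase labels and the helper name are made up for illustration; only the minute figures come from the description above.

      ```python
      def unaccounted_minutes(total_minutes, phase_minutes):
          """Return the checkpoint time not attributed to any known phase."""
          return total_minutes - sum(phase_minutes.values())


      # Figures from the customer example: 2 minutes of writes at the start
      # and 2 minutes at the end. The phase labels are illustrative only.
      known_phases = {"write content (start)": 2, "write content (end)": 2}

      if __name__ == "__main__":
          for total in (15, 20):
              gap = unaccounted_minutes(total, known_phases)
              print(f"{total} minute checkpoint: {gap} minutes unaccounted for")
      ```

      A per-phase metric would serve the goal of this ticket if it shrank that gap toward zero for a workload like this one.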

      A starting point for this is probably building or finding a workload that takes a long time to create a new checkpoint.

        1. Screenshot 2023-06-08 at 2.43.15 pm.png
          88 kB
          Alexander Gorrod
        2. Screenshot 2023-07-03 at 2.12.18 pm.png
          182 kB
          Will Korteland
        3. Screenshot 2023-07-03 at 2.12.27 pm.png
          211 kB
          Will Korteland
        4. Screenshot 2023-07-17 at 3.15.02 pm (2).png
          656 kB
          Will Korteland

            Assignee:
            will.korteland@mongodb.com Will Korteland
            Reporter:
            alexander.gorrod@mongodb.com Alexander Gorrod
            Votes:
            0
            Watchers:
            9

              Created:
              Updated:
              Resolved: