Create Grafana dashboards for checkpoint duration percentiles

    • Type: Sub-task
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Checkpoints
    • None
    • Storage Engines, Storage Engines - Persistence
    • 145.671
    • SE Persistence backlog
    • None

      Background

      We want better visibility into checkpoint duration over time so we can spot regressions and characterise checkpoint behaviour across workloads. Averages hide the tail; percentiles make slow checkpoints visible.

      Goal

      Create Grafana dashboards that plot checkpoint duration percentiles, sourced from the FTDC-derived WiredTiger stats.

      Percentiles

      Use the 0.5, 0.9, 0.95 and 0.99 quantiles (p50, p90, p95 and p99).
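
      To illustrate why these percentiles are plotted rather than an average, here is a minimal Python sketch. The duration values are hypothetical and chosen only for illustration, and the nearest-rank method used here is an assumption; the dashboard's query engine may interpolate differently.

```python
import math
from statistics import mean


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least a
    fraction q of the samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical checkpoint durations in milliseconds (not real measurements):
# mostly fast checkpoints plus a small, slow tail.
durations_ms = [800] * 85 + [2_000] * 8 + [5_000] * 4 + [60_000] * 3

# Convert to seconds, matching the "/ 1000" used for the dashboard metrics.
durations_s = [d / 1000 for d in durations_ms]

avg_s = mean(durations_s)
pcts = {q: percentile(durations_s, q) for q in (0.5, 0.9, 0.95, 0.99)}

print(f"mean = {avg_s:.2f}s")
for q, value in pcts.items():
    print(f"p{round(q * 100)} = {value:.1f}s")
```

      In this made-up sample the mean sits well above the median and far below p99, so it describes neither the typical checkpoint nor the slow ones.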

      Metrics

      The metric was renamed over time, so two stats are needed to cover it. Plot both:

      quantiles(
        "quantile",
        0.50, 0.90, 0.95, 0.99,
        (mongodb.serverStatus.wiredTiger.transaction.transaction_checkpoint_most_recent_time_msecs) / 1000
      )
      
      quantiles(
        "quantile",
        0.50, 0.90, 0.95, 0.99,
        (mongodb.serverStatus.wiredTiger.checkpoint.most_recent_time_msecs) / 1000
      )
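
      As a rough sketch of how the two queries above could be generated from one template and wrapped into dashboard panels, the Python below builds a minimal dashboard dictionary. The dashboard and panel titles are hypothetical. The panel fields (`type`, `targets`, `expr`, `fieldConfig.defaults.unit`) are assumptions based on the usual Grafana dashboard JSON layout for Prometheus-style data sources, and should be checked against the actual data source before use.

```python
import json

QUANTILES = (0.50, 0.90, 0.95, 0.99)

# The two stat names listed in this ticket.
METRICS = (
    "mongodb.serverStatus.wiredTiger.transaction."
    "transaction_checkpoint_most_recent_time_msecs",
    "mongodb.serverStatus.wiredTiger.checkpoint.most_recent_time_msecs",
)


def build_query(metric):
    """Render the quantiles(...) expression for one metric, in seconds."""
    phis = ", ".join(f"{q:.2f}" for q in QUANTILES)
    return f'quantiles("quantile", {phis}, ({metric}) / 1000)'


def build_panel(metric, panel_id):
    """Build one time-series panel (assumed Grafana JSON field names)."""
    short_name = metric.rsplit(".", 1)[-1]
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": f"Checkpoint duration percentiles: {short_name}",
        # "s" asks Grafana to format values as seconds.
        "fieldConfig": {"defaults": {"unit": "s"}},
        "targets": [{"refId": "A", "expr": build_query(metric)}],
    }


dashboard = {
    "title": "Checkpoint duration percentiles",  # hypothetical title
    "panels": [build_panel(m, i + 1) for i, m in enumerate(METRICS)],
}

dashboard_json = json.dumps(dashboard, indent=2)
print(dashboard_json)
```

      Generating both panels from the same template keeps the quantile list and the milliseconds-to-seconds conversion identical for the two metric names.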
      

      Definition of Done

      • Grafana dashboard(s) display p50/p90/p95/p99 of checkpoint duration (in seconds) for both metric names above.
      • Dashboard link(s) shared on this ticket.

            Assignee:
            Etienne Petrel
            Reporter:
            Etienne Petrel