- Type: Sub-task
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Checkpoints
- Storage Engines, Storage Engines - Persistence
- 145.671
- SE Persistence backlog
Background
We want better visibility into checkpoint duration over time so we can spot regressions and characterise checkpoint behaviour across workloads. Averages hide the tail; percentiles make slow checkpoints visible.
Goal
Create Grafana dashboards that plot checkpoint duration percentiles, sourced from the FTDC-derived WiredTiger stats.
Percentiles
Use 0.5, 0.9, 0.95 and 0.99 (p50, p90, p95 and p99).
Metrics
The stat was renamed over time, so plot both names. Each query divides by 1000 to convert milliseconds to seconds:
quantiles("quantile", 0.50, 0.90, 0.95, 0.99, (mongodb.serverStatus.wiredTiger.transaction.transaction_checkpoint_most_recent_time_msecs) / 1000)
quantiles("quantile", 0.50, 0.90, 0.95, 0.99, (mongodb.serverStatus.wiredTiger.checkpoint.most_recent_time_msecs) / 1000)
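To sanity-check dashboard values offline, the same calculation can be sketched in Python. This is a minimal sketch under stated assumptions, not the dashboard's implementation: the snapshot shape (one dict per sample, mirroring the `wiredTiger` section of serverStatus) and the function names are hypothetical, and it uses a simple nearest-rank quantile, so results may differ slightly from a query engine that interpolates between samples.

```python
import math

# Assumed (section, stat) pairs, taken from the two metric paths in the queries above.
NEW_KEY = ("checkpoint", "most_recent_time_msecs")
OLD_KEY = ("transaction", "transaction_checkpoint_most_recent_time_msecs")


def checkpoint_seconds(wired_tiger):
    """Return the most recent checkpoint duration in seconds, or None if absent.

    `wired_tiger` is a hypothetical dict shaped like serverStatus.wiredTiger.
    """
    for section, stat in (NEW_KEY, OLD_KEY):
        value = wired_tiger.get(section, {}).get(stat)
        if value is not None:
            return value / 1000  # msecs -> seconds, as in the queries
    return None


def nearest_rank(sorted_values, phi):
    """Nearest-rank quantile: the value at position ceil(phi * n), 1-indexed."""
    rank = max(1, math.ceil(phi * len(sorted_values)))
    return sorted_values[rank - 1]


def duration_percentiles(snapshots, phis=(0.50, 0.90, 0.95, 0.99)):
    """Compute checkpoint duration percentiles (seconds) across snapshots."""
    durations = sorted(
        d for d in (checkpoint_seconds(s) for s in snapshots) if d is not None
    )
    if not durations:
        return {}
    return {phi: nearest_rank(durations, phi) for phi in phis}
```

Calling `duration_percentiles(snapshots)` on a list that mixes samples under either stat name returns a dict keyed by quantile, which can be compared against the plotted p50/p90/p95/p99 series.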
Definition of Done
- Grafana dashboard(s) display p50/p90/p95/p99 of checkpoint duration (in seconds) for both metric names above.
- Dashboard link(s) shared on this ticket.