Reimplement checkpointAdoptionLagMillis and add checkpoint adoption observability metrics

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Replication
    • Fully Compatible

      `checkpointAdoptionLagMillis` currently compares the primary's latest checkpoint timestamp (gossiped via heartbeat) against the standby's installed checkpoint timestamp (received via ReadLog). Because the heartbeat arrives before the ReadLog marker, there is a brief window in which the metric reports the full timestamp gap between consecutive checkpoints. That gap is determined by the primary's checkpoint duration, not by the standby's adoption speed, so "adoption lag" is a misleading name for this metric.
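To illustrate the window described above, here is a minimal Python sketch with made-up numbers. The function names and all timestamps are hypothetical and are not taken from the server code. The sketch only contrasts a timestamp-difference metric with a wall-clock receive-to-install measurement.

```python
# Hypothetical illustration; names and values are assumptions, not server code.

def old_adoption_lag_millis(primary_latest_ckpt_ts: int,
                            standby_installed_ckpt_ts: int) -> int:
    """Timestamp gap between the gossiped and the installed checkpoint."""
    return max(0, primary_latest_ckpt_ts - standby_installed_ckpt_ts)


def adoption_delay_millis(marker_received_wall_ms: int,
                          installed_wall_ms: int) -> int:
    """Wall-clock time from receiving a checkpoint_end marker to installing it."""
    return installed_wall_ms - marker_received_wall_ms


# The primary takes checkpoints with timestamps 0 and 60_000 (one minute apart).
# The heartbeat gossiping timestamp 60_000 reaches the standby first, while the
# standby still has checkpoint 0 installed.
lag_during_window = old_adoption_lag_millis(60_000, 0)

# The ReadLog marker arrives at wall-clock 60_050 and is installed at 60_070.
delay = adoption_delay_millis(60_050, 60_070)

print(lag_during_window)  # 60000: reflects the primary's checkpoint spacing
print(delay)              # 20: reflects the standby's own adoption time
```

In this made-up scenario the old metric briefly reports a full minute even though the standby needed only 20 ms to adopt the checkpoint once the marker arrived.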

      We should improve our checkpoint diagnostics by doing the following:

      Remove:

      • checkpointAdoptionLagMillis: misleading; conflates primary checkpoint duration with standby adoption speed

      Add:

      • checkpointAdoptionDelayMillis: wall-clock time from receiving a checkpoint_end marker to installing it. This is the true measure of standby adoption speed.
      • checkpointEndReceivedCount: counter for checkpoint_end markers received from the ReadLog stream. Compare its rate against totalCheckpointInstallCount to see whether
        checkpoints are being skipped.
      • materializedLsnAdvanceCount: counter incremented each time _materializedLsn actually increases on the standby. Shows page server materialization progress as visible to
        the standby. Can stay flat if the page server hasn't pushed new offsets.
      • materializedLsnAdvanceIntervalMillis: gauge showing the milliseconds between consecutive _materializedLsn advances. Detects gaps in materialization visibility.
      • pendingCheckpointCount: gauge showing the number of checkpoint_end markers queued in CheckpointManager waiting for materialization/apply conditions. Growth
        indicates the standby is falling behind.
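As a rough sketch of how the proposed metrics relate to each other, the following Python class tracks them from three events. This is not the server implementation. The class name, the method names, and the rule that installing a checkpoint discards older pending markers as skipped are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class StandbyCheckpointMetrics:
    """Hypothetical tracker for the metrics proposed in this ticket."""
    checkpoint_end_received_count: int = 0
    total_checkpoint_install_count: int = 0
    checkpoint_adoption_delay_millis: int = 0
    materialized_lsn_advance_count: int = 0
    materialized_lsn_advance_interval_millis: int = 0
    # checkpoint timestamp -> wall-clock time the marker was received
    _pending: Dict[int, int] = field(default_factory=dict)
    _materialized_lsn: int = 0
    _last_advance_ms: Optional[int] = None

    @property
    def pending_checkpoint_count(self) -> int:
        return len(self._pending)

    def on_checkpoint_end_received(self, ckpt_ts: int, now_ms: int) -> None:
        self.checkpoint_end_received_count += 1
        self._pending[ckpt_ts] = now_ms

    def on_checkpoint_installed(self, ckpt_ts: int, now_ms: int) -> None:
        # Delay is measured from marker receipt, not from primary timestamps.
        self.checkpoint_adoption_delay_millis = now_ms - self._pending[ckpt_ts]
        self.total_checkpoint_install_count += 1
        # Assumption: markers at or before the installed checkpoint are dropped,
        # so older ones count as received but never installed (skipped).
        for ts in [t for t in self._pending if t <= ckpt_ts]:
            del self._pending[ts]

    def on_materialized_lsn(self, lsn: int, now_ms: int) -> None:
        if lsn <= self._materialized_lsn:
            return  # no real advance, so the counter stays flat
        if self._last_advance_ms is not None:
            self.materialized_lsn_advance_interval_millis = (
                now_ms - self._last_advance_ms)
        self._materialized_lsn = lsn
        self._last_advance_ms = now_ms
        self.materialized_lsn_advance_count += 1


m = StandbyCheckpointMetrics()
m.on_checkpoint_end_received(100, now_ms=1000)
m.on_checkpoint_end_received(200, now_ms=1500)
m.on_checkpoint_installed(200, now_ms=1540)
m.on_materialized_lsn(10, now_ms=2000)
m.on_materialized_lsn(10, now_ms=2500)  # same LSN, not counted
m.on_materialized_lsn(12, now_ms=2700)
```

In this run, two markers are received but only one checkpoint is installed, so the received count outpaces the install count, which is the skip signal the ticket describes.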

            Assignee:
            Ali Mir
            Reporter:
            Ali Mir
