- Type: Task
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Replication
- Fully Compatible
`checkpointAdoptionLagMillis` currently compares the primary's latest checkpoint timestamp (gossiped via heartbeat) against the standby's installed checkpoint timestamp (received via ReadLog). Because the heartbeat arrives before the ReadLog marker, this creates a brief window where the metric shows the full timestamp gap between consecutive checkpoints — which is determined by the primary's checkpoint duration, not the standby's adoption speed. As a result, "adoption lag" is a misleading name for this metric.
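The window described above can be illustrated with a small sketch. The helper name `adoption_lag_millis` and all timestamps are hypothetical and chosen only for illustration; they are not taken from the server code.

```python
def adoption_lag_millis(primary_latest_ckpt_ts, standby_installed_ckpt_ts):
    """Old-style metric: difference between the two checkpoint timestamps (ms)."""
    return primary_latest_ckpt_ts - standby_installed_ckpt_ts


# Hypothetical timeline: the primary takes checkpoints at t=1_000 and t=61_000.
standby_installed = 1_000

# The heartbeat gossips the new checkpoint timestamp before the ReadLog
# marker for that checkpoint reaches the standby.
primary_latest = 61_000
lag_in_window = adoption_lag_millis(primary_latest, standby_installed)

# The ReadLog marker arrives shortly afterwards and the standby installs it.
standby_installed = 61_000
lag_after_install = adoption_lag_millis(primary_latest, standby_installed)

print(lag_in_window, lag_after_install)  # 60000 0
```

In this sketch the metric briefly reports 60000 ms even though the standby installs the checkpoint as soon as its marker arrives. The reported value reflects the spacing between the primary's checkpoints, not how long adoption took.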
We should improve our checkpoint diagnostics by doing the following:
Remove:
- checkpointAdoptionLagMillis: misleading; conflates primary checkpoint duration with standby adoption speed.
Add:
- checkpointAdoptionDelayMillis: wall-clock time from receiving a checkpoint_end marker to installing it. This is the true measure of standby adoption speed.
- checkpointEndReceivedCount: counter for checkpoint_end markers received from the ReadLog stream. Compare its rate against totalCheckpointInstallCount to see whether checkpoints are being skipped.
- materializedLsnAdvanceCount: counter for when _materializedLsn actually increases on the standby. Shows page server materialization progress as visible to the standby. Can stay flat if the page server hasn't pushed new offsets.
- materializedLsnAdvanceIntervalMillis: gauge showing milliseconds between consecutive _materializedLsn advances. Detects gaps in materialization visibility.
- pendingCheckpointCount: gauge showing the number of checkpoint_end markers queued in CheckpointManager waiting for materialization/apply conditions. Growth indicates the standby is falling behind.
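One way the proposed metrics could relate to each other is sketched below. This is a simplified model and not the server implementation: the class `CheckpointDiagnostics`, its method names, the checkpoint identifiers, and the explicit `now_ms` arguments are all assumptions made for illustration.

```python
class CheckpointDiagnostics:
    """Hypothetical bookkeeping for the proposed standby checkpoint metrics."""

    def __init__(self):
        self.checkpoint_end_received_count = 0
        self.total_checkpoint_install_count = 0
        self.checkpoint_adoption_delay_millis = 0
        self.materialized_lsn_advance_count = 0
        self.materialized_lsn_advance_interval_millis = 0
        self._pending = {}  # checkpoint id -> time the marker was received (ms)
        self._materialized_lsn = 0
        self._last_advance_ms = None

    @property
    def pending_checkpoint_count(self):
        # Gauge: markers received but not yet installed.
        return len(self._pending)

    def on_checkpoint_end_received(self, ckpt_id, now_ms):
        self.checkpoint_end_received_count += 1
        self._pending[ckpt_id] = now_ms

    def on_checkpoint_installed(self, ckpt_id, now_ms):
        received_ms = self._pending.pop(ckpt_id)
        self.total_checkpoint_install_count += 1
        # Delay is measured from marker receipt to install on the standby.
        self.checkpoint_adoption_delay_millis = now_ms - received_ms

    def on_materialized_lsn(self, lsn, now_ms):
        if lsn <= self._materialized_lsn:
            return  # not an advance, so neither metric changes
        if self._last_advance_ms is not None:
            self.materialized_lsn_advance_interval_millis = (
                now_ms - self._last_advance_ms
            )
        self._materialized_lsn = lsn
        self._last_advance_ms = now_ms
        self.materialized_lsn_advance_count += 1


diag = CheckpointDiagnostics()
diag.on_checkpoint_end_received("ckpt-1", now_ms=100)
diag.on_checkpoint_end_received("ckpt-2", now_ms=150)
diag.on_materialized_lsn(10, now_ms=160)
diag.on_materialized_lsn(10, now_ms=170)  # same LSN: no advance recorded
diag.on_materialized_lsn(20, now_ms=200)
diag.on_checkpoint_installed("ckpt-1", now_ms=220)
```

After this sequence the sketch reports two markers received, one install, an adoption delay of 120 ms, one pending checkpoint, two LSN advances, and a 40 ms interval between them. Unlike the removed metric, the delay here depends only on events observed on the standby.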