MDB backups of WT need to know the "checkpoint timestamp" of the data pinned by a backup cursor. IIRC, sue.loverso offered to provide a query_timestamp API for this, to which I responded "we're good".
Instead, MDB does optimistic concurrency control between opening backup cursors and taking checkpoints. We don't serialize the two operations, but rather do a simple check + recheck:
- query_timestamp(get=last_checkpoint)
- open backup cursor
- query_timestamp(get=last_checkpoint)
If the two values match, we know the checkpoint timestamp of the data being copied by the backup. As long as no checkpoint completes during this three-step window, we expect the two query_timestamp calls to always return the same value.
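For concreteness, here is a minimal sketch of the check + recheck against the WiredTiger API (the function name, the EAGAIN retry-on-mismatch policy, and the abbreviated error handling are illustrative, not our actual code; the buffer sizing follows the query_timestamp documentation):

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <wiredtiger.h>

/*
 * Illustrative sketch of the check + recheck pattern described above.
 * Returns 0 with *backup_cursorp set and ckpt_ts holding the hex
 * checkpoint timestamp of the pinned data, or EAGAIN if a checkpoint
 * raced with the open and the caller should retry.
 */
static int
open_backup_with_known_checkpoint(
  WT_CONNECTION *conn, WT_SESSION *session, WT_CURSOR **backup_cursorp, char *ckpt_ts)
{
    /* Hex-encoded timestamps: two chars per byte plus a terminating nul. */
    char before[2 * sizeof(uint64_t) + 1], after[2 * sizeof(uint64_t) + 1];
    int ret;

    /* 1. query_timestamp(get=last_checkpoint). */
    if ((ret = conn->query_timestamp(conn, before, "get=last_checkpoint")) != 0)
        return (ret);

    /* 2. Open the backup cursor, pinning the most recent checkpoint. */
    if ((ret = session->open_cursor(session, "backup:", NULL, NULL, backup_cursorp)) != 0)
        return (ret);

    /* 3. query_timestamp(get=last_checkpoint) again. */
    if ((ret = conn->query_timestamp(conn, after, "get=last_checkpoint")) != 0)
        return (ret);

    /*
     * If the two values match, no checkpoint completed in the window and
     * `before` is the timestamp of the data the backup cursor pinned.
     * Otherwise, close the cursor and let the caller retry.
     */
    if (strcmp(before, after) != 0) {
        (void)(*backup_cursorp)->close(*backup_cursorp);
        *backup_cursorp = NULL;
        return (EAGAIN);
    }
    strcpy(ckpt_ts, before);
    return (0);
}
```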
We're observing cases where this assumption does not hold: consecutive calls to query_timestamp return different values even though no checkpoint was taken that would explain the change. Unfortunately, we don't have complete logging of the values returned by query_timestamp.
As an example:
D2 RECOVERY [WTCheckpointThread] Performing stable checkpoint. StableTimestamp: Timestamp(1623229619, 2), OplogNeededForRollback: Timestamp(1623229619, 2)
D1 STORAGE [WTCheckpointThread] Checkpoint took 57 seconds to complete.
I STORAGE [conn29862] Opened backup cursor. ID: <snip> Term: 547
I STORAGE [conn29862] Closed backup cursor. ID: <snip>
<long log line saying an error was returned due to the timestamps not matching>
D2 RECOVERY [WTCheckpointThread] Performing stable checkpoint. StableTimestamp: Timestamp(1623229735, 2), OplogNeededForRollback: Timestamp(1623229735, 2)
sue.loverso noted that the conn->txn_global.last_ckpt_timestamp value is written without a mutex, which could explain this observation.
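To illustrate why an unprotected write could produce exactly this symptom, here is a hypothetical sketch (names loosely mirror the WT field, but this is not the actual WiredTiger code): a plain, unsynchronized store racing with plain loads is a data race under the C memory model, so a reader overlapping the store can observe a stale, torn, or reordered value, and two back-to-back reads can disagree even though only one checkpoint completed.

```c
#include <stdint.h>

/*
 * Hypothetical sketch of the suspected hazard; not actual WT code.
 * The checkpoint thread publishes the new checkpoint timestamp with
 * a plain assignment and no mutex or atomic, while the reader
 * (query_timestamp) loads it the same way.
 */
typedef uint64_t wt_timestamp_t;

struct txn_global_sketch {
    wt_timestamp_t last_ckpt_timestamp; /* shared, unprotected field */
};

/* Checkpoint thread: plain, unsynchronized publish. */
static void
publish_checkpoint_ts(struct txn_global_sketch *tg, wt_timestamp_t ts)
{
    tg->last_ckpt_timestamp = ts; /* racy store */
}

/* Reader thread: the equivalent of two query_timestamp calls. */
static int
check_recheck(const struct txn_global_sketch *tg)
{
    wt_timestamp_t before = tg->last_ckpt_timestamp; /* racy load */
    wt_timestamp_t after = tg->last_ckpt_timestamp;  /* racy load */
    return (before == after); /* can be false mid-publish */
}
```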