Type: Bug
Resolution: Fixed
Priority: Critical - P2
Fix Version/s: 7.2.0-rc0, 7.1.0-rc4
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Storage Execution NAMER
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v7.1
Sprint: Execution NAMR Team 2023-10-02
Linked BF Score: 148
Confidence Status: None
Work Order: 0
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

After ~~SERVER-81032~~, the backup cursor service can report a checkpointTimestamp that does not match the actual checkpoint timestamp at which WT opened the backup cursor; i.e., the reported checkpointTimestamp can be <= the actual checkpointTimestamp instead of ==.

This happens because committing the checkpoint and updating txn_global.last_ckpt_timestamp (reported by getLastStableRecoveryTimestamp()) aren't atomic. As a result, we can end up in a scenario like the one below:

1) CKPT thread: WT checkpoint committed for TS(100) with ckptId:100
2) BackupService thread: Opens the _mdb_catalog cursor with the read source set to kCheckpoint.

This opens the checkpoint cursor on the latest checkpoint, ckptId:100.

3) BackupService thread: Calls getLastStableRecoveryTimestamp() and reads the previous checkpoint's timestamp, say TS(90).
4) CKPT thread: Updates `txn_global.last_ckpt_timestamp` to TS(100).
5) BackupService thread: Opens the backup cursor.
6) BackupService thread: Verifies whether any checkpoint was taken between step #3 and step #5.

To do so, it again opens the checkpoint cursor on _mdb_catalog, reads the checkpoint ID as ckptId:100, and compares it with the checkpoint ID from step #2.

Since the checkpoint IDs from step #2 and step #6 are the same, the sanity check in step #6 passes. However, the backup cursor now returns `checkpointTimestamp` as TS(90) (i.e., the step #3 value) instead of the actual checkpoint timestamp at which WT opened the backup cursor, which is TS(100).
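The interleaving above can be replayed deterministically with a toy model. This is only a sketch: `ToyEngine`, its methods, and the starting checkpoint ID/timestamp of 90 are hypothetical stand-ins, not the server or WiredTiger API. It just shows that a checkpoint-ID comparison can pass while the reported timestamp is stale.

```python
from dataclasses import dataclass


@dataclass
class ToyEngine:
    """Toy stand-in for the storage engine (hypothetical, not the WT API)."""
    committed_ckpt_id: int = 90     # ID of the latest committed checkpoint
    committed_ckpt_ts: int = 90     # timestamp of the latest committed checkpoint
    last_ckpt_timestamp: int = 90   # published value; can lag behind the commit

    def commit_checkpoint(self, ckpt_id, ts):
        # The checkpoint becomes visible to checkpoint cursors here...
        self.committed_ckpt_id = ckpt_id
        self.committed_ckpt_ts = ts

    def publish_checkpoint_timestamp(self):
        # ...but the published timestamp is updated in a separate step.
        self.last_ckpt_timestamp = self.committed_ckpt_ts

    def checkpoint_cursor_id(self):
        return self.committed_ckpt_id

    def get_last_stable_recovery_timestamp(self):
        return self.last_ckpt_timestamp

    def open_backup_cursor(self):
        # The backup is taken at the latest committed checkpoint.
        return self.committed_ckpt_ts


def run_interleaving():
    engine = ToyEngine()
    engine.commit_checkpoint(100, 100)                          # step 1
    id_before = engine.checkpoint_cursor_id()                   # step 2
    reported_ts = engine.get_last_stable_recovery_timestamp()   # step 3
    engine.publish_checkpoint_timestamp()                       # step 4
    actual_ts = engine.open_backup_cursor()                     # step 5
    id_after = engine.checkpoint_cursor_id()                    # step 6
    return id_before == id_after, reported_ts, actual_ts


print(run_interleaving())  # (True, 90, 100)
```

The ID check passes because both reads of the checkpoint cursor happen after the commit in step 1, so neither one observes the gap between the commit and the timestamp update.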

Before ~~SERVER-81032~~, because WT holds the checkpoint lock both while opening the backup cursor (step #5) and for the entire checkpoint job, calling getLastStableRecoveryTimestamp() at step #6 was guaranteed to return at least TS(100) in the above case. I also think any new checkpoint taken between step #5 and step #6 is uninteresting, so it's OK even if step #6 reads a stale last checkpoint timestamp.

My proposal is to make step #6 use the original approach, which is calling `getLastStableRecoveryTimestamp()`.
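As a rough illustration of why a timestamp-based check catches this case, the sketch below re-reads the published timestamp in step #6 and compares it with the value read in step #3. The `ToyEngine` class and `timestamp_check` function are hypothetical names for a toy model, not the actual server code, and what the server does on a mismatch is not modeled here.

```python
class ToyEngine:
    """Toy model only; names are hypothetical, not the server or WT API."""

    def __init__(self):
        self.committed_ckpt_ts = 100    # checkpoint for TS(100) already committed (step 1)
        self.last_ckpt_timestamp = 90   # published timestamp not yet updated

    def publish_checkpoint_timestamp(self):
        self.last_ckpt_timestamp = self.committed_ckpt_ts

    def get_last_stable_recovery_timestamp(self):
        return self.last_ckpt_timestamp


def timestamp_check(engine):
    ts_before = engine.get_last_stable_recovery_timestamp()   # step 3
    engine.publish_checkpoint_timestamp()                      # step 4
    # Step 5: the backup cursor would be opened here.
    ts_after = engine.get_last_stable_recovery_timestamp()    # step 6
    return ts_before, ts_after, ts_before == ts_after


print(timestamp_check(ToyEngine()))  # (90, 100, False)
```

In this toy run the two reads differ (90 vs. 100), so the stale value from step #3 is detected rather than being reported as `checkpointTimestamp`.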

is related to

SERVER-81032 Fix checkpoint detection while opening a backup cursor

Closed

WT-11709 API to retrieve timestamp of a checkpoint cursor

Closed

SERVER-81208 Use checkpoint cursor timestamp when opening backup cursor

Closed

Assignee: Gregory Noma
Reporter: Suganthi Mani
Participants: Gregory Noma, Suganthi Mani
Votes: 0
Watchers: 10

Created: Sep 19 2023 06:55:41 AM UTC
Updated: Oct 29 2023 09:16:13 PM UTC
Resolved: Sep 20 2023 08:41:43 PM UTC
Confidence Status Last Update: 19/Sep/23 5:24 PM
