[SERVER-63166] Restrict prepare_timestamp to be > stable, not >= oldest Created: 01/Feb/22  Updated: 08/Feb/22  Resolved: 08/Feb/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Keith Bostic (Inactive) Assignee: Josef Ahmad
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by WT-8745 Restrict prepare_timestamp to be > st... Closed
Operating System: ALL
Sprint: Execution Team 2022-02-21
Participants:

 Description   

The WiredTiger drop incorrectly bounds the prepare timestamp at >= oldest, when the correct bound is > stable.

This is a potential data consistency error.

The WiredTiger ticket is WT-8745.

The patch build is here: https://spruce.mongodb.com/version/61f88392a4cf476b5c07c14a/tasks

Running a patch build on the change produces one consistent error, noPassthrough fails, but there's no obvious error in the output.

Can someone please help us investigate this problem? – Thanks!



 Comments   
Comment by Keith Bostic (Inactive) [ 08/Feb/22 ]

josef.ahmad, after discussing this with daniel.gottlieb, I believe this is going to be resolved entirely in the WiredTiger layer, there won't be any need for changes to MDB Server. Thank you for your investigation, and I'm closing SERVER-63166 until and unless the expected resolution changes.

Comment by Daniel Gottlieb (Inactive) [ 07/Feb/22 ]

keith.bostic, I believe there's two separate situations to consider for bounding the prepare timestamp.

Outside of recovery (a primary or secondary processing live writes), I agree that a prepare timestamp must be > the stable timestamp. But during recovery (such as across restart), we can be re-preparing transactions that are arbitrarily old. MDB uses the roundup_timestamps=(prepared=true) configuration for this case. I can see your patch changed this rounding from oldest -> stable:

-         * Check whether the prepare timestamp needs to be rounded up to the oldest timestamp.
+         * Check whether the prepare timestamp needs to be rounded up to the stable timestamp.
          */
         if (F_ISSET(txn, WT_TXN_TS_ROUND_PREPARED)) {
             __wt_verbose(session, WT_VERB_TIMESTAMP,
-              "prepare timestamp %s rounded to oldest timestamp %s",
+              "prepare timestamp %s rounded to %s (stable timestamp + 1)",
               __wt_timestamp_to_string(prepare_ts, ts_string[0]),
-              __wt_timestamp_to_string(oldest_ts, ts_string[1]));
+              __wt_timestamp_to_string(stable_ts + 1, ts_string[1]));
 
-            prepare_ts = oldest_ts;
+            prepare_ts = stable_ts + 1;
         } else

Our test here is catching a correctness problem introduced by this portion of the patch. I'm happy to elaborate more re: what's special about recovery that brings us into this situation.

I do think there's value in enforcing prepare > stable when the "round up" configuration is missing. But roundup itself must not round beyond the oldest timestamp.

Comment by Josef Ahmad [ 07/Feb/22 ]

Here's the sequence of events that lead to the prepare_recordid_initialization.js failure:

  • Timestamp(1644242604, 2) Insert RID: 1
  • Timestamp(1644242604, 3) Pin this timestamp as the oldest timestamp
  • Timestamp(1644242604, 5) Prepare TXN insert RID: 2
  • Timestamp(1644242604, 6) Insert RID: 3
  • Timestamp(1644242604, 7) Insert RID: 4
  • Timestamp(1644242604, 8) Delete RID: 4
  • mongod restarts and starts recovering from stable timestamp. Note the last log line in the snippet below: due to WT-8745, WT resets the prepare timestamp from its original value (>= oldest) to a value that's > stable.

    "msg":"WiredTiger recoveryTimestamp","attr":{"recoveryTimestamp":{"$timestamp":{"t":1644242604,"i":8}}}}
    "msg":"WiredTiger oldestTimestamp","attr":{"oldestTimestamp":{"$timestamp":{"t":1644242604,"i":3}}}}
    "msg":"Recovering from stable timestamp","attr":{"stableTimestamp":{"$timestamp":{"t":1644242604,"i":8}},"topOfOplog":{"ts":{"$timestamp":{"t":1644242604,"i":8}},"t":1},"appliedThrough":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}}}
    "msg":"WiredTiger message","attr":{[..]"msg":"prepare timestamp (1644242604, 5) rounded to (1644242604, 9) (stable timestamp + 1)"}}}
    

  • The test commits the prepared transaction. Although the log reports the original prepare Timestamp(1644242604, 5), the transaction actually commits with the new Timestamp(1644242604, 9) due to the above.

    {"t":{"$date":"2022-02-07T14:03:33.298+00:00"},"s":"I",  "c":"TXN",      "id":51802,   "ctx":"conn1","msg":"transaction","attr":{"parameters":{"lsid":{"id":{"$uuid":"1f9c85aa-7000-4aed-af90-49cf2aae5106"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"txnNumber":0,"txnRetryCounter":0,"autocommit":false,"readConcern":{}},"readTimestamp":"Timestamp(0, 0)","terminationCause":"committed","timeActiveMicros":7168,"timeInactiveMicros":182265,"numYields":0,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":4}},"Global":{"acquireCount":{"w":3}},"Database":{"acquireCount":{"w":3}},"Collection":{"acquireCount":{"w":3}},"Mutex":{"acquireCount":{"r":2}}},"storage":{"data":{"bytesRead":314,"timeReadingMicros":17}},"wasPrepared":true,"totalPreparedDurationMicros":185673,"prepareOpTime":{"ts":{"$timestamp":{"t":1644242604,"i":5}},"t":1},"durationMillis":189}}
    

  • The test fails as it's unable to read the RID:2 record at Timestamp(1644242604, 7), likely because the the prepare timestamp has been moved to the future Timestamp(1644242604, 9).
Generated at Thu Feb 08 05:57:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.