[SERVER-58409] Startup RecordId initialization is flawed with durable history and reconstructing prepared transactions Created: 10/Jul/21  Updated: 29/Oct/23  Resolved: 29/Oct/21

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 5.1.0, 4.4.6, 5.0.0-rc8
Fix Version/s: 5.2.0, 5.1.2, 5.0.6

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Louis Williams
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 25686_repro.js     File server-58409.js    
Issue Links:
Backports
Depends
depends on WT-8241 Skip value return for largest key Closed
depends on WT-8226 Fix largest_key failed to consider pr... Closed
depends on WT-7918 Allow setting the prepare timestamp s... Closed
depends on WT-7992 Provide API to return the last key in... Closed
is depended on by WT-8114 Revert allow setting the prepare time... Closed
Problem/Incident
causes SERVER-62650 RecordStore RecordId initialization c... Closed
Related
related to WT-7783 Fix RTS to restore tombstone when an ... Closed
related to WT-7815 Properly initialize prev_upd_ts for o... Closed
related to WT-7820 Retrieve the on-disk durable timestam... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.1, v5.0, v4.4
Sprint: Execution Team 2021-08-09, Execution Team 2021-08-23, Execution Team 2021-09-06, Execution Team 2021-09-20, Execution Team 2021-10-18, Execution Team 2021-11-01, Execution Team 2021-11-15
Participants:
Linked BF Score: 135

 Description   

MDB generates RecordIds from an auto-incrementing integer. On restart, or after rollback to stable, initialization opens a cursor (over the data visible at the stable timestamp) and takes one reverse step to find the largest RecordId currently in use for a document.
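For illustration, a minimal sketch of that initialization using the WiredTiger cursor API (the table name and key/value formats are illustrative, not the actual record store code):

#include <wiredtiger.h>

/* Sketch only: find the largest visible RecordId with one reverse cursor step. */
static int64_t
highest_visible_record_id(WT_SESSION *session)
{
    WT_CURSOR *cursor;
    int64_t largest = 0;

    /* The cursor only sees data visible in the current snapshot. */
    if (session->open_cursor(session, "table:collection-0", NULL, NULL, &cursor) != 0)
        return (0);
    if (cursor->prev(cursor) == 0)    /* One reverse step: the largest visible key. */
        cursor->get_key(cursor, &largest);
    cursor->close(cursor);
    return (largest);                 /* The next RecordId counter starts at largest + 1. */
}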

Now consider the following sequence with durable history:

  • Set OldestTimestamp 1
  • Insert RecordId(1) -> A at TimeStamp(10)
  • Insert RID(2) -> B at TS(20)
  • Delete RID(2) (B) at TS(30)

If we were to restart and initialize the next record id, we'd start issuing RID(2) to new documents. Normally this is fine: any new replicated user write must be generated with a timestamp larger than 30, so the update chain for RID(2) remains valid.

However, when reconstructing prepared transactions, the prepare timestamp (and thus any following commit timestamp, but not the durable timestamp) may be arbitrarily old.

In this example, after initializing the next RID to 2, if we were to reconstruct a prepared transaction from TS(10) that performs an insert on this collection, we'd get the following update chain (from oldest to newest):

RID(2) => B @ TS(20) -> <tombstone> @ TS(30) -> PreparedInsert @ TS(10)

Committing the prepared insert at a timestamp between 10 and 30 produces wrong results/inconsistent data when reading at those timestamps.



 Comments   
Comment by Githook User [ 01/Dec/21 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-58409 Use WiredTiger largest_key API to initialize the highest RecordId in a collection
This fixes a bug that allows a RecordId to be incorrectly reused during startup recovery's prepared transaction reconstruction.

(cherry picked from commit c4f813e20892a5bd22d0d7b109542c2b242d4d5e)
Branch: v5.0
https://github.com/mongodb/mongo/commit/c0022ed6effd3419c553597ef575053c80e8aca9

Comment by Githook User [ 01/Dec/21 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-58409 Use WiredTiger largest_key API to initialize the highest RecordId in a collection
This fixes a bug that allows a RecordId to be incorrectly reused during startup recovery's prepared transaction reconstruction.

(cherry picked from commit c4f813e20892a5bd22d0d7b109542c2b242d4d5e)
Branch: v5.1
https://github.com/mongodb/mongo/commit/ee678223f9e3ae68a27b3c0f434c2449b903196b

Comment by Githook User [ 29/Oct/21 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-58409 Use WiredTiger largest_key API to initialize the highest RecordId in a collection
This fixes a bug that allows a RecordId to be incorrectly reused during startup recovery's prepared transaction reconstruction.
Branch: master
https://github.com/mongodb/mongo/commit/c4f813e20892a5bd22d0d7b109542c2b242d4d5e
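For context, a sketch of how the new WT_CURSOR::largest_key call can be used for initialization; unlike a reverse cursor step, it returns the largest key in the table regardless of visibility (table name and key format are illustrative):

#include <wiredtiger.h>

static int64_t
highest_record_id_any_visibility(WT_SESSION *session)
{
    WT_CURSOR *cursor;
    int64_t largest = 0;

    if (session->open_cursor(session, "table:collection-0", NULL, NULL, &cursor) != 0)
        return (0);
    /* largest_key positions the cursor on the largest key ever used, even if the
     * newest update for that key is a tombstone that is visible to everyone. */
    if (cursor->largest_key(cursor) == 0)
        cursor->get_key(cursor, &largest);
    cursor->close(cursor);
    return (largest);
}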

Comment by Louis Williams [ 27/Aug/21 ]

I discussed with vamsi.krishna and keith.smith and we agreed on the solution in WT-7992 to implement the desired "return the last key" behavior.

Comment by Louis Williams [ 23/Aug/21 ]

vamsi.krishna,

my understanding is the read timestamp is of the same transaction which is trying to commit the prepared transaction

I think you're mostly describing the correct scenario, but I want to clarify that this prepared transaction is not yet committed and is not in the process of being committed. The prepared transaction is being "reconstructed", but we have not received an oplog entry to commit yet.

So, when we get this error, we could roll back the existing transaction and start a new one without any read timestamp and commit the prepared transaction.

I don't understand why this works. If we retry reconstructing the prepared transaction without a read timestamp, then we will just encounter the same bug that this ticket aims to fix.

Presently, my proposed code review changes almost 300 lines of code, and it still hasn't correctly solved this problem. It makes a pretty complex change to our code to dance around the fact that what we really need is an API to return the last key in the table, regardless of visibility. I would like to revisit this idea.

So to revisit the previous discussion, I'm not certain that the "overwrite=true" case is guaranteed to work in all cases, and I feel that this is really at the wrong entry point to WiredTiger. The ideal solution would be in the read path, not the write path. The other option would be my proposed alternative of adding another assertion suppression in WiredTiger. I've filed WT-7992. Let me know what you think. CC alexander.gorrod

Comment by Louis Williams [ 20/Aug/21 ]

Update: I've run into some issues with the way we timestamp multikey writes during prepared transaction reconstruction. Specifically, I'm hitting this error: "commit timestamp _ must be greater than the latest active read timestamp". I am trying to think through workarounds, but since we are essentially reading at the stable timestamp and committing at the stable timestamp, I don't think there is a "safe" timestamp to use.

More specifically:
MongoDB is performing a write during startup recovery of a prepared transaction where the read timestamp equals the prepare timestamp. Normally a commit of this prepared transaction would assign a new OpTime, but we have a multikey edge case where we start a side transaction that tries to perform a write whose commit timestamp equals its read timestamp. The code rounds the commit timestamp up to the stable timestamp, but the stable timestamp is also equal to this timestamp, so we fail the assertion that the commit timestamp must be greater than the latest active read timestamp.
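To make that concrete, a schematic sketch of the failing interaction, assuming two WT sessions and a stable_ts variable (illustrative only, not the actual server code path):

#include <inttypes.h>
#include <stdio.h>
#include <wiredtiger.h>

static int
multikey_write_at_stable(WT_SESSION *reader, WT_SESSION *writer, uint64_t stable_ts)
{
    char cfg[64];

    /* The reconstruction read is pinned at the stable timestamp. */
    snprintf(cfg, sizeof(cfg), "read_timestamp=%" PRIx64, stable_ts);
    reader->begin_transaction(reader, cfg);

    /* The multikey side transaction tries to commit at that same timestamp. */
    writer->begin_transaction(writer, NULL);
    /* ... the multikey metadata write would happen here ... */
    snprintf(cfg, sizeof(cfg), "commit_timestamp=%" PRIx64, stable_ts);
    /* Expected to fail: the commit timestamp must be greater than the latest
     * active read timestamp. */
    return (writer->timestamp_transaction(writer, cfg));
}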

vamsi.krishna/chenhao.qu, would it be reasonable to add an additional suppression for this assertion when "roundup_timestamps=(prepared=true)" is set?

One alternative I see is going back to using ghost timestamps (using timestamps further in the past), which are more problematic and incorrect. It would not be possible to just generate a new OpTime because the recovering node is not a primary.

Comment by Louis Williams [ 11/Aug/21 ]

daniel.gottlieb, agreed that would be a problem if we defaulted all writes to overwrite=false and also used this new configuration. We could use the new configuration only when applying prepared transactions; however, I have some concerns about how we would work that into our current RecordStore API. That would be a messier change for us than just being able to suppress the assertion in WT.

Comment by Daniel Gottlieb (Inactive) [ 11/Aug/21 ]

Re:

I think your proposed trick to use the prepared timestamp as the read timestamp should work. The error you encountered, "prepared timestamp cannot be smaller or equal to active read timestamp", can be worked around.

I think you only need to rollback the current transaction if you hit that error

We always hit that error. That assertion will always trigger when a single WT transaction reads at X and then tries to prepare at X.

and then start a new transaction with a larger read timestamp or no read timestamp to reconstruct the prepared update again

The algorithm is incorrect if we choose a different read timestamp (or omit it). We may fail to generate a write conflict and get back into the same situation of creating an out-of-order update chain.

Re:

suggested adding a new config when overwrite=false.

From what I can tell, our typical record stores use overwrite=true. I couldn't say if we actually have to use overwrite=true.

After the proposed change, with the config set, we will return a conflict if the tombstone is not globally visible (i.e. the entry is valid at some point after the oldest timestamp).

Assuming we can and want to change these inserts to use overwrite=false, this sounds like it would cause conflicts even when we choose a good timestamp (i.e. false positives for all of oplog recovery). That could be a potentially unbounded cost that everyone would have to pay.

Does my interpretation match your understanding louis.williams?

Comment by Chenhao Qu [ 11/Aug/21 ]

daniel.gottlieb louis.williams We have reviewed the choices again. We now have two new proposals:

  • I think your proposed trick to use the prepared timestamp as the read timestamp should work. The error you encountered, "prepared timestamp cannot be smaller or equal to active read timestamp", can be worked around. I think you only need to rollback the current transaction if you hit that error and then start a new transaction with a larger read timestamp, or no read timestamp, to reconstruct the prepared update again. I'm not sure whether it is possible to rollback that transaction; if it is, that should work around the problem.
  • alexander.gorrod suggested adding a new config for when overwrite=false. When this new config is set, the cursor insert operation will return a conflict if the key exists, irrespective of its visibility. In the current implementation, if the tombstone is visible, we don't return a conflict for an insert with overwrite=false. After the proposed change, with the config set, we will return a conflict if the tombstone is not globally visible (i.e. the entry is valid at some point after the oldest timestamp).
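A small sketch of the overwrite=false insert semantics under discussion (table name and value format are illustrative):

#include <wiredtiger.h>

static int
insert_no_overwrite(WT_SESSION *session, int64_t record_id, const char *value)
{
    WT_CURSOR *cursor;
    int ret;

    /* With overwrite=false, insert fails if the key is already present. */
    if ((ret = session->open_cursor(
           session, "table:collection-0", NULL, "overwrite=false", &cursor)) != 0)
        return (ret);
    cursor->set_key(cursor, record_id);
    cursor->set_value(cursor, value);
    /* Today this returns WT_DUPLICATE_KEY only when a visible (non-deleted) value
     * exists; the proposal above would also return a conflict while the key's
     * tombstone is not yet globally visible. */
    ret = cursor->insert(cursor);
    cursor->close(cursor);
    return (ret);
}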

Let us know which way you would like to go.

Comment by Daniel Gottlieb (Inactive) [ 10/Aug/21 ]

An aside: the reason for using the read timestamp is subtle and clever. This is why I find the other option of getting the (actual) largest key in the table preferable. But more important right now would be expediency. We can always do one solution then switch to the other.

In my mind, waiting a week for the preferable solution is fine, but if it were to drag out for more than a sprint, I'd opt for the quicker fix.

Comment by Louis Williams [ 10/Aug/21 ]

chenhao.qu, when we reconstruct a prepared insert, we don't read RID 2 before inserting. We will just blindly insert RID 2 with overwrite=true. In this example, that will write conflict because we're reading at timestamp 10 and there is an insert at TS 20. We then just retry (after incrementing the RID counter) and blindly insert until the write succeeds with RID 3.
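A simplified sketch of that retry loop; it commits immediately rather than preparing, and the helper and counter handling are illustrative, not the actual RecordStore code:

#include <wiredtiger.h>

static int
blind_insert_with_retry(WT_SESSION *session, WT_CURSOR *cursor, int64_t *next_rid,
  const char *value)
{
    int ret;

    for (;;) {
        session->begin_transaction(session, NULL);
        cursor->set_key(cursor, *next_rid);
        cursor->set_value(cursor, value);       /* Blind insert with overwrite=true. */
        if ((ret = cursor->insert(cursor)) == 0)
            return (session->commit_transaction(session, NULL));
        session->rollback_transaction(session, NULL);
        if (ret != WT_ROLLBACK)                 /* Anything but a write conflict is fatal. */
            return (ret);
        ++*next_rid;                            /* Bump the RID counter and retry. */
    }
}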

Comment by Chenhao Qu [ 10/Aug/21 ]

daniel.gottlieb We had another look at your proposed solution. We don't understand how it works. Taking your example, suppose we have done the following operations before restart:

Set OldestTimestamp 1
Insert RecordId(1) -> A at TimeStamp(10)
Insert RID(2) -> B at TS(20)
Delete RID(2) (B) at TS(30)

From my understanding, you will first use timestamp 0 to read the largest key; you should get 1 in this case and decide that the next entry should use key 2.
Then suppose you want to reconstruct a prepared update at timestamp 10: you read key 2 at timestamp 10 again. If that returns a key, then you know there is a deleted key, so you need to use key 3 instead of 2 for the new record. However, in your example, if you read at timestamp 10, you will still get not-found, because the key only exists from timestamp 20 onwards.
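For reference, a small sketch of that timestamped read (timestamps in WT configuration strings are hexadecimal, so 10 is written as "a"; table name and key format are illustrative):

#include <wiredtiger.h>

static int
read_key2_at_ts10(WT_SESSION *session)
{
    WT_CURSOR *cursor;
    int ret;

    if ((ret = session->open_cursor(session, "table:collection-0", NULL, NULL, &cursor)) != 0)
        return (ret);
    session->begin_transaction(session, "read_timestamp=a");
    cursor->set_key(cursor, (int64_t)2);
    ret = cursor->search(cursor);   /* Expected: WT_NOTFOUND; key 2 only exists from TS 20. */
    session->rollback_transaction(session, NULL);
    cursor->close(cursor);
    return (ret);
}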

Is my understanding correct? If it is, your solution does not seem to work. If that is not what you mean, can you walk me through your example step by step and explain how everything is designed to work?

Comment by Louis Williams [ 05/Aug/21 ]

Thanks for the suggestions, chenhao.qu. I'll experiment with this.

Comment by Chenhao Qu [ 05/Aug/21 ]

daniel.gottlieb I have a new idea to solve this. In the WiredTiger cursor we introduced a flag, WT_CURSTD_IGNORE_TOMBSTONE. If you set that flag, the cursor will ignore the deletion of RID 2 when you read the data. Does that solve your issue of finding the largest record id? If you think that can work, we can discuss adding a cursor config that sets this flag when you open the cursor.

Comment by Louis Williams [ 03/Aug/21 ]

daniel.gottlieb, this makes sense to me. I just tested this algorithm with the reproducer and this seems to work. I opened WT-7918 to track the WiredTiger request.

Comment by Daniel Gottlieb (Inactive) [ 03/Aug/21 ]

Say we prepared an insert at TS 10 that is then reconstructed after a restart. We use our new proposal to read at TS 10; this initializes the highest seen RID to 1,

That's not the algorithm I was envisioning. I was thinking:

  • On startup, initialize the RecordId for each collection as we do today, by using a non-timestamped read + prev'ing the cursor (I believe the lazy loading is inconsequential). This gets the highest record id in use at "now".
  • When reconstructing prepared transactions, assign a read timestamp in an effort to force a conflict when the prepare range overlaps with a delete that's timestamped between the oldest and stable timestamps (when applicable).

Using your example, the next record id for an insert would be initialized to 4. IIUC, that would be a suitable key to prepare the insert with.

Does the more detailed algorithm still seem problematic?

Comment by Louis Williams [ 03/Aug/21 ]

daniel.gottlieb, I've been thinking over this problem, and I think the proposed fix just defers the same problem past the prepared transaction reconstruction phase. For example, say there is a collection with update chains like this:

RID 1: A @ TS(5) -> Tombstone @ TS(15)
RID 2: deleted
RID 3: B @ TS(20)

Say we prepared an insert at TS 10 that is then reconstructed after a restart. We use our new proposal to read at TS 10; this initializes the highest seen RID to 1, and the new record is then inserted at RID 2. If a new operation then inserts a record (read TS >= 20), it will insert at RID 3, which conflicts with an existing record. Because MongoDB configures cursors with "overwrite=true", this overwrites the existing record. I can reproduce this failure with a slightly modified test: server-58409.js.

Comment by Daniel Gottlieb (Inactive) [ 02/Aug/21 ]

louis.williams and I were unable to formulate a non-WT change to get around the assertion. However, we have the following patch, which suppresses that assertion when using the existing WT flag for rounding a prepared transaction's timestamp up to the oldest timestamp (only used when reconstructing prepared transactions):

diff --git a/src/third_party/wiredtiger/src/txn/txn_timestamp.c b/src/third_party/wiredtiger/src/txn/txn_timestamp.c
index 9fc79a78d7..b290fee7dd 100644
--- a/src/third_party/wiredtiger/src/txn/txn_timestamp.c
+++ b/src/third_party/wiredtiger/src/txn/txn_timestamp.c
@@ -714,7 +714,8 @@ __wt_txn_set_prepare_timestamp(WT_SESSION_IMPL *session, wt_timestamp_t prepare_
         WT_RET_MSG(session, EINVAL,
           "commit timestamp should not have been set before the prepare timestamp");
 
-    WT_RET(__txn_assert_after_reads(session, "prepare", prepare_ts));
+    if (!F_ISSET(txn, WT_TXN_TS_ROUND_PREPARED))
+        WT_RET(__txn_assert_after_reads(session, "prepare", prepare_ts));
 
     /*
      * Check whether the prepare timestamp is less than the oldest timestamp.

Comment by Louis Williams [ 30/Jul/21 ]

Option 3 doesn't work because of the WiredTiger assertion: "prepare timestamp X must be greater than the latest active read timestamp Y"

Comment by Daniel Gottlieb (Inactive) [ 10/Jul/21 ]

A couple less attractive solutions followed by a clever trick?

  1. Have WT expose an API, e.g: WT_SESSION::getLargestKeyForTable(char* tableName, vararg *keyOutParam)
  2. At startup, instead of "get largest record id in current (i.e: stable timestamp) snapshot" do:

    for each timestamp between oldest timestamp and stable timestamp:
      ret = max(ret, getLargestRecordIdAt(timestamp));
    

  3. When reconstructing prepared transactions at startup/after rollback, set the read timestamp to the prepared timestamp with roundToOldest=true. On WCE, retry the operation (instead of presumably crashing today). I expect the retry to naturally try inserting with the next available record id until the reconstruction succeeds.
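A sketch of option 3, assuming a prepare_ts variable and an overwrite=true cursor (illustrative only, not the server's apply path); as the comments above note, with the read timestamp equal to the prepare timestamp this currently trips the "prepare timestamp must be greater than the latest active read timestamp" assertion unless it is suppressed as in the patch above:

#include <inttypes.h>
#include <stdio.h>
#include <wiredtiger.h>

static int
reconstruct_prepared_insert(WT_SESSION *session, WT_CURSOR *cursor, uint64_t prepare_ts,
  int64_t rid, const char *value)
{
    char cfg[128];
    int ret;

    /* Read at the prepare timestamp, rounding up to the oldest timestamp if needed. */
    snprintf(cfg, sizeof(cfg),
      "read_timestamp=%" PRIx64 ",roundup_timestamps=(prepared=true,read=true)", prepare_ts);
    session->begin_transaction(session, cfg);

    cursor->set_key(cursor, rid);
    cursor->set_value(cursor, value);
    if ((ret = cursor->insert(cursor)) != 0) {
        /* WT_ROLLBACK is the conflict this option relies on: the caller retries
         * with the next RecordId. */
        session->rollback_transaction(session, NULL);
        return (ret);
    }
    snprintf(cfg, sizeof(cfg), "prepare_timestamp=%" PRIx64, prepare_ts);
    return (session->prepare_transaction(session, cfg));
}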