[SERVER-58409] Startup RecordId initialization is flawed with durable history and reconstructing prepared transactions Created: 10/Jul/21 Updated: 29/Oct/23 Resolved: 29/Oct/21
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 5.1.0, 4.4.6, 5.0.0-rc8 |
| Fix Version/s: | 5.2.0, 5.1.2, 5.0.6 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Daniel Gottlieb (Inactive) | Assignee: | Louis Williams |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v5.1, v5.0, v4.4 |
| Sprint: | Execution Team 2021-08-09, Execution Team 2021-08-23, Execution Team 2021-09-06, Execution Team 2021-09-20, Execution Team 2021-10-18, Execution Team 2021-11-01, Execution Team 2021-11-15 |
| Participants: | |
| Linked BF Score: | 135 |
| Description |
RecordIds in MDB are generated with an auto-incrementing integer. Initialization on a restart, or after rollback, opens a cursor (with data at the stable timestamp) and does one reverse step to get the largest id in use for a document. Now consider the following sequence with durable history:

- Insert RID(1) (at some timestamp no later than the stable timestamp)
- Insert RID(2) @ TS(20)
- Delete RID(2) @ TS(30)

With the stable timestamp at or beyond 30, the reverse step sees only RID(1). If we were to restart and initialize the next record id, we'd start issuing new documents at RID(2). Normally this is fine: any new replicated user write must be generated with a timestamp larger than 30, so the update chain for RID(2) will remain valid. However, when reconstructing prepared transactions, the prepare timestamp (and thus any following commit timestamp, but not the durable timestamp) may be arbitrarily old. In this example, after initializing the next RID to 2, if we were to reconstruct a prepared transaction from TS(10) that performs an insert on this collection, we'd get the following update chain for RID(2) (from oldest to newest):

- Insert @ TS(20)
- Delete @ TS(30)
- Prepared Insert @ TS(10)

Committing the prepared insert at a value between 10 and 30 results in wrong results/inconsistent data when reading at those timestamps.
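For illustration, here is a minimal sketch of the initialization step described above, written directly against the WiredTiger C API. It is an editor's illustration, not the server's actual RecordStore code; it assumes a table created with key_format=r, and uses the fact that timestamps in WT config strings are hexadecimal.

```cpp
#include <wiredtiger.h>

#include <cinttypes>
#include <cstdio>

// Sketch: initialize the next RecordId by reading at the stable timestamp
// and taking one reverse step to find the largest visible id.
uint64_t nextRecordId(WT_SESSION* session, const char* uri, uint64_t stableTs) {
    char cfg[64];
    std::snprintf(cfg, sizeof(cfg), "read_timestamp=%" PRIx64, stableTs);
    session->begin_transaction(session, cfg);

    WT_CURSOR* cursor;
    session->open_cursor(session, uri, nullptr, nullptr, &cursor);

    uint64_t largest = 0;  // Empty table: the first insert gets RID(1).
    if (cursor->prev(cursor) == 0) {
        // The reverse step lands on the largest key *visible at stableTs*.
        // The tombstoned RID(2) from the example above is skipped, which is
        // exactly the flaw this ticket describes.
        cursor->get_key(cursor, &largest);
    }
    cursor->close(cursor);
    session->rollback_transaction(session, nullptr);
    return largest + 1;
}
```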
| Comments |
| Comment by Githook User [ 01/Dec/21 ] |

Author: Louis Williams <louis.williams@mongodb.com>
Message: (cherry picked from commit c4f813e20892a5bd22d0d7b109542c2b242d4d5e)
| Comment by Githook User [ 29/Oct/21 ] |

Author: Louis Williams <louis.williams@mongodb.com>
Message:
| Comment by Louis Williams [ 27/Aug/21 ] |

I discussed with vamsi.krishna and keith.smith and we agreed on the solution in
| Comment by Louis Williams [ 23/Aug/21 ] |
I think you're mostly describing the correct scenario, but I want to clarify that this prepared transaction is not yet committed and is not in the process of being committed. The prepared transaction is being "reconstructed", but we have not received an oplog entry to commit yet.
I don't understand why this works. If we retry reconstructing the prepared transaction without a read timestamp, then we will just encounter the same bug that this ticket aims to fix.

Presently, my proposed code review changes almost 300 lines of code, and it still hasn't correctly solved this problem. It makes a pretty complex change to our code to dance around the fact that what we really need is an API to return the last key in the table, regardless of visibility. I would like to revisit this idea.

So to revisit the previous discussion, I'm not certain that the "overwrite=true" case is guaranteed to work in all cases, and I feel that this is really at the wrong entry point to WiredTiger. The ideal solution would be in the read path, not the write path. The other option would be my proposed alternative of adding another assertion suppression in WiredTiger. I've filed
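(For illustration, a sketch of the read-path API shape being requested here: position a cursor on the table's last key while ignoring visibility. WiredTiger later exposed a WT_CURSOR::largest_key method along these lines; the exact call below is an assumption of the sketch, not the code that shipped.)

```cpp
#include <wiredtiger.h>

// Sketch: find the largest RecordId regardless of timestamp visibility.
uint64_t largestRecordIdIgnoringVisibility(WT_SESSION* session, const char* uri) {
    WT_CURSOR* cursor;
    session->open_cursor(session, uri, nullptr, nullptr, &cursor);

    uint64_t largest = 0;
    // Unlike prev() under a read timestamp, this also sees tombstoned and
    // not-yet-visible keys, so a RecordId counter seeded from it can never
    // collide with durable history.
    if (cursor->largest_key(cursor) == 0) {
        cursor->get_key(cursor, &largest);
    }
    cursor->close(cursor);
    return largest;
}
```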
| Comment by Louis Williams [ 20/Aug/21 ] |
Update: I've run into some issues with the way we timestamp multikey writes during prepared transaction reconstruction. Specifically, I'm hitting this error: "commit timestamp _ must be greater than the latest active read timestamp". I am trying to think through workarounds to this problem, but we are essentially reading at the stable timestamp and committing at the stable timestamp; I don't think there is a "safe" timestamp to use.

More specifically: vamsi.krishna/chenhao.qu, would it be reasonable to add an additional suppression for this assertion where "roundup_timestamps=(prepared=true)"? One alternative I see is going back to using ghost timestamps (using timestamps further in the past), which are more problematic and incorrect. It would not be possible to just generate a new OpTime because the recovering node is not a primary.
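(For illustration, a sketch of the constraint being hit, written against the WiredTiger C API: a transaction that reads at the stable timestamp cannot also be given the stable timestamp as its commit timestamp.)

```cpp
#include <wiredtiger.h>

#include <cinttypes>
#include <cstdio>

// Sketch: reading and committing at the same (stable) timestamp is rejected.
void replayMultikeyWrite(WT_SESSION* session, uint64_t stableTs) {
    char cfg[64];
    std::snprintf(cfg, sizeof(cfg), "read_timestamp=%" PRIx64, stableTs);
    session->begin_transaction(session, cfg);

    // ... apply the multikey write here ...

    std::snprintf(cfg, sizeof(cfg), "commit_timestamp=%" PRIx64, stableTs);
    // Fails: the commit timestamp must be greater than the latest active
    // read timestamp, and this transaction itself reads at stableTs.
    if (session->timestamp_transaction(session, cfg) != 0) {
        session->rollback_transaction(session, nullptr);
        return;
    }
    session->commit_transaction(session, nullptr);
}
```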
| Comment by Louis Williams [ 11/Aug/21 ] |

daniel.gottlieb, agreed, that would be a problem if we defaulted all writes to overwrite=false and also used this new configuration. We could use this configuration only when applying prepared transactions; however, I have some concerns about how we would work that into our current RecordStore API. It would be a messier change for us than just being able to suppress the assertion in WT.
| Comment by Daniel Gottlieb (Inactive) [ 11/Aug/21 ] |

Re:

We always hit that error. That assertion will always trigger when a single WT transaction reads at X and then tries to prepare at X.

The algorithm is incorrect if we choose a different (or omit the) read timestamp. We may fail to generate a write conflict and get back into the same situation of creating an out-of-order update chain.

Re:

From what I can tell, our typical record stores use overwrite=true. I couldn't say whether we actually have to use overwrite=true.
Assuming we can and want to change these inserts to use overwrite=false, this sounds like it would cause conflicts even when we choose a good timestamp (i.e., false positives for all of oplog recovery). That could be a potentially unbounded cost that everyone would have to pay. Does my interpretation match your understanding, louis.williams?
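(For reference, a sketch of the two insert modes being compared, against the WiredTiger C API; it assumes string values (value_format=S), and insertRecord is an illustrative helper, not server code.)

```cpp
#include <wiredtiger.h>

#include <cstdint>

// Sketch: with overwrite=true an insert blindly replaces any existing value;
// with overwrite=false it returns WT_DUPLICATE_KEY for a visible duplicate.
int insertRecord(WT_SESSION* session, const char* uri, uint64_t rid,
                 const char* doc, bool overwrite) {
    WT_CURSOR* cursor;
    session->open_cursor(session, uri, nullptr,
                         overwrite ? "overwrite=true" : "overwrite=false",
                         &cursor);
    cursor->set_key(cursor, rid);
    cursor->set_value(cursor, doc);
    // overwrite=false surfaces a RecordId collision as WT_DUPLICATE_KEY;
    // overwrite=true returns 0 and silently clobbers the existing record.
    int ret = cursor->insert(cursor);
    cursor->close(cursor);
    return ret;
}
```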
| Comment by Chenhao Qu [ 11/Aug/21 ] |

daniel.gottlieb louis.williams We have reviewed the choices again. We now have two new proposals:

Let us know which way you would like to go.
| Comment by Daniel Gottlieb (Inactive) [ 10/Aug/21 ] |

An aside: the reason for using the read timestamp is subtle and clever. This is why I find the other option of getting the (actual) largest key in the table preferable. But more important right now would be expediency. We can always do one solution and then switch to the other. In my mind, waiting a week for the preferable solution is fine, but if it were to drag out for more than a sprint, I'd opt for the quicker fix.
| Comment by Louis Williams [ 10/Aug/21 ] |

chenhao.qu, when we reconstruct a prepared insert, we don't read RID 2 before inserting. We just blindly insert RID 2 with overwrite=true. In this example, that will write conflict because we're reading at timestamp 10 and there is an insert at TS 20. We then retry (after incrementing the RID counter) and blindly insert until the write succeeds, with RID 3.
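(A schematic sketch of the behavior described in this comment; the real code retries the whole write unit of work on a conflict, and nextRid below is a hypothetical stand-in for the collection's in-memory RecordId counter.)

```cpp
#include <wiredtiger.h>

#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the collection's next-RecordId counter.
std::atomic<uint64_t> nextRid{2};

// Sketch: blindly insert at the next RecordId; on a write conflict, bump
// the counter and try again.
void reconstructPreparedInsert(WT_CURSOR* cursor, const char* doc) {
    for (;;) {
        cursor->set_key(cursor, nextRid.fetch_add(1));
        cursor->set_value(cursor, doc);
        // Reading at TS(10), the existing insert of RID 2 at TS(20) is not
        // visible, but writing into its update chain raises WT_ROLLBACK (a
        // write conflict), so the insert is retried with the next RecordId
        // until it succeeds (here, with RID 3).
        if (cursor->insert(cursor) != WT_ROLLBACK)
            return;
    }
}
```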
| Comment by Chenhao Qu [ 10/Aug/21 ] |

daniel.gottlieb We had another look at your proposed solution, and we don't understand how it works. Taking your example, suppose we have done the following operations before restart:

From my understanding, you will first use timestamp 0 to read the largest key; you should get 1 in this case and decide that the next entry should use key 2. Is my understanding correct? If it is, your solution does not seem to work. If that is not what you mean, can you walk me through, step by step, how your example is designed to work?
| Comment by Louis Williams [ 05/Aug/21 ] |

Thanks for the suggestions, chenhao.qu. I'll experiment with this.
| Comment by Chenhao Qu [ 05/Aug/21 ] |

daniel.gottlieb I have a new idea to solve this. In the WiredTiger cursor, we have introduced a flag, WT_CURSTD_IGNORE_TOMBSTONE. If you set that flag, it will ignore the deletion of RID 2 when you read the data. Does that solve your issue of finding the largest record id? If you think that can work, we can discuss adding a cursor config to set that flag when you open the cursor.
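(A sketch of how this proposal might look from the caller's side. The cursor config string "ignore_tombstone=true" is hypothetical; as the comment says, such a config would have to be added to set the internal WT_CURSTD_IGNORE_TOMBSTONE flag.)

```cpp
#include <wiredtiger.h>

#include <cstdint>

// Sketch: seed the RecordId counter from a cursor that returns deleted keys.
uint64_t largestRidIncludingTombstones(WT_SESSION* session, const char* uri) {
    WT_CURSOR* cursor;
    // Hypothetical config that would set WT_CURSTD_IGNORE_TOMBSTONE: the
    // cursor would then return the tombstoned RID(2) instead of skipping it.
    session->open_cursor(session, uri, nullptr, "ignore_tombstone=true", &cursor);

    uint64_t largest = 0;
    if (cursor->prev(cursor) == 0) {
        cursor->get_key(cursor, &largest);
    }
    cursor->close(cursor);
    return largest;
}
```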
| Comment by Louis Williams [ 03/Aug/21 ] |

daniel.gottlieb, this makes sense to me. I just tested this algorithm with the reproducer, and it seems to work. I opened
| Comment by Daniel Gottlieb (Inactive) [ 03/Aug/21 ] |
That's not the algorithm I was envisioning. I was thinking:
Using your example, the next record id for an insert would be initialized to 4. IIUC, that would be a suitable key to prepare the insert with. Does the more detailed algorithm still seem problematic?
| Comment by Louis Williams [ 03/Aug/21 ] |

daniel.gottlieb, I've been thinking over this problem, and I think the proposed fix just defers the same problem past the prepared transaction reconstruction phase. For example, say there is a collection with an update chain like this:

RID 1: A @ TS(5) -> Tombstone @ TS(15)

Say we prepared an insert at TS 10 that is then reconstructed after a restart. We use our new proposal to read at TS 10; this initializes the highest seen RID to 1, and then inserts the new record at RID 2. If a new operation inserts a record (read TS >= 20), it will insert a new record at RID 3. This conflicts with an existing record. Because MongoDB configures cursors with "overwrite=true", this overwrites an existing record. I can reproduce this failure with a slightly modified test: server-58409.js
| Comment by Daniel Gottlieb (Inactive) [ 02/Aug/21 ] |

louis.williams and I were unable to formulate a non-WT change to get around the assertion. However, we have the following patch that suppresses the assertion when using the existing WT flag for rounding up a prepared transaction's timestamp to the oldest timestamp (only used for reconstructing prepared transactions):
| Comment by Louis Williams [ 30/Jul/21 ] |

Option 3 doesn't work because of the WiredTiger assertion: "prepare timestamp X must be greater than the latest active read timestamp Y"
| Comment by Daniel Gottlieb (Inactive) [ 10/Jul/21 ] |

A couple of less attractive solutions, followed by a clever trick?