[SERVER-63322] Reading between commit_timestamp and durable_timestamp can produce inconsistency Created: 04/Feb/22  Updated: 27/Oct/23  Resolved: 24/Feb/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Keith Bostic (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by WT-8747 Reading between commit_timestamp and ... Closed
Related
related to SERVER-63615 Avoid deadlocks in wiredtiger_prepare... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Sprint: Execution Team 2022-02-21, Execution Team 2022-03-07
Participants:

 Description   

There is a problem with prepared transactions whose durable timestamp is greater than their commit timestamp: another transaction can read their updates at a timestamp between the two, commit before them, and then be checkpointed without them. A crash at that point makes the first transaction disappear while preserving the second, which depended on it.

See WT-8747 for additional information and examples.

This is a potential data consistency error.
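For concreteness, here is a minimal sketch of the anomaly against the WiredTiger C API (error checking omitted; the table and timestamp values are illustrative, and WT parses timestamp configuration strings as hexadecimal):

    #include <stdio.h>
    #include <wiredtiger.h>

    int
    main(void)
    {
        WT_CONNECTION *conn;
        WT_SESSION *s1, *s2;
        WT_CURSOR *c1, *c2;

        /* Setup; error checking omitted for brevity. */
        wiredtiger_open("WT_HOME", NULL, "create", &conn);
        conn->open_session(conn, NULL, NULL, &s1);
        conn->open_session(conn, NULL, NULL, &s2);
        s1->create(s1, "table:t", "key_format=S,value_format=S");

        /* Txn 1: prepared, committed with durable (30) > commit (10). */
        s1->begin_transaction(s1, NULL);
        s1->open_cursor(s1, "table:t", NULL, NULL, &c1);
        c1->set_key(c1, "A");
        c1->set_value(c1, "v1");
        c1->insert(c1);
        s1->prepare_transaction(s1, "prepare_timestamp=10");
        s1->commit_transaction(s1,
            "commit_timestamp=10,durable_timestamp=30");

        /* Txn 2: reads A at 20 -- between Txn 1's commit and durable
         * timestamps -- then commits entirely below Txn 1's durable. */
        s2->begin_transaction(s2, "read_timestamp=20");
        s2->open_cursor(s2, "table:t", NULL, NULL, &c2);
        c2->set_key(c2, "A");
        printf("search A: %d\n", c2->search(c2)); /* sees Txn 1's write */
        c2->set_key(c2, "B");
        c2->set_value(c2, "v2");
        c2->insert(c2);
        s2->commit_transaction(s2, "commit_timestamp=25");

        /* A checkpoint at stable 25 persists Txn 2 but not Txn 1, so a
         * crash here loses Txn 1 while keeping Txn 2, which read from it. */
        conn->set_timestamp(conn, "stable_timestamp=25");
        s1->checkpoint(s1, NULL);

        return (conn->close(conn, NULL));
    }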

The ideal fix for WiredTiger is to return ::WT_ROLLBACK to the reader, but that requires the reader to handle transaction rollback, which may not be possible for the server. We considered returning ::WT_NOTFOUND in this case instead (although that is problematic in its own way), but patch builds returning the two different errors show roughly similar numbers of failures.

The patch build returning ::WT_ROLLBACK is here: https://evergreen.mongodb.com/version/61fc61400305b973ad2bf953?redirect_spruce_users=true

The patch build returning ::WT_NOTFOUND is here: https://evergreen.mongodb.com/version/61fd6248a4cf472d4bc3243c?redirect_spruce_users=true
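For illustration, a rough sketch of what each return code would demand of a reading transaction (the cursor and session are assumed to already exist; this is not the server's actual retry logic):

    int ret = cursor->search(cursor);
    if (ret == WT_ROLLBACK) {
        /* WT_ROLLBACK: the entire reading transaction must be unwound
         * and retried from the top, repeating every earlier read. */
        session->rollback_transaction(session, NULL);
        /* ... caller re-begins the transaction and retries ... */
    } else if (ret == WT_NOTFOUND) {
        /* WT_NOTFOUND: the key simply appears absent, hiding the
         * conflict from the caller -- the problematic alternative. */
    }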

Can someone please help us investigate this problem? – Thanks!

cc: geert.bosch, louis.williams



 Comments   
Comment by Henrik Edin [ 24/Feb/22 ]

Closed this ticket per your comment, keith.bostic. Reopen if needed.

Comment by Keith Bostic (Inactive) [ 24/Feb/22 ]

jordi.olivares-provencio, I think SERVER-63322 can be closed for now. We can always reopen it if necessary, and there's no reason for it to clog up your backlog without a clear expectation that there is work to do here. – Thank you!

Comment by Jordi Olivares Provencio [ 24/Feb/22 ]

Moving this back to the backlog in TODO, as we need to wait on WT's decision.

Comment by David Holland [ 16/Feb/22 ]

Re reads, interesting. I was working from the following premises:

  • Because WT doesn't track reads in detail, there's a rule against preparing (or committing) at or before any read timestamp that's ever been used.
  • Consequently, if you read at time T (T > stable), you cannot prepare or commit anything further at times <= T, and you might as well move stable up before reading. It is allowed to do that while prepared transactions are still resolving, after all, because reads can't touch them.
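A sketch of that premise with the WiredTiger C API, assuming a connection conn and a session (hex timestamps, values illustrative):

    /* Advance stable to the intended read time first, then read at
     * stable; WT rejects later prepares at or below stable, so the
     * read cannot race with them. */
    conn->set_timestamp(conn, "stable_timestamp=20");
    session->begin_transaction(session, "read_timestamp=20");
    /* ... reads here cannot be invalidated by prepares at times <= 20 ... */
    session->rollback_transaction(session, NULL); /* read-only: discard */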

I'm guessing there are distributed coordination reasons why it's not that simple, and I suppose in particular you can't move stable up until every node that might take over as coordinator agrees on it. (Probably that's the wrong terminology, but something like that...) Which means you specifically want to read from committed but not-yet-durable transactions, optimistically. (And unlike WT, which doesn't have any support for dependent transactions, you're prepared to deal with the consequences if the things you read later end up aborting.) So the proposed restriction sits right across an important optimization.

I need to chew on this, but I suspect there's a solution we can all live with...

Comment by Daniel Gottlieb (Inactive) [ 16/Feb/22 ]

> That interleaving should not cause problems (unless you never move stable forward) because the new restriction only applies to writes by transactions whose durable timestamp is currently > stable.

Thanks for clarifying! I wasn't aware the stable timestamp was also being compared against to allow a document version to be read. I'll retract the perpetuity claim, as the stable timestamp does have to advance for prepared transactions to make progress and return acknowledgment to the client.

> Since reading beyond stable is somewhat dubious anyhow I would expect this to be sufficient.

I disagree. MongoDB does that as part of speculative (majority) committed reads. For those, the waiting happens when the application commits its transaction, to guarantee the data it reads/writes will not be rolled back. And because WT does not expose the timestamp information of the data read, we make a worst-case guess for read-only transactions.

There are some reasons to prefer speculative majority reads:

  1. By waiting at the end, there's less latency overall as the waiting becomes "pipelined" with the conversational reads/writes an application is doing.
  2. Having writers use a "stale" read_timestamp results in unnecessary write conflict errors. For sharded transactions, the likelihood and cost of a write conflict are significant.
  3. A more perverse case of (2) is that a single application session may not see its own writes. Even though its Txn1 is "majority" committed, Txn2 may not see the effects of Txn1 and thus write-conflict with itself! The application receives acknowledgement after the decision was committed, but not necessarily after the data writes are considered committed.

> Ordering restrictions aren't enough to avoid the problem, because the two transactions involved can have disjoint write sets; then no one key is accessed or updated out of order. I just posted a concrete example in WT-8747 if anyone else is interested.

Noted. Agreed that's a better place to understand the WT interleaving.

Comment by David Holland [ 16/Feb/22 ]

That interleaving should not cause problems (unless you never move stable forward) because the new restriction only applies to writes by transactions whose durable timestamp is currently > stable. So any timestamp before stable (and before any prepare still in progress) should be accepted. Since reading beyond stable is somewhat dubious anyhow I would expect this to be sufficient.

My changes failed to honor ignore_prepare; that's a bug, will fix. But please check that by setting it you aren't buying the problem back, since ignore_prepare is inherently somewhat dangerous.

Ordering restrictions aren't enough to avoid the problem, because the two transactions involved can have disjoint write sets; then no one key is accessed or updated out of order. I just posted a concrete example in WT-8747 if anyone else is interested.

Comment by Daniel Gottlieb (Inactive) [ 15/Feb/22 ]

jordi.olivares-provencio, just because dbHash was the only problem our testing showed and it passes in ignore_prepare doesn't mean the proposed WT change provides semantics we guarantee today. Consider an application that inserts 2 documents into a collection and updates them with prepared transactions with the following interleaving:

| Writer 1                           | Writer 2                          |
|------------------------------------+-----------------------------------|
| BeginTxn                           |                                   |
| Update A                           |                                   |
| Prepare 10                         |                                   |
|                                    | BeginTxn                          |
|                                    | Update B                          |
|                                    | Prepare 20                        |
| Commit :durableTs 30 :commitTs 10  |                                   |
| BeginTxn                           |                                   |
| Update A                           |                                   |
| Prepare 40                         |                                   |
|                                    | Commit :durableTs 50 :commitTs 20 |
|                                    | BeginTxn                          |
|                                    | Update B                          |
|                                    | Prepare 60                        |

If we extend that interleaving in perpetuity, every timestamp a reader could choose in order to see both A and B falls inside some WT_UPDATE's commitTs -> durableTs range on either A or B. The collection would be completely unavailable to those readers.

The claim that the commitTs -> durableTs window is problematic relies on the premise that the application (i.e., MongoDB) is breaking a contract. If we're in a world today where we are breaking that contract in some places, we should fix those. If we can't fix those, the proposed solution at least has legs. But I would still flag it as a major change in functionality that would likely require a lot of approval to push through.
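Expressed against the WiredTiger C API, one round of that interleaving looks roughly like this (sessions s1/s2 assumed, the updates themselves elided, hex timestamps illustrative):

    /* Writer 1: update A, prepare 10, commit 10 / durable 30. */
    s1->begin_transaction(s1, NULL);
    /* ... update A ... */
    s1->prepare_transaction(s1, "prepare_timestamp=10");
    s1->commit_transaction(s1, "commit_timestamp=10,durable_timestamp=30");

    /* Writer 2: update B, prepare 20, commit 20 / durable 50. */
    s2->begin_transaction(s2, NULL);
    /* ... update B ... */
    s2->prepare_transaction(s2, "prepare_timestamp=20");
    s2->commit_transaction(s2, "commit_timestamp=20,durable_timestamp=50");

    /* Under the proposed change, reads between 10 and 30 fail on A and
     * reads between 20 and 50 fail on B; continued rounds leave no read
     * timestamp at which both A and B are visible. */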

Comment by Jordi Olivares Provencio [ 15/Feb/22 ]

Waiting until WT patches the requested functionality of respecting ignore_prepare so that it doesn't return prepare conflicts.

Comment by Jordi Olivares Provencio [ 15/Feb/22 ]

louis.williams daniel.gottlieb On a second review of the code, I believe that the timeout might not even be necessary as long as WT can accurately ignore prepared transactions when requested by the transaction.

As I mentioned in a previous comment, the dbHash function was the only one at issue, with this causing a deadlock. I now realise the deadlock shouldn't have happened in the first place, because dbHash explicitly ignores prepare conflicts too, so the WT_PREPARE_CONFLICT should never have been returned.

keith.bostic I believe that as long as WT respects the ignore_prepare option in a transaction, we shouldn't have to make modifications in our code.
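As a sketch of the expectation (session assumed; ignore_prepare is a real begin_transaction option, the rest is illustrative):

    /* A transaction begun with ignore_prepare=true reads as though
     * prepared-but-unresolved updates were not there, so cursor reads
     * inside it should never return WT_PREPARE_CONFLICT. */
    session->begin_transaction(session,
        "ignore_prepare=true,read_timestamp=20");
    /* ... dbHash-style cursor scan here ... */
    session->rollback_transaction(session, NULL);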

Comment by Louis Williams [ 15/Feb/22 ]

daniel.gottlieb this is a good point. The sequence described in WT-8747 writes with a durable timestamp of 25 after an update with a durable timestamp of 30, which I would expect to trigger an assertion about out-of-order timestamps.

Comment by Daniel Gottlieb (Inactive) [ 14/Feb/22 ]

I don't think it's viable to make readers unavailable if their read timestamp is between the durable and commit timestamps of a WT_UPDATE. Our prepared transaction system is built on the premise that readers can choose any timestamp without regard to falling on some implementation-detail boundary.

To write down some claims about how MDB uses commit + durable timestamps on adjacent WT_UPDATEs in an update chain, where Txn_first happens before Txn_second:

  1. CommitTS_first <= DurableTS_first
  2. CommitTS_second <= DurableTS_second
  3. DurableTS_first <= CommitTS_second
  4. By consequence of 2 and 3: DurableTS_first <= DurableTS_second

I expect the write_timestamp_usage=ordered setting to fire if any of those assertions fail.
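For reference, a sketch of turning that checking on at table-creation time (the table name is hypothetical, and in my understanding the assertions fire only in diagnostic builds):

    /* Ask WT to assert that timestamps on each update chain never move
     * backward, which should catch violations of claims 1-4 above. */
    session->create(session, "table:example",
        "key_format=S,value_format=S,"
        "write_timestamp_usage=ordered,assert=(write_timestamp=on)");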

Comment by Jordi Olivares Provencio [ 14/Feb/22 ]

Hi again keith.bostic, we discussed it internally and it seems this is a fix we can make on our end. We'll preemptively commit it into our codebase in preparation for your patch.

In parallel, we discovered a second error present in our tests. We have logic on our end that sets the ignore_prepare option on the transaction, but it seems to be ignored in the current patch. This means that even though ignore_prepare specifies that WT_PREPARE_CONFLICT should not be an expected outcome, the transaction is still receiving it, leading to issues. Would it be possible to fix this in your patch?

Comment by Keith Bostic (Inactive) [ 11/Feb/22 ]

jordi.olivares-provencio, thank you!

Is this a reasonable "real fix" for MDB Server? To the extent that ::WT_PREPARE_CONFLICT is resolved by a housekeeping thread advancing stable, it seems reasonable there would not necessarily be another concurrent transaction running.

Comment by Jordi Olivares Provencio [ 11/Feb/22 ]

Adding a 1 millisecond timeout to the lock seems to fix a great deal of these issues as can be seen here: https://spruce.mongodb.com/version/62064f15d6d80a25a22f5dda/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

Comment by Jordi Olivares Provencio [ 11/Feb/22 ]

Going by the error logs and the core dump, I believe the concurrent dbHash command execution is to blame here. Looking at the core dumps, there's a deadlock occurring in the DBHashCmd::_hashCollection method while iterating the cursor. The method on our end checks whether iterating the cursor returns WT_PREPARE_CONFLICT and, if so, waits until another WUOW commits or aborts before trying again. The problem is that in these tests that never happens, as there is no other concurrent transaction, thus leading to a deadlock.
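A hypothetical sketch of the waiting pattern just described (names here are illustrative, not the actual server code):

    int ret;
    while ((ret = cursor->next(cursor)) == WT_PREPARE_CONFLICT) {
        /* Block until some other unit of work commits or aborts. With no
         * other concurrent transaction this wait never returns -- the
         * deadlock seen in DBHashCmd::_hashCollection. A bounded wait
         * (e.g., the 1 ms timeout mentioned above) breaks the cycle. */
        wait_for_prepared_txn_resolution(); /* hypothetical helper */
    }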

Comment by Keith Bostic (Inactive) [ 09/Feb/22 ]

louis.williams, we now have a WiredTiger PR that returns ::WT_PREPARE_CONFLICT on this error and we've run the patch build. It's cleaner than before, but there is still a significant amount of fallout: https://spruce.mongodb.com/version/620312875623432d870f0ea5/tasks

If we could have someone help us understand where this change is causing problems for server, that would be terrific. – Thanks!

Comment by Keith Bostic (Inactive) [ 08/Feb/22 ]

louis.williams, it's a durability problem we're reasonably confident isn't a real problem in MongoDB server. We'd like to get it out of the way to eliminate another data consistency corner case, but we don't believe there are actual problems in the server.

It's currently stalled on us rewriting our patch to use ::WT_PREPARE_CONFLICT as the failure return code and re-running the patch build to identify remaining failures; that will happen today.

Comment by Louis Williams [ 08/Feb/22 ]

keith.bostic, what is the current priority of this ticket?

Comment by Keith Bostic (Inactive) [ 08/Feb/22 ]

We've been talking this over inside the WiredTiger group, and we think a better way forward might be to return ::WT_PREPARE_CONFLICT instead of ::WT_ROLLBACK or ::WT_NOTFOUND. Preliminary testing indicates this approach leads to fewer problems, and so we're going to try implementing it for real and do some more testing, the results of which we should have ready for review in a day or so.
