[SERVER-63322] Reading between commit_timestamp and durable_timestamp can produce inconsistency Created: 04/Feb/22 Updated: 27/Oct/23 Resolved: 24/Feb/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Keith Bostic (Inactive) | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Assigned Teams: |
Storage Execution
|
||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Sprint: | Execution Team 2022-02-21, Execution Team 2022-03-07 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
There is a problem with prepared transactions whose durable timestamp is greater than their commit timestamp, which is that it's possible for another transaction to read them and commit before them, then get checkpointed without them, such that a crash causes the first transaction to disappear but not the second that depended on it. See This is a potential data consistency error. The ideal fix for WiredTiger is to return ::WT_ROLLBACK to the reader, which requires the reader to handle transaction rollback and which may not be possible for Server. We considered returning ::WT_NOTFOUND in this case (although that is problematical as well and could lead to its own problems), but patch builds returning the two different errors have roughly similar numbers of failures. The patch build returning ::WT_ROLLBACK is here: https://evergreen.mongodb.com/version/61fc61400305b973ad2bf953?redirect_spruce_users=true The patch build returning ::WT_NOTFOUND is here: https://evergreen.mongodb.com/version/61fd6248a4cf472d4bc3243c?redirect_spruce_users=true Can someone please help us investigate this problem? – Thanks! cc: geert.bosch, louis.williams |
| Comments |
| Comment by Henrik Edin [ 24/Feb/22 ] | ||||||||||||||||
|
Closed this ticket per your comment keith.bostic. Reopen if needed. | ||||||||||||||||
| Comment by Keith Bostic (Inactive) [ 24/Feb/22 ] | ||||||||||||||||
|
jordi.olivares-provencio, I think | ||||||||||||||||
| Comment by Jordi Olivares Provencio [ 24/Feb/22 ] | ||||||||||||||||
|
Moving this back to the backlog in TODO as we need to wait on WT's decision | ||||||||||||||||
| Comment by David Holland [ 16/Feb/22 ] | ||||||||||||||||
|
Re reads, interesting. I was working from the following premises:
I'm guessing there are distributed coordination reasons why it's not that simple, and I suppose in particular you can't move stable up until every node that might take over as coordinator agrees on it. (Probably that's the wrong terminology, but something like that...) Which means you specifically want to read from committed but not-yet-durable transactions, optimistically. (And unlike WT, which doesn't have any support for dependent transactions, you're prepared to deal with the consequences if the things you read later end up aborting.) So the proposed restriction sits right across an important optimization. I need to chew on this , but I suspect there's a solution we can all live with... | ||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 16/Feb/22 ] | ||||||||||||||||
Thanks for clarifying! I wasn't aware the stable timestamp was also being compared against to allow a document version to be read. I'll retract perpetuity claims as the stable timestamp does have to advance for prepared transactions to make progress and return acknowledgment to the client.
I disagree. MongoDB does that as part of speculative (majority) committed reads. For those, the waiting happens when the application commits its transaction to guarantee the data it reads/writes will not be rolled back. And because WT does not expose the timestamp information of the data read, we make a worst case guess for read-only transactions. There are some reasons to prefer speculative majority reads:
Noted. Agreed that's a better place to understand the WT interleaving. | ||||||||||||||||
| Comment by David Holland [ 16/Feb/22 ] | ||||||||||||||||
|
That interleaving should not cause problems (unless you never move stable forward) because the new restriction only applies to writes by transactions whose durable timestamp is currently > stable. So any timestamp before stable (and before any prepare still in progress) should be accepted. Since reading beyond stable is somewhat dubious anyhow I would expect this to be sufficient. My changes failed to honor ignore_prepare; that's a bug, will fix. But please check that by setting it you aren't buying the problem back Ordering restrictions aren't enough to avoid the problem, because the two transactions involved can have disjoint write sets; then no one key is accessed or updated out of order. I just posted a concrete example in | ||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 15/Feb/22 ] | ||||||||||||||||
|
jordi.olivares-provencio, just because dbhash was the only problem our testing showed and it passes in ignore_prepared doesn't mean the proposed WT change provides semantics we guarantee today. Consider an application that inserts 2 documents into a collection and updates them with prepared transactions with the following interleaving:
If we extend that interleaving in perpetuity, every timestamp a reader that wants to see both A and B can choose would fall into a WT_UPDATE's range of commitTs -> durableTs on either A or B. The collection would be completely unavailable to those readers. The premise where commitTs -> durableTs has a problematic window relies on the premise that the application (i.e: MongoDB) is breaking a contract. If we're in a world today where we are breaking that contract in some places, we should fix those. If we can't fix those, the proposed solution at least has legs. But I would still flag it as a major change in functionality that would likely require a lot of approval to push through. | ||||||||||||||||
| Comment by Jordi Olivares Provencio [ 15/Feb/22 ] | ||||||||||||||||
|
Waiting until WT patches the requested functionality of respecting ignore_prepare so that it doesn't return prepare conflicts | ||||||||||||||||
| Comment by Jordi Olivares Provencio [ 15/Feb/22 ] | ||||||||||||||||
|
louis.williams daniel.gottlieb On a second review of the code I believe that the timeout might not even be necessary as long as WT can accurately ignore Prepared Transactions when requested by the transaction. As I mentioned in a previous comment, the dbhash function was the only one at issue with this causing a deadlock. The problem is that I now realise that the deadlock shouldn't have happened in the first place because dbHash explicitly ignores prepare conflicts too. So the WT_PREPARE_CONFLICT shouldn't have been returned in the first place. keith.bostic I believe that as long as WT respects the ignore_prepare option in a transaction we shouldn't have to make modification in our code. | ||||||||||||||||
| Comment by Louis Williams [ 15/Feb/22 ] | ||||||||||||||||
|
daniel.gottlieb this is a good point. The sequence described in | ||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 14/Feb/22 ] | ||||||||||||||||
|
I don't think it's viable to make readers unavailable if their read timestamp is between the durable and commit timestamp of a WT_UPDATE. Our prepared transaction system is built on the premise that readers can choose any timestamp without regard to falling on some implementation detail boundary. To write some claims down about how MDB uses commit + durable timestamps on adjacent WT_UPDATEs in an update chain: Txn_first happens before Txn_second
I expect the write_timestamp_usage=ordered setting to fire if any of those assertions fail. | ||||||||||||||||
| Comment by Jordi Olivares Provencio [ 14/Feb/22 ] | ||||||||||||||||
|
Hi again keith.bostic, we discussed it internally and it seems that it is a possible fix we can do on our end. We'll preemptively be committing it into our codebase in preparation for your patch. In parallel we discovered the second error present in our tests. We have logic on our end that is setting the ignore_prepare option in the transaction but it seems that it is being ignored in the current patch. This means that even though it specifies that WT_PREPARED_CONFLICT should not be an expected outcome it is receiving it and leading to issues. Would it be possible to fix it in your patch? | ||||||||||||||||
| Comment by Keith Bostic (Inactive) [ 11/Feb/22 ] | ||||||||||||||||
|
jordi.olivares-provencio, thank you! Is this a reasonable "real fix" for MDB Server? To the extent that ::WT_PREPARE_CONFLICT is resolved by a housekeeping thread advancing stable, it seems reasonable there would not necessarily be another concurrent transaction running. | ||||||||||||||||
| Comment by Jordi Olivares Provencio [ 11/Feb/22 ] | ||||||||||||||||
|
Adding a 1 millisecond timeout to the lock seems to fix a great deal of these issues as can be seen here: https://spruce.mongodb.com/version/62064f15d6d80a25a22f5dda/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC | ||||||||||||||||
| Comment by Jordi Olivares Provencio [ 11/Feb/22 ] | ||||||||||||||||
|
Going by the error logs and the core dump I believe the concurrent dbHash command execution is to blame here. Looking at the core dumps there's a deadlock occurring in the DBHashCmd::_hashCollection method while iterating the cursor. It seems that the method on our end is checking if the result of iterating it returns a WT_PREPARE_CONFLICT and if so waits until another WUOW commits or aborts in order to try again. The problem is that in those tests this won't happen as there is no other concurrent transaction, thus leading to a deadlock. | ||||||||||||||||
| Comment by Keith Bostic (Inactive) [ 09/Feb/22 ] | ||||||||||||||||
|
louis.williams, we now have a WiredTiger PR that returns ::WT_PREPARE_CONFLICT on this error and we've run the patch build. It's cleaner than before, but there is still a significant amount of fallout: https://spruce.mongodb.com/version/620312875623432d870f0ea5/tasks If we could have someone help us understand where this change is causing problems for server, that would be terrific. – Thanks! | ||||||||||||||||
| Comment by Keith Bostic (Inactive) [ 08/Feb/22 ] | ||||||||||||||||
|
louis.williams, it's a durability problem we're reasonably confident isn't a real problem in MongoDB server. We'd like to get it out of the way to eliminate another data consistency corner case, but we don't believe there are actual problems in the server. It's currently stalled on us rewriting our patch to use ::WT_PREPARE_CONFLICT as the return failure code and re-running the patch build to identify remaining failures, and that will happen today. | ||||||||||||||||
| Comment by Louis Williams [ 08/Feb/22 ] | ||||||||||||||||
|
keith.bostic, what is the current priority of this ticket? | ||||||||||||||||
| Comment by Keith Bostic (Inactive) [ 08/Feb/22 ] | ||||||||||||||||
|
We've been talking this over inside the WiredTiger group, and we think a better way forward might be to return ::WT_PREPARE_CONFLICT instead of ::WT_ROLLBACK or ::WT_NOTFOUND. Preliminary testing indicates this approach leads to fewer problems, and so we're going to try implementing it for real and do some more testing, the results of which we should have ready for review in a day or so. |