[SERVER-39074] Update outside transaction erasing update from committed prepared transaction Created: 17/Jan/19  Updated: 29/Oct/23  Resolved: 19/Mar/19

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 4.1.10

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Louis Williams
Resolution: Fixed Votes: 0
Labels: prepare_basic, txn_storage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt Dependency
has to be done before SERVER-38436 Add concurrency workload for majority... Closed
Related
related to WT-4580 Abort transactions that perform updat... Closed
related to SERVER-39996 Setting ignorePrepared should happen ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Run this concurrency workload in concurrency_sharded_replication.yml (may need to be repeated):

'use strict';
 
load('jstests/concurrency/fsm_libs/extend_workload.js');  // for extendWorkload
load('jstests/concurrency/fsm_workloads/multi_statement_transaction_simple.js');  // for $config
 
var $config = extendWorkload($config, function($config, $super) {
 
    $config.threadCount = 20;
    $config.iterations = 100;
 
    $config.data.counter = 0;
 
    /**
     * Updates a document that may be written to by the transaction run in the base workload.
     */
    $config.states.updateTxnDoc = function updateTxnDoc(db, collName) {
        this.counter += 1;
 
        // Choose a random document that may be written to by the base workload. The base collection
        // contains documents with _id ranging from 0 to the number of accounts.
        const transactionDocId = Random.randInt(this.numAccounts);
        const threadUniqueField = 'thread' + this.tid;
        assertWhenOwnColl.writeOK(db[collName].update({_id: transactionDocId},
                                                      {$set: {[threadUniqueField]: this.counter}}));
    };
 
    $config.transitions = {
        init: {transferMoney: 1},
        transferMoney: {transferMoney: 0.5, checkMoneyBalance: 0.1, updateTxnDoc: 0.4},
        checkMoneyBalance: {transferMoney: 0.5, updateTxnDoc: 0.5},
        updateTxnDoc: {transferMoney: 0.5, updateTxnDoc: 0.5},
    };
 
    return $config;
});

Sprint: Repl 2019-02-11, Repl 2019-02-25, Storage NYC 2019-03-25
Participants:

 Description   

In the attached concurrency workload, multi-statement transactions simulate transferring money between accounts by subtracting some amount from a document's "balance" field and adding it to another document's balance, periodically asserting the sum of all balances never changes (through a collection scan in a transaction with snapshot read concern).

The workload also concurrently updates other fields in these documents outside of a transaction and when run against a sharded cluster, it fails because the sum of all balances does change, implying one of the updates from a committed transaction was lost. Every thread observes the unexpected sum, so this doesn't seem to be an issue with reading at the wrong timestamp, and the non-transaction updates don't modify the "balance" field and use $set, not replacement updates.

This failure goes away if updates set ignore_prepare=false (currently set to true here), but from talking with the replication team, they don't believe this should be required for local updates. If that's true, then it's possible the non-transaction updates are somehow overwriting the updates of committed prepared transactions.

Example failure in evergreen.



 Comments   
Comment by Githook User [ 19/Mar/19 ]

Author:

{'email': 'louis.williams@mongodb.com', 'name': 'Louis Williams', 'username': 'louiswilliams'}

Message: SERVER-39074 All operations enforce prepare conflicts by default

Prepare conflicts may only be safely ignored when a command can
guarantee it does not perform writes. Prepare conflicts are ignored when
the read concern is local, available, or majority and the command is
aggregate, count, distinct, find, getMore, or group. Aggregate is a
special case because it may perform writes to an output collection, but
it enables prepare conflict enforcement before doing so.

Additionally, connections from a DBDirectClient inherit the
ignore_prepare state from their parent operation.
Branch: master
https://github.com/mongodb/mongo/commit/197233a97c2a8859b82ba1ffeac97ba2719f6470

Comment by Judah Schvimer [ 11/Mar/19 ]

One addendum: We discussed keeping the even stronger invariant that we cannot change ignorePrepared at all, but never setting ignorePrepared in DBDirectClient at all, true or false, and letting it inherit the parent value (or the default if none was set). We likely can skip waitForReadConcern entirely in DBDirectClient but that is out of scope of this work.

milkie suggested also making ignorePrepared a boost::optional and just using the storage-engine default if none needs to explicitly be specified based on the user provided read concern.

Comment by Samyukta Lanka [ 11/Mar/19 ]

Because of concerns surrounding ignorePrepared being set to true by default, we have come up with a new approach.

We will set the default for ignorePrepared to be false (meaning by default we will enforce prepare conflicts). Specific read commands with readConcern local, majority or available will set ignorePrepared to be true. The commands will be find, aggregate, count, group and distinct, but we will not allow DB direct clients to set ignorePrepared to true. This solution will still require the aggregation work around described above (setting ignorePrepared to true during the read stages and to false during the write stage). We will also keep the invariant that we cannot change ignorePrepared from true to false during an open storage transaction.

Comment by Samyukta Lanka [ 25/Feb/19 ]

What commands besides mapReduce?

mergeAuthzCollections is the only other command.

Also is WT going to provide an error if we try to ignore a prepare conflict in a transaction that does a write?

Yes, I believe that what is described in WT-4580 should work for this solution.

Comment by Judah Schvimer [ 25/Feb/19 ]

This sounds reasonable to me.

There are a couple of commands that still fail the invariant because they use DBDirectClient to do reads and writes on the same storage transaction, so we'll be explicitly splitting those storage transactions by calling abandonSnapshot on the recovery unit.

What commands besides mapReduce?

Also is WT going to provide an error if we try to ignore a prepare conflict in a transaction that does a write?

Comment by Samyukta Lanka [ 25/Feb/19 ]

There is actually a problem with the solution proposed above. I ran a patch build after implementing the whitelist for commands that set ignore_prepare=false and adding an invariant to make sure that we never try to change ignore_prepare after a storage transaction is started. This invariant was triggered quite often in the patch build due to the command whitelist. This is a problem because changing ignore_prepare after a storage transaction is started won't actually change the behavior of whether prepare conflicts are ignored or not.

The new idea for approaching this ticket is to use each command's supportsWriteConcern function to determine if it performs a write or not, and if it does, set ignore_prepare=false. We will approach aggregate pipelines the same as above by setting ignore_prepare=true during read stages and only setting ignore_prepare=false during the write stage. This will allow the reads to not block on prepare conflict, but make sure that the writes do not overwrite any writes from a prepared transaction. We also want to make sure that index creation will not block on prepared transactions as that would drastically increase the time taken to build an index. louis.williams suggested that for now we could set ignore_prepare=false for createIndex and a change could be made to index builds on a separate ticket to make sure that they won't block on prepare conflicts.

Does anyone know if there are any other commands that do writes where setting ignore_prepare=false would cause problems (like unnecessarily blocking reads on prepare conflicts)? I'm not sure if this could happen with collMod or convertToCapped.

This solution also hits the invariant that we don't change ignore_prepare after a storage transaction is started, but much less than the other solution. judah.schvimer and I decided that we could relax the invariant to only check if we are trying to change ignore_prepare to false after a storage transaction is started, since there isn't a correctness problem if we don't ignore prepared transactions. There are a couple of commands that still fail the invariant because they use DBDirectClient to do reads and writes on the same storage transaction, so we'll be explicitly splitting those storage transactions by calling abandonSnapshot on the recovery unit.

tess.avitabile judah.schvimer geert.bosch charlie.swanson Does this seem reasonable?

Comment by Andy Schwerin [ 15/Feb/19 ]

What about insert and delete?

On Thu, Feb 14, 2019, 3:57 PM Samyukta Lanka (Jira) <jira@mongodb.org>

Comment by Andy Schwerin [ 15/Feb/19 ]

Insert, too?

Comment by Samyukta Lanka [ 14/Feb/19 ]

After talking with charlie.swanson and judah.schvimer, we decided on creating a whitelist for write commands that need to set ignore_prepare=false. The whitelisted commands will be update, find_and_modify, remove, do_txn and apply_ops. In aggregate pipelines with $out, we will set ignore_prepare=false only for writes and keep ignore_prepare=true when reading.

This means that for WT-4580, we would need to guarantee that updates are not allowed after skipping prepared updates with ignore_prepare=true, because we won't be able to guarantee that no updates are allowed while ignore_prepare=true.

Comment by Louis Williams [ 13/Feb/19 ]

Discussed with judah.schvimer offline. He believed that WiredTiger would make the strong guarantee that no updates are allowed while ignore_prepare=true. My conclusion from this discussion, and the ticket I filed for WT-4580, was for the weaker guarantee that updates are not allowed after skipping prepared updates with ignore_prepare=true. I think the WiredTiger team can decide which route makes the most sense, and faster performance-wise, because MongoDB will stop performing any writes while ignore_prepare=true.

With either of these changes, we should still be able to add an invariant in writeConflictRetry that an operation that receives a WCE must not have RecoveryUnit::getIgnorePrepared() == true, a function we will need to add.

Comment by Judah Schvimer [ 13/Feb/19 ]

milkie, would returning a WT_ROLLBACK error lead to an infinite write-conflict retry loop, since an update in a prepared transaction will never be able to succeed?

Comment by Eric Milkie [ 13/Feb/19 ]

Specifically, the ticket says for the update to return WT_ROLLBACK status. This would be handled the same way we handle WT_ROLLBACK errors today, which is to throw a WriteConflict exception. It would then be handled by various logic and either retried or returned to the user, depending on the conditions.

Comment by Judah Schvimer [ 13/Feb/19 ]

louis.williams, WT-4580 calls for the update to fail, not an invariant. How will that manifest to a user if we have a bug that hits that?

Comment by Louis Williams [ 13/Feb/19 ]

My original suggestion to use an invariant in MongoDB won't work because it's legal to call RecoveryUnit::commit() without having performed any writes. I've opened WT-4580 for the fix in WiredTiger.

Comment by Judah Schvimer [ 11/Feb/19 ]

This work should test that all mongodb-transactions set ignorePrepared = false since all mongodb-transactions could do a write. This should already be the behavior, but tests to ensure that behavior is correct and maintained would be good.

Comment by Alexander Gorrod [ 28/Jan/19 ]

Sounds good - let me know if it makes sense to move the check inside WiredTiger after closer inspection.

Comment by Judah Schvimer [ 25/Jan/19 ]

Thanks louis.williams. If it's in the mongodb layer, the replication team can do it as part of this ticket with your team's guidance and review.

Comment by Louis Williams [ 25/Jan/19 ]

judah.schvimer Yes, I think it would be easy to, and we should, invariant in the WiredTigerRecoveryUnit that a transaction does not commit with _ignorePrepared == kIgnore.

Comment by Judah Schvimer [ 25/Jan/19 ]

alexander.gorrod and daniel.gottlieb, that sounds good to me as well. My one concern is that this means if we miss a case then we risk a data corruption bug. Would it be worthwhile/easy enough to add an invariant that we do not do a write in an ignore_prepare=true storage-transaction?

Comment by Tess Avitabile (Inactive) [ 25/Jan/19 ]

Yes, that sounds good to me.

Comment by Alexander Gorrod [ 25/Jan/19 ]

I wanted to summarize my reading of the conclusion above:

MongoDB will stop doing updates from transactions that set ignore_prepared=true. If there is a need to change that in the future - you will let the storage engines team know and we can schedule the work in WiredTiger to cause any transaction with ignore_prepared=true that skips a prepared update to abort as soon as it does an update.

Does that match your understanding?

Comment by Judah Schvimer [ 22/Jan/19 ]

If we ran transactions with weaker read concern, writes to prepared documents in the transaction would need to either block or abort the transaction with a WCE. I don't see how either of those gets more complicated than what we already have. I guess an alternative would be to get the behavior this ticket is describing when using weaker read concerns.

Comment by Andy Schwerin [ 22/Jan/19 ]

Might that get complicated is we want to start running MongoDB
transactions with weaker readconcern?

On Tue, Jan 22, 2019, 11:01 AM Judah Schvimer (Jira) <jira@mongodb.org>

Comment by Judah Schvimer [ 22/Jan/19 ]

ignore_prepared=false is currently only used by linearizable, snapshot, and afterClusterTime reads (SERVER-36382). It was my understanding that any other read concern would be able to (and thus want to for improved performance) see the pre-image of the prepared transaction. local and majority read concern transactions, currently set ignore_prepared=false since that decision happens after we upconvert to snapshot read concern.

Updates don't have a read concern, so they default to ignore_prepared=true. I thought they would receive a WCE, but that was a misunderstanding. We could make CRUD ops that do writes that could conflict with prepared transactions (updates, deletes, findAndModify) set ignore_prepared=false. I think that would be sufficient since locking takes care of DDL ops.

Comment by Tess Avitabile (Inactive) [ 22/Jan/19 ]

I think that either solution daniel.gottlieb described would be correct: (1) transactions that ignored a prepared update must not commit, or (2) transactions with ignore_prepare=true must not perform writes. I don't think we can assume that mixing local updates with sharded transactions will be rare, so it might be worth considering performance, and I buy daniel.gottlieb's argument that (2) will be more performant.

Comment by Andy Schwerin [ 22/Jan/19 ]

I suspect it is a bug in repl to use ignore_prepare=true in updates and deletes. I think it falls out of the behavior of local and majority read concern, which is probably treated the same for reads and writes at present. tess.avitabile .

Comment by Michael Cahill (Inactive) [ 21/Jan/19 ]

daniel.gottlieb, at this point I haven't tried to lay out a resolution because like you I'm not sure I understand the intent. We could indeed flag WT transactions so that if they ignore a prepared update, they cannot go on to commit.

I hope that mixing local updates with sharded transactions isn't common enough for performance considerations to determine the solution. Instead, I'm inclined to make it difficult for applications to lose writes when using default transaction and update semantics.

Let's wait to hear from the Replication team (and schwerin). If the intent is to have ignore_prepare=true as a common case for updates (as opposed to only enabled for "available" read concern), then I think we need to address this in the storage engine because I don't see any way for MongoDB to efficiently figure out that these operations have conflicted.

Comment by Daniel Gottlieb (Inactive) [ 21/Jan/19 ]

I'm not sure what the right compromise is; MongoDBs intentions for using ignore_prepare=true (for things outside of initial sync table scans) is not entirely clear to me.

[If an ignore_prepare=true transaction can make updates] and it attempts to update a document with a prepared update, is the expectation that it should fail with a WT_ROLLBACK at that point?

Can you clarify your statement some? In this example, at the time of the update, the document is no longer prepared. MongoDB making a derivative change relative to the preimage of a prepared update caused the problem. I think the resolution you're contemplating is:

If a writing WT transaction ever reads a document where ignore_prepare=true results in a returning a document (as opposed to a prepare conflict), then the WT transaction should return WT_ROLLBACK at its next earliest opportunity. Even if the write occurred before the offending read.

I'm not sure about the merits of MongoDB relying on such a contract. On the one hand I've heard MongoDB's strategy is to have prepared conflicts block inside storage because storage can be optimized for knowing when the conflict is resolved. However, using ignore_prepare=true and relying on storage to bubble up a WCE just pushes the retrying to the client (where over-waiting introduces latency and under-waiting burns cycles, assuming the client still cares about performing the transaction).

But perhaps codifying that resolution would be a relatively cheap way to enforce updates don't get dropped. Thoughts tess.avitabile, judah.schvimer?

Comment by Michael Cahill (Inactive) [ 21/Jan/19 ]

daniel.gottlieb, that explanation seems likely. I didn't expect ignore_prepare=true transactions to perform writes (because as you explain, they ignore updates by design and therefore can cause lost writes).

I vaguely remember some conversations about such transactions always being read-only, but clearly that didn't become part of the interface.

What is the expected behavior here: should a transaction configured with ignore_prepare=true ever be permitted to perform updates? If so, and it attempts to update a document with that already has a prepared update, is the expectation that it should fail with a WT_ROLLBACK at that point?

Comment by Daniel Gottlieb (Inactive) [ 18/Jan/19 ]

I have a hypothesis of what's happening. I believe the layer above storage implementing multi-statement transactions depends on the following storage engine behavior/contract:
When a WT-transaction using ignore_prepare=true returns a pre-image instead of blocking, it must be the case that writing to that document in the same transaction will fail with a WriteConflictException.

However, I don't think WiredTiger satisfies that property. I can't say if this was intentional (i.e: perhaps WT transactions using ignore_prepare=true should not be performing writes) or is a bug. The modification I made to my previous example was to commit the prepared transaction prior to the updater doing a write:

| Preparer          | Updater                   | Reader      |
|-------------------+---------------------------+-------------|
| Begin             |                           |             |
| Timestamp :read 5 |                           |             |
| Write A 10        |                           |             |
| Prepare 10        |                           |             |
|                   | Begin :ignorePrepare true |             |
|                   | Timestamp :read 15        |             |
|                   | Read A (WT_NOTFOUND)      |             |
| Commit :commit 12 |                           |             |
|                   | Read A (10)               |             |
|                   | Write A 20                |             |
|                   | Commit :commit 20         |             |
|                   |                           | Begin       |
|                   |                           | Read A (20) |

WT's rule for returning WT_ROLLBACK (a WriteConflictException) is roughly "If a writer wishes to modify a document, all updates to that document must be visible to the writer for the update to succeed." In this sequence, by committing the prepared transaction, the updater is able to see the new copy of the update, which implies its write can succeed (and is demonstrated).

Traditionally, due to the overlap between the Begin/Commit on the "Preparer" and "Updater" sessions, the updater's write would get a WCE. However, whether intentional or not, the act of a transaction going from a prepared -> committed state changes the evaluation of whether the transactions are considered concurrent.

alexander.gorrod vamsi.krishna michael.cahill

Comment by Daniel Gottlieb (Inactive) [ 18/Jan/19 ]

Because the bug seems to go away with ignore_prepare=false, I suspect WT/Storage Glue is correctly identifying which documents are in a prepare state. Because the manifestation described above can be explained by Txn #3 seeing the pre-image of a document currently prepared by Txn #2, I'd start looking more into how writeConflictRetry (the exception thrown in response to WT_ROLLBACK, which should* be happening as per the table in my first comment) could be interfering with a multi-statement transaction.

It might be interesting to further enhance the repro by having the transactions do a read on the documents they are going to update, and putting the contents of that read into a "preimage" field.

Comment by Daniel Gottlieb (Inactive) [ 18/Jan/19 ]

I looked into this a bit more, reproducing the error. I added logging to the clients to have a second log to reconcile against and determine which document(s) became corrupt. Then I checked the oplog results to ensure there were no updates to a document between prepare/commit (there weren't) and to determine which transaction(s) went bad. What I found:

In my repro run, there was $7 more in total than expected. The $7 was entirely attributed to one account (and, from my perspective, one occurrence of the bug). The first three transactions (from the client logs) against the account (all debits/subtractions) were for $153, $7 and $157. The first three versions of the documents were $1847, $1840 and $1690. The relative differences being $153, $7 and $150. To me that means that the third transaction calculated its "$inc: -157" against the $1847 version, instead of the $1840 version.

As for the additional, irrelevant updates this repro does, there were updates between all of the transactional writes. Furthermore, the first two transactions were prepared while the third one was a single-shard transaction.

Comment by Daniel Gottlieb (Inactive) [ 18/Jan/19 ]

louis.williams pointed me to this and I wanted to rule out that WT had a simple oversight regarding ignore_prepare. Specifically, I was questioning whether using ignore_prepare in a WT transaction writing to some document D would have its write succeed despite D being in a prepared state. I've tried a few permutations of interleaving the preparer and updater, but have not been able to produce a wrong result. E.g:

| Preparer          | Updater                             | Reader                                  |
|-------------------+-------------------------------------+-----------------------------------------|
| Begin             |                                     |                                         |
| Timestamp :read 5 |                                     |                                         |
| Write A 10        |                                     |                                         |
| Prepare 10        |                                     |                                         |
|                   | Begin :ignorePrepare true           |                                         |
|                   | Timestamp :read 15                  |                                         |
|                   | Write A 20 (-31800 WT_ROLLBACK) [1] |                                         |
|                   | Commit :commit 20                   |                                         |
|                   |                                     | Begin                                   |
|                   |                                     | Read A (-31808 WT_PREPARE_CONFLICT) [2] |
 
Footnotes:
  [1] conflict between concurrent operations
  [2] conflict with a prepared update

Generated at Thu Feb 08 04:50:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.