[SERVER-40898] Transaction table updates may be applied out of order when retryable writes and transactions are in the same secondary batch Created: 29/Apr/19  Updated: 29/Oct/23  Resolved: 16/May/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.12

Type: Bug Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: Blake Oler
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-39844 Create concurrency workload with migr... Closed
Related
related to SERVER-34651 Performance regression on secondary a... Closed
related to SERVER-39793 Update the state in transaction table... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

 
load("jstests/libs/write_concern_util.js");
 
let rst = new ReplSetTest({nodes: 2});
rst.startSet();
rst.initiate();
 
let primary = rst.getPrimary();
let secondary = rst.getSecondary();
 
// Pause replication so ops are applied in a single batch on secondary.
stopServerReplication(secondary);
 
const session = primary.getDB("test").getMongo().startSession({causalConsistency: false});
const sessionDb = session.getDatabase("test");
const sessionColl = sessionDb["test"];
 
// Create collection.
assert.commandWorked(sessionColl.insert({}));
 
var k = 0;
jsTestLog("Running some retryable writes.");
for (var i = 0; i < 5; i++) {
    assert.commandWorked(primary.getDB("test").runCommand({
        insert: 'user',
        documents: [{x: k, retryableWrite: 1}],
        lsid: {id: session.getSessionId().id},
        txnNumber: NumberLong(k)
    }));
    k++;
}
 
jsTestLog("Running some transactions.");
for (var i = 0; i < 5; i++) {
    assert.commandWorked(sessionDb.runCommand({
        insert: "test",
        documents: [{x: k}],
        readConcern: {level: "snapshot"},
        txnNumber: NumberLong(k),
        startTransaction: true,
        autocommit: false
    }));
    assert.commandWorked(sessionDb.adminCommand(
        {commitTransaction: 1, writeConcern: {w: 1}, txnNumber: NumberLong(k), autocommit: false}));
    k++;
}
 
restartServerReplication(secondary);
// Expect a dbhash mismatch on 'config.transactions' when the dbhash check runs at stopSet().
rst.stopSet();

Sprint: Sharding 2019-05-06, Sharding 2019-05-20
Participants:
Linked BF Score: 0

 Description   

When applying operations on a secondary, we may need to update the config.transactions table for either retryable writes or multi-statement transactions. We update the transactions table by producing "derived" operations in the SessionUpdateTracker; these ops represent writes to the config.transactions table and are scheduled onto oplog applier threads like normal ops. As we iterate through each operation in a batch and assign them to applier threads, we may defer the transactions table updates for retryable writes by storing them in a map from session id to oplog entry. If we don't explicitly flush the session updates while scheduling ops inside the first call to SyncTail::_fillWriterVectors, those ops are scheduled later on, after an explicit flush. Transactions table updates derived from multi-statement transaction oplog entries, however, are returned immediately and scheduled right away.

The consequence is that the transactions table update for a retryable write may get applied after the transactions table update for a multi-statement transaction, even if the retryable write appeared before the multi-statement transaction in the oplog. We may consider fixing this by deferring the multi-statement transaction updates to the transactions table as well, while also making sure that the "in-progress" and "committed" writes are kept distinct.
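The scheduling asymmetry described above can be modeled in a few lines of plain JavaScript. This is an illustrative sketch only, with hypothetical names; the actual logic lives in the C++ SyncTail and SessionUpdateTracker code.

```javascript
// Hypothetical model of how derived config.transactions updates get ordered:
// retryable-write updates are deferred in a per-session map and flushed after
// the batch, while transaction-derived updates are scheduled immediately.
function scheduleBatch(batch) {
  const deferred = new Map();   // lsid -> latest deferred retryable-write update
  const scheduled = [];         // order in which table updates are applied

  for (const op of batch) {
    if (op.kind === "retryable") {
      deferred.set(op.lsid, {lsid: op.lsid, ts: op.ts}); // defer, keep latest
    } else {
      // Multi-statement transaction: derived update scheduled right away.
      scheduled.push({lsid: op.lsid, ts: op.ts});
    }
  }
  // Deferred retryable-write updates are only flushed at the end.
  for (const upd of deferred.values()) scheduled.push(upd);
  return scheduled;
}

// A retryable write at ts 1 followed by a transaction commit at ts 2 on the
// same session yields table updates in the order [2, 1]: out of timestamp order.
const out = scheduleBatch([
  {kind: "retryable", lsid: "s1", ts: 1},
  {kind: "txn", lsid: "s1", ts: 2},
]);
// out.map(u => u.ts) → [2, 1]
```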



 Comments   
Comment by Githook User [ 16/May/19 ]

Author:

{'name': 'Blake Oler', 'username': 'BlakeIsBlake', 'email': 'blake.oler@mongodb.com'}

Message: SERVER-40898 Immediately clear config.transactions table updates
for session upon generating update from multi statement transaction.
Branch: master
https://github.com/mongodb/mongo/commit/cf23836e06fad69ec55a4479ba27297682ce9c1c

Comment by William Schultz (Inactive) [ 01/May/19 ]

Yes, blake.oler, that approach sounds good to me. I believe the original motivation for deferring the transaction table updates for retryable writes was to reduce write amplification for every one of these operations (see SERVER-34651). I don't think that flushing the deferred retryable writes at every new transaction op should be a problem with regard to performance here. If there are many transaction ops in a batch on the same session, we will already be incurring a fair amount of write amplification for that batch (which we can't avoid, because we need the full history of transaction table writes in this case), so the extra flushes probably shouldn't hurt too much.

Furthermore, I would assume that we originally worried most about retryable write performance regressions in the cases where they replaced normal writes, i.e. in 3.6 when they were first introduced. The performance characteristics of retryable writes and their interaction with transactions in 4.2 are less well known and not well quantified at this point, so it's probably not something we should worry much about until it comes up as a concrete issue.

Comment by Blake Oler [ 01/May/19 ]

william.schultz, here's an approach for this ticket; let me know if it looks good:

Whenever a derived entry is generated from a transaction operation for a session, I would like to immediately clear the currently stored transaction table write for that session. This way we immediately write all transaction operation-related writes as they are generated. If any retryable writes come in afterwards for a session, they will be deferred as expected, until another derived transaction operation write is generated. This means we will not need to flush buffered writes for other sessions: any session that has only retryable writes in a batch will always have its transaction table entry deferred until the end.

I have a patch with this approach that works.
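A plain-JavaScript sketch of this approach (hypothetical names, not the actual C++ patch): when a transaction-derived update is generated for a session, the deferred retryable-write update for that session is flushed first, so the two table writes keep their oplog order.

```javascript
// Hypothetical model of the proposed fix: flush a session's deferred
// retryable-write update immediately before scheduling a transaction-derived
// update for the same session.
function scheduleBatchFixed(batch) {
  const deferred = new Map();   // lsid -> latest deferred retryable-write update
  const scheduled = [];         // order in which table updates are applied

  for (const op of batch) {
    if (op.kind === "retryable") {
      deferred.set(op.lsid, {lsid: op.lsid, ts: op.ts});
    } else {
      // Fix: clear this session's buffered update first, so the retryable
      // write's table update is applied before the transaction's.
      if (deferred.has(op.lsid)) {
        scheduled.push(deferred.get(op.lsid));
        deferred.delete(op.lsid);
      }
      scheduled.push({lsid: op.lsid, ts: op.ts});
    }
  }
  // Sessions that saw only retryable writes are still deferred to the end.
  for (const upd of deferred.values()) scheduled.push(upd);
  return scheduled;
}
```

With this change, a retryable write at ts 1 followed by a transaction at ts 2 on the same session produces table updates in the order [1, 2], while sessions with only retryable writes keep the existing deferral behavior.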

Comment by William Schultz (Inactive) [ 30/Apr/19 ]

Yes, my understanding after speaking with jason.chan about this was that we cannot batch the transaction table writes for a multi-oplog-entry transaction, because we need the "in-progress" and "committed" writes to the transaction table to be distinct, i.e. to have different timestamps. As you said, we need to keep the history of these writes on the secondary.

My point in the ticket description was that we could consider deferring transactions table writes for both retryable writes and transactions, so that we maintain the proper ordering between them. We would, however, need a new mechanism for storing a chain of deferred writes for each session, as opposed to the single deferred write we store now in the _sessionsToUpdate map. One possible solution would be to make that map of type LogicalSessionIdMap<std::vector<OplogEntry>> instead of LogicalSessionIdMap<OplogEntry>. That would allow us, for example, to store both an "in-progress" and a "committed" transaction table update on the same session. That is, the set of deferred transaction table writes for each session becomes a sequence of writes as opposed to a single write.

Presumably we could also optimize away the retryable write updates if necessary. For example, when a transaction op comes in on a session that already has a deferred transaction table update for a retryable write, we could erase the retryable write update from the chain and replace it with the newer transaction update. That was one of my initial thoughts, but there may be better, simpler alternatives. As you said, flushing all previously buffered writes whenever we hit a multi-statement transaction op may actually be the easiest solution that fixes this problem.
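A plain-JavaScript sketch of this vector-per-session alternative (hypothetical names; the real map would be the C++ LogicalSessionIdMap<std::vector<OplogEntry>>): consecutive retryable-write updates on a session are merged down to the latest one, while transaction-derived updates stay as distinct entries in the chain, preserving their relative order.

```javascript
// Hypothetical model of deferring ALL table updates in a per-session chain.
// Consecutive retryable-write updates merge (latest wins); transaction
// updates remain distinct so each keeps its own timestamp.
function deferAll(batch) {
  const sessionsToUpdate = new Map(); // lsid -> array (chain) of deferred updates

  for (const op of batch) {
    const chain = sessionsToUpdate.get(op.lsid) || [];
    const last = chain[chain.length - 1];
    if (op.kind === "retryable" && last && last.kind === "retryable") {
      chain[chain.length - 1] = op;   // merge: only the latest retryable update matters
    } else {
      chain.push(op);                 // transactions (and first retryable) stay distinct
    }
    sessionsToUpdate.set(op.lsid, chain);
  }
  // Flush each session's chain in order at the end of the batch.
  return [...sessionsToUpdate.values()].flat();
}
```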

Comment by Siyuan Zhou [ 30/Apr/19 ]

william.schultz, could you please clarify the last piece of your proposal?

making sure that the "in-progress" and "committed" writes are kept distinct.

I'm afraid we cannot delay transaction table writes for transactions, because we need the full history of timestamped writes in the transaction table. The point of delaying transaction table writes for retryable writes is that we want to merge the writes on the same session, which we cannot do for transactions. It's fine for retryable writes to lose that history on rollback, but it is not safe for transactions.

As a solution, we need to flush or clear the buffered writes once we are about to write to the transaction table for a transaction.

Comment by William Schultz (Inactive) [ 30/Apr/19 ]

One additional thing to note here is that the underlying bug caused oplog entries that touch the same document to be applied out of timestamp order on the secondary. In steady state replication, for a single applier thread, I believe it should always hold true that oplog entries are applied in timestamp order, due to the fact that we iterate through a batch in timestamp order and schedule each op sequentially. It may be helpful to add an invariant (perhaps an fassert) in SyncTail that enforces this. It could help catch bugs like this in the future more quickly. The implementation of this bug fix may affect whether or not we can enforce this invariant strictly.
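The suggested invariant can be sketched in plain JavaScript (hypothetical name; a real implementation would be a C++ invariant or fassert inside SyncTail's per-thread application path):

```javascript
// Hypothetical check: ops handed to a single applier thread must be in
// non-decreasing timestamp order; violations indicate bugs like this one.
function assertTimestampOrder(ops) {
  for (let i = 1; i < ops.length; i++) {
    if (ops[i].ts < ops[i - 1].ts) {
      throw new Error(
        `ops applied out of timestamp order: ts ${ops[i - 1].ts} then ts ${ops[i].ts}`);
    }
  }
}
```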

Comment by William Schultz (Inactive) [ 29/Apr/19 ]

blake.oler originally discovered this bug in a patch build failure for a concurrency_sharded_multi_stmt_txn_with_stepdowns suite.

Generated at Thu Feb 08 04:56:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.