[SERVER-35798] Writing an oplog entry for prepare should not push the lastApplied timestamp Created: 25/Jun/18  Updated: 29/Oct/23  Resolved: 07/Aug/18

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: 4.1.2

Type: Bug Priority: Major - P3
Reporter: Xiangyu Yao (Inactive) Assignee: Judah Schvimer
Resolution: Fixed Votes: 0
Labels: prepare_basic
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-35430 Abort a transaction if prepare fails Closed
Gantt Dependency
has to be done before SERVER-36023 Try to enable replica set transaction... Closed
Related
is related to SERVER-35821 readConcern:snapshot transactions nee... Closed
is related to SERVER-38499 Preparing transaction fails and trigg... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

commit: 8c5002b06fd737f75941a73785ce75e3ca7f5ce1
1. Put a 5 second sleep between (1) and (2) to increase the chance that another transaction starts in between.

      opObserver->onTransactionPrepare(opCtx);                                    
      lk.lock();                                                                  
      _checkIsActiveTransaction(lk, *opCtx->getTxnNumber(), true);                
      invariant(_txnState == MultiDocumentTransactionState::kPrepared);           
+     sleep(5);                                                                   
                                                                                  
      opCtx->getWriteUnitOfWork()->prepare(); 
 

2. repro.js:

const session = db.getMongo().startSession({causalConsistency: false});            
const sessionDB = session.getDatabase('test');                                     
sessionDB.a.insert({x: 1});                                                        
session.startTransaction({readConcern: {level: "snapshot"}});                      
assert.commandWorked(sessionDB.runCommand({                                        
    update: 'a',                                                                   
    updates: [{q: {x: 1}, u: {$inc: {x: 1}}}],                                     
}));                                                                               
                                                                                   
let awaitShell = startParallelShell(function() {                                   
    const session = db.getMongo().startSession({causalConsistency: false});        
    const sessionDB = session.getDatabase('test');                                 
    while (1) {                                                                    
        session.startTransaction({readConcern: {level: "snapshot"}});              
        sessionDB.a.find().toArray();                                              
        session.abortTransaction();                                                
    }                                                                              
});                                                                                
                                                                                   
assert.commandWorked(sessionDB.adminCommand({prepareTransaction: 1}));             
session.abortTransaction();                                                        
                                                                                   
awaitShell(); 

Sprint: Repl 2018-07-16, Repl 2018-07-30, Repl 2018-08-13
Participants:
Linked BF Score: 70

 Description   

Our current prepareTransaction() logic might need some changes: currently, (1) we first log the prepare and get a timestamp T. And then (2) use this timestamp T to set the prepare timestamp in WiredTiger. Since (1) pushes the lastApplied timestamp to T and another transaction could start between (1) and (2), with lastApplied T as the read timestamp, it breaks the rule that prepare timestamp should be greater than any active read timestamp.

 

Here is what happened in BF-9432:
1. A started a transaction, did the insertion, and tried to prepare.
2. A called prepareTransaction() in session.cpp. It first logged the prepare and got a timestamp T. This operation also pushed the lastApplied timestamp to T.
3. B started a transaction with a read, the read timestamp is set to the lastApplied which is T.
4. A now tried to set the prepare timestamp to T in WiredTiger. It will cause a WiredTiger error that prepare timestamp being equal to an active read timestamp.



 Comments   
Comment by Githook User [ 07/Aug/18 ]

Author:

{'username': 'judahschvimer', 'name': 'Judah Schvimer', 'email': 'judah@mongodb.com'}

Message: SERVER-35798 preallocate prepare timestamp
Branch: master
https://github.com/mongodb/mongo/commit/0ed9472dc8c2f85e957052f01336b1247c7a6e75

Comment by Ian Whalen (Inactive) [ 01/Aug/18 ]

since it doesn't send an email when a link is added, just a heads up that this bug is blocking us from finishing SERVER-36023, a part of the work to enable recoverable rollback for the inmem storage engine.

Comment by Judah Schvimer [ 05/Jul/18 ]

I talked to geert.bosch again and we came to the conclusion that Replication should choose a prepareTimetamp, then prepare the transaction, and then write the prepare oplog entry. This means that once the transaction writes the oplog entry it can only be aborted by the coordinator, since nothing after that point can fail before we return success to the user.

There is a period of time when a read can come in after we choose the prepareTimestamp but before we've prepared the transaction. This will now be safe though, because any read that cares, which is any with readConcern: snapshot or readConcern: *, afterClusterTime: > prepareTimestamp will call waitForAllEarlierOplogWritesToBeVisible as of SERVER-35821, which would wait for the prepare oplog entry to be visible before trying to do the read, and at that point the transaction will already be prepared.

As such, I propose this ticket's work item to be to put the transaction into prepare before writing the prepare oplog entry.

Comment by Andy Schwerin [ 29/Jun/18 ]

I believe I agree. Reads at "local" read concern don't offer any guarantees that require blocking when accessing documents in the "prepared" state. They can safely read what Judah refers to as the pre-image.

Comment by Judah Schvimer [ 29/Jun/18 ]

It is unclear to me why local reads ever block on prepared transactions given the above reasoning. To be consistent, I would expect them to have the same behavior as "available" reads and see the pre-image.

Comment by Spencer Brody (Inactive) [ 28/Jun/18 ]

We should consider whether the current behavior of local RC reads to potentially read a newer value and then an older one (by reading an operation that has committed, but where there are still uncommitted writes with early commit times in flight) is actually behavior as designed or is actually a bug.

CC schwerin

Comment by Judah Schvimer [ 26/Jun/18 ]

geert.bosch pointed out that this isn't actually a problem for normal reads. Local reads not at a timestamp already sometimes read behind committed transactions, and reading behind prepared transactions is not a problem. This is only a problem for reads at a timestamp. SERVER-35821 should fix this.

Comment by Louis Williams [ 26/Jun/18 ]

I think that may be doable. keith.bostic is there anything I may not be considering here?

Comment by Judah Schvimer [ 26/Jun/18 ]

louis.williams, the ideal behavior would be to allow that behavior for prepare_transaction and shortly after set the prepare_timestamp and then only block for reads higher than that.

Comment by Louis Williams [ 26/Jun/18 ]

Answering the question about WiredTiger prepare without a timestamp:

prepare_transaction could be modified so that when no timestamp is passed, it will return WT_PREPARE_CONFLICT for all reads or updates on documents that are part of a prepared transaction. The difference between this and the current behavior is that now all reads on a document at any timestamp, even before the prepared timestamp, will introduce prepare conflicts. Like Spencer said, MongoDB does a blocking wait for prepared transactions to commit or abort when conflicts are encountered, so readers will have no choice but to wait for a transaction to commit before reading.

Comment by Xiangyu Yao (Inactive) [ 26/Jun/18 ]

siyuan.zhou Say we just open a transaction B at 12, and now we try to prepare the previous transaction A with prepare timestamp 10, then we don't know the change made in transaction A should be visible to transaction B because we could later set the commit timestamp at 11, 12 or later for transaction A.
If the commit timestamp is 11/12, the change should be visible to transaction B.
If the commit timestamp is greater than 12, the change should not be visible to transaction B.

geert.bosch, louis.williams Are there any other reasons prepare timestamp should be greater than any open read timestamps?

Comment by Spencer Brody (Inactive) [ 26/Jun/18 ]

siyuan.zhou, it's important that the prepare timestamp be greater than the read timestamp of any running read transactions as all reads at timestamps greater than the prepare timestamp that read documents modified in the prepared transaction are supposed to block until the transaction is committed so they can know what the commit time is and thus whether they should read the pre or post image of the document. Without the invariant that the prepare timestamp is greater than the read timestamp of all running transactions then reads at timestamps higher than the prepare timestamp could read the document pre-image when they were supposed to return the post-image.

I think the issue here isn't just that the prepare oplog entry pushes the lastApplied timestamp forward, as it'd always be possible for a new write to come in and push the lastApplied forward. So I think the only options are to either atomically allocate an optime and prepare the WUOW, while guaranteeing that no new optimes are handed out during that window, which is machinery we don't currently have and seems likely to introduce a perf hit, or to do what Judah proposed in the last comment - to effectively have two-phases to putting a WT transaction into prepare state.

Comment by Judah Schvimer [ 26/Jun/18 ]

After talking to schwerin and spencer, one idea we think could work well is if we were able to call prepareTransaction() with no timestamp and block all reads on the affected documents, and later after writing the oplog entry and getting a prepareTimestamp we can set the prepareTimestamp and the blocking will begin to only happen on reads after the prepareTimestamp.geert.bosch and louis.williams, what are your thoughts?

Comment by Siyuan Zhou [ 26/Jun/18 ]

xiangyu.yao, could you please explain why we disallow setting prepareTimestamp that's lower than or equal to the read timestamp of any running WT transactions? When discussing the proposal of reserving prepare timestamp, we were worried about the cost of WUOW::prepare() inside a mutex. Could you please shed some light on that?

Comment by Judah Schvimer [ 26/Jun/18 ]

Potential solutions:
1) reserve a prepare timestamp, set the prepare timestamp, and then log the oplog entry with that prepare timestamp.
cons: potential long period of time with an oplog hole
2) don't set last applied when processing prepare commands
cons: unclear what side effects this will have on the consensus protocol if any
siyuan.zhou and spencer, what do you think?

Generated at Thu Feb 08 04:41:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.