[SERVER-41178] TransactionHistoryIterator should do untimestamped reads during rollback recovery Created: 16/May/19  Updated: 29/Oct/23  Resolved: 29/May/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.13

Type: Task Priority: Major - P3
Reporter: Tess Avitabile (Inactive) Assignee: Lingzhi Deng
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by WT-4780 Enable assertion that commit timestam... Closed
Related
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-06-03
Participants:
Linked BF Score: 21

 Description   

During rollback recovery, the lastApplied timestamp can be ahead of the common point, because we do not update the lastAppliedOpTime until after we are done recovering the oplog. By default, TransactionHistoryIterator does timestamped reads on the oplog using lastApplied. Such a read can race with the config.transactions table update (run by a different replication writer thread), whose commit timestamp is less than the common point and thus less than the read timestamp of the TransactionHistoryIterator. WiredTiger enforces that the commit timestamp of a non-prepared transaction (which the config.transactions update is) is newer than the latest active reader. See WT-4780. So we need to make TransactionHistoryIterator read untimestamped to avoid violating this WiredTiger rule. This is safe because there should be no concurrent writes to the oplog, and all oplog entries we need on the transaction's oplog chain should also be visible under untimestamped reads.
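
The rule described above can be illustrated with a small toy model. This is only a sketch, not WiredTiger's or the server's API: `ToyStore`, `TimestampViolation`, `replay`, and the timestamp values are all hypothetical names and numbers chosen for illustration, and the check assumes a violation means "commit timestamp is before the latest active read timestamp".

```python
class TimestampViolation(Exception):
    """Raised when a commit lands behind an active timestamped reader."""


class ToyStore:
    """Toy stand-in for the storage engine rule (hypothetical, not WiredTiger's API)."""

    def __init__(self):
        self._next_id = 0
        self._active_reads = {}  # reader id -> pinned read timestamp

    def begin_read(self, read_ts=None):
        """Open a reader; read_ts=None models an untimestamped read."""
        reader_id = self._next_id
        self._next_id += 1
        if read_ts is not None:
            # Only timestamped readers pin a read timestamp.
            self._active_reads[reader_id] = read_ts
        return reader_id

    def end_read(self, reader_id):
        self._active_reads.pop(reader_id, None)

    def commit(self, commit_ts):
        """Commit a non-prepared write at commit_ts, enforcing the toy rule."""
        latest = max(self._active_reads.values(), default=None)
        if latest is not None and commit_ts < latest:
            raise TimestampViolation(
                f"commit_ts {commit_ts} is before latest active reader {latest}"
            )
        return commit_ts


# Illustrative values only: lastApplied is ahead of the common point.
COMMON_POINT = 10
LAST_APPLIED = 20
CONFIG_TXN_COMMIT_TS = 8  # below the common point, as in the description


def replay(read_ts):
    """One interleaving: the oplog reader is open while the update commits."""
    store = ToyStore()
    reader = store.begin_read(read_ts)  # models TransactionHistoryIterator's read
    try:
        store.commit(CONFIG_TXN_COMMIT_TS)  # models the config.transactions update
        return "ok"
    except TimestampViolation:
        return "violation"
    finally:
        store.end_read(reader)
```

In this model, `replay(LAST_APPLIED)` hits the violation because the reader pins a timestamp ahead of the commit, while `replay(None)` does not, since an untimestamped reader pins nothing.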



 Comments   
Comment by Githook User [ 29/May/19 ]

Author: Lingzhi Deng (ldennis) <lingzhi.deng@mongodb.com>

Message: SERVER-41178: TransactionHistoryIterator should do untimestamped reads during rollback recovery
Branch: master
https://github.com/mongodb/mongo/commit/97d1373931ac47dd2bc659dfd82a0086d58cf1dd

Comment by Alexander Gorrod [ 19/May/19 ]

Thanks for the write-up samy.lanka - it's very helpful.

Comment by Samyukta Lanka [ 17/May/19 ]

I haven't had time to create a repro, but I think this test would fail consistently if we set a failpoint that hangs the applier thread when the collection is the transactions table, and let the other applier thread finish applying the commitTransaction first. That way, we could ensure that the latest active read timestamp is set "in the future" by the TransactionHistoryIterator before trying to commit the wuow for the update to the transactions table, thus ensuring that this WT assertion is triggered. Note, this repro should be done on a base commit from before WT removed the assertion to make the build green again.
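
The forced interleaving proposed above can be sketched with two threads and an event standing in for the failpoint. This is a toy simulation under stated assumptions, not server code: `run_repro`, the event names, and the timestamp values are hypothetical, and the "assertion" is modeled as "commit timestamp is before the latest active read timestamp".

```python
import threading


def run_repro(read_ts, commit_ts):
    """Return True if the toy assertion fires under the forced interleaving.

    read_ts=None models an untimestamped TransactionHistoryIterator read.
    """
    active_read_ts = []                      # timestamps pinned by open readers
    lock = threading.Lock()
    failpoint_released = threading.Event()   # hypothetical failpoint on the transactions table writer
    update_done = threading.Event()
    result = {}

    def apply_commit_transaction():
        # Opens the oplog read first, as the thread applying commitTransaction would.
        with lock:
            if read_ts is not None:
                active_read_ts.append(read_ts)
        failpoint_released.set()             # reader is active; release the other writer
        update_done.wait(timeout=5)          # keep the reader open across the update's commit
        with lock:
            if read_ts is not None:
                active_read_ts.remove(read_ts)

    def apply_config_transactions_update():
        failpoint_released.wait(timeout=5)   # "hang" until the failpoint is turned off
        with lock:
            latest = max(active_read_ts, default=None)
        # Toy version of the check on committing the update.
        result["fired"] = latest is not None and commit_ts < latest
        update_done.set()

    threads = [
        threading.Thread(target=apply_config_transactions_update),
        threading.Thread(target=apply_commit_transaction),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["fired"]
```

With illustrative values, `run_repro(read_ts=20, commit_ts=8)` fires the toy assertion every time because the update is held back until the reader is open, whereas `run_repro(read_ts=None, commit_ts=8)` does not.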

Comment by Samyukta Lanka [ 17/May/19 ]

Took another look with judah.schvimer and we were able to determine the actual cause of the assertion failure.

The WT assertion fails not on the commitTransaction, but on the update to the config.transactions table. There is a race between the writer thread for the commitTransaction and the thread for the update to the transactions table. If the thread applying the commitTransaction goes first, it will do a timestamped read at the lastApplied timestamp while using the TransactionHistoryIterator. But since we do not update the lastAppliedOpTime on the rolling back node until after we're done recovering the oplog, the read will use a timestamp that is ahead of the common point (and therefore in the future when replaying the oplog). When the other thread tries to commit the wuow for the update to the transactions table, WiredTiger will complain that the commit timestamp is before the latest active reader.

Our first idea for a solution is to set the lastAppliedOpTime on the rollback node when it recovers to the stable timestamp.

Another potential solution would be for the TransactionHistoryIterator to read without a timestamp. Because it’s an oplog read, it should be safe to not make it a timestamped read.

Comment by Tess Avitabile (Inactive) [ 17/May/19 ]

That's correct, judah.schvimer. The issue described in the ticket is likely not the cause of the assertion failure. samy.lanka will continue to investigate.

Comment by Judah Schvimer [ 17/May/19 ]

The stable_timestamp is null for the entirety of initial sync because all timestamps during initial sync are inconsistent, but we advance the oldest_timestamp with lastApplied.

We do set roundup_timestamps=true though before committing during replication recovery and initial sync, and interestingly I now see that we do put transactions in prepare before committing them there too. Is this not the problematic code path for this ticket?

Comment by Tess Avitabile (Inactive) [ 17/May/19 ]

Another important reason to make this change is to avoid committing an unprepared transaction behind the stable timestamp. When replaying a commit oplog entry in rollback recovery for a transaction that was prepared before the stable timestamp, the commit timestamp may also be behind the stable timestamp. Thus it's necessary to prepare the transaction with roundup_timestamps=true before committing. (Actually, I'm confused about how rollback_recovery_commit_transaction_before_stable_timestamp.js could ever succeed if we do not prepare the transaction before committing.)

samy.lanka, judah.schvimer, my understanding is this issue affects startup recovery as well. Does it also affect initial sync? That is, when a commit oplog entry for a prepared transaction is applied during initial sync, do we commit the transaction operations without putting them through the prepare stage and timestamp them with a time that may be behind oldest_timestamp?
