[SERVER-35798] Writing an oplog entry for prepare should not push the lastApplied timestamp Created: 25/Jun/18 Updated: 29/Oct/23 Resolved: 07/Aug/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Xiangyu Yao (Inactive) | Assignee: | Judah Schvimer |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prepare_basic | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||||||
| Steps To Reproduce: | commit: 8c5002b06fd737f75941a73785ce75e3ca7f5ce1
2. repro.js:
|
||||||||||||||||||||||||||||||||
| Sprint: | Repl 2018-07-16, Repl 2018-07-30, Repl 2018-08-13 | ||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||
| Linked BF Score: | 70 | ||||||||||||||||||||||||||||||||
| Description |
|
Our current prepareTransaction() logic might need some changes: currently, (1) we first log the prepare and get a timestamp T. And then (2) use this timestamp T to set the prepare timestamp in WiredTiger. Since (1) pushes the lastApplied timestamp to T and another transaction could start between (1) and (2), with lastApplied T as the read timestamp, it breaks the rule that prepare timestamp should be greater than any active read timestamp.
Here is what happened in BF-9432: |
| Comments |
| Comment by Githook User [ 07/Aug/18 ] |
|
Author: {'username': 'judahschvimer', 'name': 'Judah Schvimer', 'email': 'judah@mongodb.com'}Message: |
| Comment by Ian Whalen (Inactive) [ 01/Aug/18 ] |
|
since it doesn't send an email when a link is added, just a heads up that this bug is blocking us from finishing |
| Comment by Judah Schvimer [ 05/Jul/18 ] |
|
I talked to geert.bosch again and we came to the conclusion that Replication should choose a prepareTimetamp, then prepare the transaction, and then write the prepare oplog entry. This means that once the transaction writes the oplog entry it can only be aborted by the coordinator, since nothing after that point can fail before we return success to the user. There is a period of time when a read can come in after we choose the prepareTimestamp but before we've prepared the transaction. This will now be safe though, because any read that cares, which is any with readConcern: snapshot or readConcern: *, afterClusterTime: > prepareTimestamp will call waitForAllEarlierOplogWritesToBeVisible as of As such, I propose this ticket's work item to be to put the transaction into prepare before writing the prepare oplog entry. |
| Comment by Andy Schwerin [ 29/Jun/18 ] |
|
I believe I agree. Reads at "local" read concern don't offer any guarantees that require blocking when accessing documents in the "prepared" state. They can safely read what Judah refers to as the pre-image. |
| Comment by Judah Schvimer [ 29/Jun/18 ] |
|
It is unclear to me why local reads ever block on prepared transactions given the above reasoning. To be consistent, I would expect them to have the same behavior as "available" reads and see the pre-image. |
| Comment by Spencer Brody (Inactive) [ 28/Jun/18 ] |
|
We should consider whether the current behavior of local RC reads to potentially read a newer value and then an older one (by reading an operation that has committed, but where there are still uncommitted writes with early commit times in flight) is actually behavior as designed or is actually a bug. CC schwerin |
| Comment by Judah Schvimer [ 26/Jun/18 ] |
|
geert.bosch pointed out that this isn't actually a problem for normal reads. Local reads not at a timestamp already sometimes read behind committed transactions, and reading behind prepared transactions is not a problem. This is only a problem for reads at a timestamp. |
| Comment by Louis Williams [ 26/Jun/18 ] |
|
I think that may be doable. keith.bostic is there anything I may not be considering here? |
| Comment by Judah Schvimer [ 26/Jun/18 ] |
|
louis.williams, the ideal behavior would be to allow that behavior for prepare_transaction and shortly after set the prepare_timestamp and then only block for reads higher than that. |
| Comment by Louis Williams [ 26/Jun/18 ] |
|
Answering the question about WiredTiger prepare without a timestamp: prepare_transaction could be modified so that when no timestamp is passed, it will return WT_PREPARE_CONFLICT for all reads or updates on documents that are part of a prepared transaction. The difference between this and the current behavior is that now all reads on a document at any timestamp, even before the prepared timestamp, will introduce prepare conflicts. Like Spencer said, MongoDB does a blocking wait for prepared transactions to commit or abort when conflicts are encountered, so readers will have no choice but to wait for a transaction to commit before reading. |
| Comment by Xiangyu Yao (Inactive) [ 26/Jun/18 ] |
|
siyuan.zhou Say we just open a transaction B at 12, and now we try to prepare the previous transaction A with prepare timestamp 10, then we don't know the change made in transaction A should be visible to transaction B because we could later set the commit timestamp at 11, 12 or later for transaction A. geert.bosch, louis.williams Are there any other reasons prepare timestamp should be greater than any open read timestamps? |
| Comment by Spencer Brody (Inactive) [ 26/Jun/18 ] |
|
siyuan.zhou, it's important that the prepare timestamp be greater than the read timestamp of any running read transactions as all reads at timestamps greater than the prepare timestamp that read documents modified in the prepared transaction are supposed to block until the transaction is committed so they can know what the commit time is and thus whether they should read the pre or post image of the document. Without the invariant that the prepare timestamp is greater than the read timestamp of all running transactions then reads at timestamps higher than the prepare timestamp could read the document pre-image when they were supposed to return the post-image. I think the issue here isn't just that the prepare oplog entry pushes the lastApplied timestamp forward, as it'd always be possible for a new write to come in and push the lastApplied forward. So I think the only options are to either atomically allocate an optime and prepare the WUOW, while guaranteeing that no new optimes are handed out during that window, which is machinery we don't currently have and seems likely to introduce a perf hit, or to do what Judah proposed in the last comment - to effectively have two-phases to putting a WT transaction into prepare state. |
| Comment by Judah Schvimer [ 26/Jun/18 ] |
|
After talking to schwerin and spencer, one idea we think could work well is if we were able to call prepareTransaction() with no timestamp and block all reads on the affected documents, and later after writing the oplog entry and getting a prepareTimestamp we can set the prepareTimestamp and the blocking will begin to only happen on reads after the prepareTimestamp.geert.bosch and louis.williams, what are your thoughts? |
| Comment by Siyuan Zhou [ 26/Jun/18 ] |
|
xiangyu.yao, could you please explain why we disallow setting prepareTimestamp that's lower than or equal to the read timestamp of any running WT transactions? When discussing the proposal of reserving prepare timestamp, we were worried about the cost of WUOW::prepare() inside a mutex. Could you please shed some light on that? |
| Comment by Judah Schvimer [ 26/Jun/18 ] |
|
Potential solutions: |