[SERVER-38499] Preparing transaction fails and triggers invariant if chosen timestamp is not greater than WiredTiger's latest active read timestamp Created: 10/Dec/18 Updated: 29/Oct/23 Resolved: 25/Jan/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.8 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | Daniel Gottlieb (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prepare_basic | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||||||||||||||
| Sprint: | Repl 2018-12-17, Repl 2019-01-14, Storage NYC 2019-01-28 | ||||||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||||||||||||||||||||||||||
| Linked BF Score: | 77 | ||||||||||||||||||||||||||||||||||||||||
| Description |
|
To prepare a transaction, a slot is reserved in the oplog, and then the slot's opTime timestamp is used to prepare the transaction in WiredTiger. If a new WiredTiger transaction begins during this window (i.e. after reserving a time, before it is given to WT) and starts reading at a timestamp >= the reserved prepare timestamp, WiredTiger will fail to prepare the transaction because "the prepare timestamp must be later/greater than the latest active read timestamp", failing this invariant. Example crash (from indexed_insert_large_noindex.js failure in this evergreen patch):
|
| Comments |
| Comment by Githook User [ 25/Jan/19 ] | |||||||||||||||||
|
Author: {'username': 'dgottlieb', 'email': 'daniel.gottlieb@mongodb.com', 'name': 'Daniel Gottlieb'}Message: WiredTiger guards against transactions preparing with a timestamp The oplog is a special case. Because the oplog does not contain However, WiredTiger is not aware the oplog is special. When MongoDB There were two strategies available for resolving this. The first is | |||||||||||||||||
| Comment by Judah Schvimer [ 16/Jan/19 ] | |||||||||||||||||
|
After discussion, daniel.gottlieb will make sure that oplog readers don’t do timestamped reads, so that they don’t contribute to the latest active read timestamp. This won’t have any effect on majority commit latency.
This should enable us to keep the invariant exactly as it exists today. | |||||||||||||||||
| Comment by Judah Schvimer [ 13/Jan/19 ] | |||||||||||||||||
|
agorrod, do you have any thoughts on the above options, especially regarding relaxing the invariants in WT? | |||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 07/Jan/19 ] | |||||||||||||||||
|
I hadn't thought about the fact that for Option 7, the invariant would have to differ on primaries and secondaries, so it couldn't be done in the storage layer without additional inputs. Thank you for thinking it through. Now I don't like that choice as much. Now I prefer Option 2 again. Or Option 5, but it seems like a lot of work. daniel.gottlieb also pointed out that Option 6 would slow majority write acknowledgment even when chaining is not used, since majority acknowledgment does not occur until the write is behind the all_committed on a majority of nodes. | |||||||||||||||||
| Comment by Judah Schvimer [ 07/Jan/19 ] | |||||||||||||||||
|
tess.avitabile, daniel.gottlieb, and I discussed this and here are our thoughts and conclusions. There are two problems. daniel.gottlieb points out that there is no fundamental reason why local and majority multi-document transactions are given a read timestamp. They already do not read at a snapshot (since they do not read at all_committed ) so they may as well read the most recent data like a normal local read. Read-only transactions would just have to consult the system last optime to see what timestamp to wait for write concern on, rather than looking at the read timestamp. This means that on primaries all reads at a timestamp would specify ignore_prepared=false. This doesn't work on secondaries though because oplog readers must read at a timestamp. To solve the secondary case there are a few options. Option 7 seems like the best way forward. tess.avitabile and daniel.gottlieb, please provide any thoughts. | |||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 04/Jan/19 ] | |||||||||||||||||
Oplog readers are assigned a read timestamp at the no holes point, as tracked by storage. Secondary oplog application can mess with storage's notion of the no holes point because transactions don't begin in timestamp order (compared to primary's where transactions may not commit in order, a simpler problem). I'm still curious if it's actually necessary to assign a read timestamp to (recovery unit) transactions if those timestamps can be ahead of the no-holes point. | |||||||||||||||||
| Comment by Judah Schvimer [ 03/Jan/19 ] | |||||||||||||||||
|
I filed | |||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 03/Jan/19 ] | |||||||||||||||||
|
I believe we also still need to relax the invariant to say that a transaction must not be prepared behind the read timestamp of any storage transaction with ignore_prepared=false. | |||||||||||||||||
| Comment by Judah Schvimer [ 03/Jan/19 ] | |||||||||||||||||
|
Ok, thanks for the correction. So it seems that the only work item here is fixing ignore_prepared for multi-document transactions with local and majority read concerns. daniel.gottlieb and tess.avitabile, please correct me if I'm missing anything. | |||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 03/Jan/19 ] | |||||||||||||||||
|
For transactions with local and majority read concern, we read from the lastApplied, rather than the all-committed. We set the timestamp read source here based on the original read concern, which is obtained here. It is important that transactions with local and majority read concern read from the lastApplied so that back-to-back transactions can read their own writes, as described in | |||||||||||||||||
| Comment by Judah Schvimer [ 02/Jan/19 ] | |||||||||||||||||
|
Ah, you're referring to multi-document transactions with local read concern, not storage-transactions with local read concern. We would only need to fix that bug when we actually support local and majority read concern for transactions. Right now since everything is upconverted to snapshot, they should be reading at a time earlier than the all-committed. If not, then we're not actually upconverting the read concern correctly and we're not actually providing snapshot isolation. | |||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 02/Jan/19 ] | |||||||||||||||||
|
We set ignore_prepared=false here for snapshot reads. However, this is based on the upconverted read concern, rather than the original readConcern. IIUC, this means we do not ignore prepare conflicts for transactions with readConcern local or majority. | |||||||||||||||||
| Comment by Judah Schvimer [ 02/Jan/19 ] | |||||||||||||||||
tess.avitabile, what bug are you referring to? This sounds like
daniel.gottlieb, why would oplog readers use snapshot isolation with a timestamp in the middle of the current batch? I would expect them to either read with local read concern or with snapshot isolation at "lastApplied" which would be at the end of the previous batch. | |||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 02/Jan/19 ] | |||||||||||||||||
|
A clarification. After talking with tess.avitabile, we realized the hypothesis for where we had "violating readers" in the system was incomplete. There were test failures on secondaries which I believe have the following sequence:
Due to oplog application first writing out to the oplog followed by applying the entry in separate transactions, oplog readers can slice between the two operations with a read timestamp that's ahead of the imminent prepare time. | |||||||||||||||||
| Comment by Tess Avitabile (Inactive) [ 02/Jan/19 ] | |||||||||||||||||
|
If we relax this invariant, we will need to fix the bug where transactions reading with local or majority readConcern set ignore_prepared=false. Transactions with local or majority readConcern read at the lastApplied, which may be ahead of the all-committed, so a transaction could prepare behind their read timestamp. | |||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 02/Jan/19 ] | |||||||||||||||||
|
I don't think I have full perspective on the problem at this time to recommend that removing/adding precision to the WT error is the right course of action. Creating a precise error check, I believe, adds more computation to a hot critical section. Removing the error check results in difficult bugs to make manifest and diagnose. A third option would be to have the precise error check in WT "diagnostic" builds and no error check otherwise. Though, I'm not sure if that's what our build configuration does on debug builds today and changing things might discover some complications. I accept your statement about why preparing behind some readers that set a read timestamps (with ignore_prepare) is fine, but I haven't been convinced of the following:
| |||||||||||||||||
| Comment by Judah Schvimer [ 02/Jan/19 ] | |||||||||||||||||
|
Are we in agreement then that the work here is to relax the WT invariant? daniel.gottlieb, since this invariant is in WT, should I assign this ticket to the storage team to relax the invariant? | |||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 28/Dec/18 ] | |||||||||||||||||
|
I've disabled tests that depend on this ticket being resolved in | |||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 28/Dec/18 ] | |||||||||||||||||
Ah, thanks siyuan.zhou. Now I follow. At least from my testing, WT does not allow preparing behind any transactions with a read timestamp, regardless of their ignore_prepared setting. I believe you're right; the direct way to implement a precise amount of leniency would require a separately maintained data structure. | |||||||||||||||||
| Comment by Siyuan Zhou [ 27/Dec/18 ] | |||||||||||||||||
|
daniel.gottlieb, yes, I believe this is the issue described in this ticket, assuming "Timestamp :commit 10 (creates hole)" means reserving the OplogSlot and calling RecoveryUnit::setPrepareTimestamp().
My understanding of Judah's comments is we can keep the invariant but only for reads with ignore_prepared=false. This probably means we'd have to track another timestamp for active reads with ignore_prepared=false in WT. | |||||||||||||||||
| Comment by Daniel Gottlieb (Inactive) [ 27/Dec/18 ] | |||||||||||||||||
|
I understand judah.schvimer and tess.avitabile are on vacation. cc siyuan.zhou The argument Judah gives about guarantees makes sense to me. To put those words into a concrete example, this would be an expected sequence of commands/outcomes:
siyuan.zhou can you confirm this is the scenario this ticket is describing as a problem? If the invariant is removed, I'm still not sure WT would behave as required w.r.t the reader. | |||||||||||||||||
| Comment by Judah Schvimer [ 13/Dec/18 ] | |||||||||||||||||
For a local, available, or majority read without afterClusterTime we don't make any guarantees about what point in time you read at. If a storage-transaction for a regular update is in flight concurrently with the read, there are no guarantees about whether or not you see that update. The same applies for prepared transactions. Non-timestamped reads that occur concurrently with a prepared transaction should safely be able to read the pre or post-image of the transaction.
What shard version info are you referring to? And when would we assume snapshot semantics without providing snapshot read concern, or doing everything it does to ensure proper snapshot semantics? Can we keep the invariant there only for reads with ignore_prepared=false? This should be set correctly after The only alternative I can see to this is (from
| |||||||||||||||||
| Comment by Geert Bosch [ 12/Dec/18 ] | |||||||||||||||||
|
It seems to me that we still need the invariant. If we read before the prepare time with a normal read operation, say at the last applied time on a secondary, we would read a version that may or may not have been correct at our time. IIUC, you're saying that in most cases we don't really care about snapshot semantics, and this is OK. At the same time we're trying to use the fact that all reads happen at a specific time to be able to correlate shard version info. So, I'm concerned that in some cases we assume snapshot semantics where everything happens at a specific timestamp, and other cases we assume that we don't actually care about snapshot semantics. | |||||||||||||||||
| Comment by Geert Bosch [ 12/Dec/18 ] | |||||||||||||||||
|
judah.schvimerI'm not clear on all the intricacies here. Maybe we can discuss tomorrow? | |||||||||||||||||
| Comment by Judah Schvimer [ 12/Dec/18 ] | |||||||||||||||||
|
This was first discussed in
I think the outcome of this discussion that we didn't follow through on was relaxing the above invariant in WT. geert.bosch, do you agree? Should we move this to be a WT ticket to relax that invariant? |