[SERVER-38499] Preparing transaction fails and triggers invariant if chosen timestamp is not greater than WiredTiger's latest active read timestamp Created: 10/Dec/18  Updated: 29/Oct/23  Resolved: 25/Jan/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.8

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Daniel Gottlieb (Inactive)
Resolution: Fixed Votes: 0
Labels: prepare_basic
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-38832 Allow a transaction to be prepared be... Closed
depends on SERVER-38906 Multi-document transactions should no... Closed
Gantt Dependency
has to be done before SERVER-38569 Unblacklist remove_and_bulk_insert.js... Closed
has to be done before SERVER-40145 Re-enable test suites that exercise o... Closed
Related
related to SERVER-35798 Writing an oplog entry for prepare sh... Closed
related to SERVER-57443 [ephemeralForTest] fix oplog visibili... Closed
is related to SERVER-36382 only snapshot, linearizable, and afte... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2018-12-17, Repl 2019-01-14, Storage NYC 2019-01-28
Participants:
Case:
Linked BF Score: 77

 Description   

To prepare a transaction, a slot is reserved in the oplog, and then the slot's opTime timestamp is used to prepare the transaction in WiredTiger. If a new WiredTiger transaction begins during this window (i.e. after reserving a time, before it is given to WT) and starts reading at a timestamp >= the reserved prepare timestamp, WiredTiger will fail to prepare the transaction because "the prepare timestamp must be later/greater than the latest active read timestamp", failing this invariant.
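The race can be modelled with a toy stand-in for the storage engine. This is a minimal sketch, not WiredTiger's API: `ToyEngine`, `begin_read`, `prepare`, and `run_race` are hypothetical names, and the only rule modelled is the one quoted above (the prepare timestamp must be later than every active read timestamp).

```python
class ToyEngine:
    """Hypothetical stand-in for the storage engine; not the WiredTiger API."""

    def __init__(self):
        self.active_read_timestamps = []

    def begin_read(self, read_ts):
        # A reader pins its read timestamp for as long as it stays open.
        self.active_read_timestamps.append(read_ts)

    def prepare(self, prepare_ts):
        # Modelled rule: prepare must be later than every active read timestamp.
        latest_read = max(self.active_read_timestamps, default=None)
        if latest_read is not None and prepare_ts <= latest_read:
            raise ValueError(
                f"prepare timestamp {prepare_ts} not later than "
                f"an active read timestamp {latest_read}"
            )
        return prepare_ts


def run_race(reader_ts):
    engine = ToyEngine()
    reserved_ts = 6               # step 1: reserve an oplog slot
    engine.begin_read(reader_ts)  # a reader starts inside the window
    try:
        engine.prepare(reserved_ts)  # step 2: hand the reserved time to the engine
        return "prepared"
    except ValueError:
        return "rejected"
```

Under these assumptions, `run_race(7)` mirrors the crash below (reader at 7, prepare at 6), while a reader at an earlier timestamp such as 5 lets the prepare through.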

Example crash (from indexed_insert_large_noindex.js failure in this evergreen patch): 

[ShardedClusterFixture:job0:shard1:primary] 2018-12-06T19:01:36.043+0000 E STORAGE  [conn13085] WiredTiger error (22) [1544122896:43869][47134:0x7f8d88774700], WT_SESSION.prepare_transaction: __wt_txn_parse_prepare_timestamp, 722: prepare timestamp 5C09721000000006 not later than an active read timestamp 5c09721000000007 : Invalid argument Raw: [1544122896:43869][47134:0x7f8d88774700], WT_SESSION.prepare_transaction: __wt_txn_parse_prepare_timestamp, 722: prepare timestamp 5C09721000000006 not later than an active read timestamp 5c09721000000007 : Invalid argument
[ShardedClusterFixture:job0:shard1:primary] 2018-12-06T19:01:36.043+0000 F -        [conn13085] Invariant failure: s->prepare_transaction(s, conf.c_str()) resulted in status BadValue: 22: Invalid argument at src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp 185
[ShardedClusterFixture:job0:shard1:primary] 2018-12-06T19:01:36.044+0000 F -        [conn13085]
[ShardedClusterFixture:job0:shard1:primary] 
[ShardedClusterFixture:job0:shard1:primary] ***aborting after invariant() failure
[ShardedClusterFixture:job0:shard1:primary] 
[ShardedClusterFixture:job0:shard1:primary] 
[ShardedClusterFixture:job0:shard1:primary] 2018-12-06T19:01:36.075+0000 F -        [conn13085] Got signal: 6 (Aborted).
[ShardedClusterFixture:job0:shard1:primary]  0x7f8dbfb638e1 0x7f8dbfb62af9 0x7f8dbfb62fdd 0x7f8dbc1307e0 0x7f8dbbdbf495 0x7f8dbbdc0c75 0x7f8dbe0d8f79 0x7f8dbe1a31ee 0x7f8dbfa9f6c1 0x7f8dbf07e573 0x7f8dbe394c45 0x7f8dbe395595 0x7f8dbe56e834 0x7f8dbe570ca0 0x7f8dbe5728be 0x7f8dbe57369f 0x7f8dbe56003a 0x7f8dbe56c6be 0x7f8dbe56822f 0x7f8dbe56b63d 0x7f8dbf2dbb42 0x7f8dbe565da8 0x7f8dbe569104 0x7f8dbe5673cc 0x7f8dbe5682c1 0x7f8dbe56b63d 0x7f8dbf2dc0b5 0x7f8dbfa9e9c4 0x7f8dbc128aa1 0x7f8dbbe75bdd
[ShardedClusterFixture:job0:shard1:primary] ----- BEGIN BACKTRACE -----
[ShardedClusterFixture:job0:shard1:primary] {"backtrace":[{"b":"7F8DBD672000","o":"24F18E1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"7F8DBD672000","o":"24F0AF9"},{"b":"7F8DBD672000","o":"24F0FDD"},{"b":"7F8DBC121000","o":"F7E0"},{"b":"7F8DBBD8D000","o":"32495","s":"gsignal"},{"b":"7F8DBBD8D000","o":"33C75","s":"abort"},{"b":"7F8DBD672000","o":"A66F79","s":"_ZN5mongo24invariantOKFailedWithMsgEPKcRKNS_6StatusERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j"},{"b":"7F8DBD672000","o":"B311EE","s":"_ZN5mongo22WiredTigerRecoveryUnit17prepareUnitOfWorkEv"},{"b":"7F8DBD672000","o":"242D6C1","s":"_ZN5mongo15WriteUnitOfWork7prepareEv"},{"b":"7F8DBD672000","o":"1A0C573","s":"_ZN5mongo22TransactionParticipant18prepareTransactionEPNS_16OperationContextEN5boost8optionalINS_4repl6OpTimeEEE"},{"b":"7F8DBD672000","o":"D22C45"},{"b":"7F8DBD672000","o":"D23595"},{"b":"7F8DBD672000","o":"EFC834"},{"b":"7F8DBD672000","o":"EFECA0"},{"b":"7F8DBD672000","o":"F008BE"},{"b":"7F8DBD672000","o":"F0169F","s":"_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageERKNS0_5HooksE"},{"b":"7F8DBD672000","o":"EEE03A","s":"_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE"},{"b":"7F8DBD672000","o":"EFA6BE","s":"_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE"},{"b":"7F8DBD672000","o":"EF622F","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"7F8DBD672000","o":"EF963D"},{"b":"7F8DBD672000","o":"1C69B42","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE"},{"b":"7F8DBD672000","o":"EF3DA8","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE"},{"b":"7F8DBD672000","o":"EF7104","s":"_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE"},{"b":"7F8DBD672000","o"
:"EF53CC","s":"_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE"},{"b":"7F8DBD672000","o":"EF62C1","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"7F8DBD672000","o":"EF963D"},{"b":"7F8DBD672000","o":"1C6A0B5"},{"b":"7F8DBD672000","o":"242C9C4"},{"b":"7F8DBC121000","o":"7AA1"},{"b":"7F8DBBD8D000","o":"E8BDD","s":"clone"}],"processInfo":{ "mongodbVersion" : "4.1.6-32-gb18a6b96d0-patch-5c096a752a60ed2d89759d97", "gitVersion" : "b18a6b96d0c28bc98a1e16b7de0d5f104fd3c937", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "2.6.32-220.el6.x86_64", "version" : "#1 SMP Wed Nov 9 08:03:13 EST 2011", "machine" : "x86_64" }, "somap" : [ { "b" : "7F8DBD672000", "elfType" : 3, "buildId" : "57732CF5367F00101F6255B61CA5CD0571C83E54" }, { "b" : "7FFF242FF000", "elfType" : 3, "buildId" : "08F634A1D22DEFF00461D50A7699DACDC97657BF" }, { "b" : "7F8DBD235000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "F0BE1166EDCFFB2422B940D601A1BBD89352D80F" }, { "b" : "7F8DBCE50000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "1EDB45C205A844A75EBBB4F0075E705803FFB85B" }, { "b" : "7F8DBCBE4000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "D256E285C5E11D9A99EB04CA7651003A8F67B64E" }, { "b" : "7F8DBC9E0000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "1F7E85410384392BC51FA7324961719A10125F31" }, { "b" : "7F8DBC7D8000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "FDF3A36FFFE08375456D59DA959EAB2FC30B6186" }, { "b" : "7F8DBC554000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "8A852AC42F0B64F0F30C760EBBCFA3FE4A228F12" }, { "b" : "7F8DBC33E000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "EDC925E58FE28DCA536993EB13179C739F1E6566" }, { "b" : "7F8DBC121000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "85104ECFE42C606B31C2D0D0D2E5DACD3286A341" }, { "b" : "7F8DBBD8D000", "path" : "/lib64/libc.so.6", "elfType" 
: 3, "buildId" : "8E3AACE76351B6A83390CA065E904EB82FBD1EC7" }, { "b" : "7F8DBD44F000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "1CC2165E019D43F71FDE0A47AF9F4C8EB5E51963" }, { "b" : "7F8DBBB77000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "D053BB4FF0C2FC983842F81598813B9B931AD0D1" }, { "b" : "7F8DBB933000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "0C249DF4D77989253CCD859956BF50749308A16A" }, { "b" : "7F8DBB64C000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "624C7056B8BBE6BA758DEF557F516FBDBD01E1FD" }, { "b" : "7F8DBB448000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "57F77704A7F1F4E3689D028D3F9ADD4E77486EC9" }, { "b" : "7F8DBB21C000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "C81673692EEF670BC951EE726490F5D1CAB822F4" }, { "b" : "7F8DBB011000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "03B69EEB8998AC9CA7519A27571BAD976BA4C56D" }, { "b" : "7F8DBAE0E000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "3BCCABE75DC61BBA81AAE45D164E26EF4F9F55DB" }, { "b" : "7F8DBABEF000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "B4576BE308DDCF7BC31F7304E4734C3D846D0236" } ] }}
[ShardedClusterFixture:job0:shard1:primary]  mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7f8dbfb638e1]
[ShardedClusterFixture:job0:shard1:primary]  mongod(+0x24F0AF9) [0x7f8dbfb62af9]
[ShardedClusterFixture:job0:shard1:primary]  mongod(+0x24F0FDD) [0x7f8dbfb62fdd]
[ShardedClusterFixture:job0:shard1:primary]  libpthread.so.0(+0xF7E0) [0x7f8dbc1307e0]
[ShardedClusterFixture:job0:shard1:primary]  libc.so.6(gsignal+0x35) [0x7f8dbbdbf495]
[ShardedClusterFixture:job0:shard1:primary]  libc.so.6(abort+0x175) [0x7f8dbbdc0c75]
[ShardedClusterFixture:job0:shard1:primary]  mongod(_ZN5mongo24invariantOKFailedWithMsgEPKcRKNS_6StatusERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j+0x0) [0x7f8dbe0d8f79]
[ShardedClusterFixture:job0:shard1:primary]  mongod(_ZN5mongo22WiredTigerRecoveryUnit17prepareUnitOfWorkEv+0x2DE) [0x7f8dbe1a31ee]
[ShardedClusterFixture:job0:shard1:primary]  mongod(_ZN5mongo15WriteUnitOfWork7prepareEv+0x31) [0x7f8dbfa9f6c1]
[ShardedClusterFixture:job0:shard1:primary]  mongod(_ZN5mongo22TransactionParticipant18prepareTransactionEPNS_16OperationContextEN5boost8optionalINS_4repl6OpTimeEEE+0x143) [0x7f8dbf07e573]
[ShardedClusterFixture:job0:shard1:primary]  mongod(+0xD22C45) [0x7f8dbe394c45]
[ShardedClusterFixture:job0:shard1:primary]  mongod(+0xD23595) [0x7f8dbe395595]
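
The two hexadecimal values in the WiredTiger error are 64-bit timestamps. Assuming the usual BSON Timestamp layout (seconds since the epoch in the high 32 bits, an increment in the low 32 bits), they can be decoded as below; `decode_timestamp` is a helper name invented for this sketch.

```python
def decode_timestamp(hex_ts):
    """Split a 64-bit hex timestamp into (seconds, increment)."""
    value = int(hex_ts, 16)
    seconds = value >> 32            # high 32 bits
    increment = value & 0xFFFFFFFF   # low 32 bits
    return seconds, increment


prepare_ts = decode_timestamp("5C09721000000006")
read_ts = decode_timestamp("5c09721000000007")
```

Both values fall in the same second (1544122896, which matches the prefix of the WiredTiger message) and differ only in the increment: the prepare is at 6 and the active reader is at 7.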



 Comments   
Comment by Githook User [ 25/Jan/19 ]

Author:

{'username': 'dgottlieb', 'email': 'daniel.gottlieb@mongodb.com', 'name': 'Daniel Gottlieb'}

Message: SERVER-38499: Enforce oplog visibility at the MongoDB layer.

WiredTiger guards against transactions preparing with a timestamp
earlier than the most recent reader. This guarantees no reader may
have seen the wrong version of a document.

The oplog is a special case. Because the oplog does not contain
prepared updates, and oplog readers cannot read from other
collections, it's valid to prepare behind an oplog reader's
timestamp.

However, WiredTiger is not aware the oplog is special. When MongoDB
uses WiredTiger `read_timestamp`s to enforce oplog visibility, there
are cases (specifically, secondary oplog application) where an oplog
reader can be in front of an impending prepare.

There were two strategies available for resolving this. The first is
to artificially hold back what oplog is available to read at until
nothing can be prepared behind an oplog reader. The second strategy,
which is what this patch does, is to have the MongoDB layer hide
documents that are newer than the visibility point. The mechanism for
calculating and discovering the visibility point is unchanged.
Branch: master
https://github.com/mongodb/mongo/commit/5f213f2d419d9549559281fef7d3704ad7614d12
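
The second strategy can be pictured as a filter applied above the storage engine. The sketch below is illustrative only and is not the code from the commit: `visible_oplog_entries`, the tuple layout, and the sample values are assumptions. The reader opens no storage-level read timestamp; entries newer than the visibility point are simply not returned.

```python
def visible_oplog_entries(oplog, visibility_point):
    """Yield (ts, op) pairs in order, stopping at entries newer than the visibility point.

    Assumes `oplog` is sorted by timestamp.
    """
    for ts, op in oplog:
        if ts > visibility_point:
            break
        yield ts, op


oplog = [(8, "insert"), (9, "update"), (10, "prepare"), (11, "insert")]
seen = list(visible_oplog_entries(oplog, visibility_point=9))
```

Because the reader in this model holds no engine-level read timestamp, it contributes nothing to the latest active read timestamp and cannot block a prepare at timestamp 10.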

Comment by Judah Schvimer [ 16/Jan/19 ]

After discussion, daniel.gottlieb will make sure that oplog readers don’t do timestamped reads, so that they don’t contribute to the latest active read timestamp. This won’t have any effect on majority commit latency.

SERVER-38906 will ensure that all reads on a primary that have a timestamp occur behind the all_committed.

This should enable us to keep the invariant exactly as it exists today.

Comment by Judah Schvimer [ 13/Jan/19 ]

agorrod, do you have any thoughts on the above options, especially regarding relaxing the invariants in WT?

Comment by Tess Avitabile (Inactive) [ 07/Jan/19 ]

I hadn't thought about the fact that for Option 7, the invariant would have to differ on primaries and secondaries, so it couldn't be done in the storage layer without additional inputs. Thank you for thinking it through. Now I don't like that choice as much. Now I prefer Option 2 again. Or Option 5, but it seems like a lot of work.

daniel.gottlieb also pointed out that Option 6 would slow majority write acknowledgment even when chaining is not used, since majority acknowledgment does not occur until the write is behind the all_committed on a majority of nodes.

Comment by Judah Schvimer [ 07/Jan/19 ]

tess.avitabile, daniel.gottlieb, and I discussed this and here are our thoughts and conclusions.

There are two problems.
1) Local and Majority Multi-document Transactions read at lastApplied rather than all_committed, and transactions can be prepared at a timestamp less than the lastApplied, but must be greater than the all_committed since they create an oplog hole themselves.
2) Oplog readers on secondaries read at all_committed. After writing the oplog entries in a batch, all_committed can advance to the end of the batch. Thus an oplog reader can start reading at a timestamp greater than that of oplog entries that have been written but not applied. One of these oplog entries might try to prepare a transaction at a timestamp in the middle of the batch, even though there is an oplog reader open at the timestamp of the end of the batch.

daniel.gottlieb points out that there is no fundamental reason why local and majority multi-document transactions are given a read timestamp. They already do not read at a snapshot (since they do not read at all_committed), so they may as well read the most recent data like a normal local read. Read-only transactions would just have to consult the system last optime to see what timestamp to wait for write concern on, rather than looking at the read timestamp. This means that on primaries all reads at a timestamp would specify ignore_prepared=false.

This doesn't work on secondaries though because oplog readers must read at a timestamp. To solve the secondary case there are a few options.
1) Relax the invariant completely: Allow a transaction to be prepared behind the read timestamp of any storage transactions. This one is easy but removes a valuable invariant.
2) Relax the invariant partially (SERVER-38832): Allow a transaction to be prepared behind the read timestamp of storage transactions with ignore_prepared=true (oplog readers set ignore_prepared=true). This is likely a hard invariant to construct, since it requires tracking extra data.
3) Relax the invariant on secondaries: Pass a parameter to prepare_transaction that says whether or not we want to check the invariant. The invariant is still something that we want to maintain, but this would be a middle ground between the above 2 options in terms of safety vs. amount of work.
4) Relax the invariant for oplog readers: This would require oplog readers to specify that they should not be included in the "read timestamp list". This may have unforeseen side effects and would make the oplog more special for better or for worse.
5) Maintain two different ideas of all_committed: one for data holes and one for oplog holes. This is a major change, backtracking a bit towards where we were in 3.4.
6) Delay moving forward all_committed to the end of a batch until the batch has been applied. tess.avitabile pointed out that this will slow down replication when chaining is in use and is an undesirable solution.
7) Replace the invariant with "Readers with ignore_prepared=false set must be reading behind the all_committed timestamp". This somewhat flips the invariant around, putting the onus on the readers rather than on the preparer. On primaries we could add the invariant that "Transactions must be prepared at a timestamp greater than or equal to the all_committed timestamp". Together these mean that "Prepared transactions must be prepared at a timestamp greater than ignore_prepared=false readers". On secondaries we could invariant that "Transactions must be prepared at a timestamp greater than or equal to lastApplied" and we could also invariant that on secondaries "Readers with ignore_prepared=false set must be reading behind the lastApplied timestamp". These two together would mean that "Prepared transactions must be prepared at a timestamp greater than ignore_prepared=false readers".
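
The argument in Option 7 is a transitivity check around a boundary timestamp (all_committed on primaries, lastApplied on secondaries). A minimal sketch, assuming "behind" means at or before the boundary and, so that the conclusion is strict, that prepares land strictly after it (the option text itself says "greater than or equal to"); the function names are hypothetical.

```python
from itertools import product


def reader_ok(read_ts, boundary):
    # Reader-side invariant: ignore_prepared=false readers stay at or behind the boundary.
    return read_ts <= boundary


def prepare_ok(prepare_ts, boundary):
    # Preparer-side invariant (assumed strict here): prepare lands after the boundary.
    return prepare_ts > boundary


def combined_holds(limit=20):
    # Exhaustively check small timestamps: both invariants imply prepare_ts > read_ts.
    for read_ts, prepare_ts, boundary in product(range(limit), repeat=3):
        if reader_ok(read_ts, boundary) and prepare_ok(prepare_ts, boundary):
            if not prepare_ts > read_ts:
                return False
    return True
```

If both comparisons are non-strict, a reader and a prepare can sit at exactly the boundary, so at least one side has to be strict for the combined statement to follow.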

Option 7 seems like the best way forward. tess.avitabile and daniel.gottlieb, please provide any thoughts.
CCing agorrod and geert.bosch as well.

Comment by Daniel Gottlieb (Inactive) [ 04/Jan/19 ]

Daniel Gottlieb, why would oplog readers use snapshot isolation with a timestamp in the middle of the current batch? I would expect them to either read with local read concern or with snapshot isolation at "lastApplied" which would be at the end of the previous batch.

Oplog readers are assigned a read timestamp at the no holes point, as tracked by storage.

Secondary oplog application can mess with storage's notion of the no holes point because transactions don't begin in timestamp order (compared to primaries, where transactions may not commit in order, a simpler problem).

I'm still curious if it's actually necessary to assign a read timestamp to (recovery unit) transactions if those timestamps can be ahead of the no-holes point.

Comment by Judah Schvimer [ 03/Jan/19 ]

I filed SERVER-38832 to relax the invariant. This ticket will be for the replication work to correctly set "ignore_prepared".

Comment by Tess Avitabile (Inactive) [ 03/Jan/19 ]

I believe we also still need to relax the invariant to say that a transaction must not be prepared behind the read timestamp of any storage transaction with ignore_prepared=false.

Comment by Judah Schvimer [ 03/Jan/19 ]

Ok, thanks for the correction. So it seems that the only work item here is fixing ignore_prepared for multi-document transactions with local and majority read concerns. daniel.gottlieb and tess.avitabile, please correct me if I'm missing anything.

Comment by Tess Avitabile (Inactive) [ 03/Jan/19 ]

For transactions with local and majority read concern, we read from the lastApplied, rather than the all-committed. We set the timestamp read source here based on the original read concern, which is obtained here. It is important that transactions with local and majority read concern read from the lastApplied so that back-to-back transactions can read their own writes, as described in SERVER-38204. I agree that they are not providing snapshot isolation--converting the read concern to snapshot internally is simply convenient for the implementation. Since these transactions can read at a time later than the all-committed, they must have ignore_prepared=true or they will still trigger the invariant after the proposed change.

Comment by Judah Schvimer [ 02/Jan/19 ]

Ah, you're referring to multi-document transactions with local read concern, not storage-transactions with local read concern. We would only need to fix that bug when we actually support local and majority read concern for transactions. Right now since everything is upconverted to snapshot, they should be reading at a time earlier than the all-committed. If not, then we're not actually upconverting the read concern correctly and we're not actually providing snapshot isolation.

Comment by Tess Avitabile (Inactive) [ 02/Jan/19 ]

We set ignore_prepared=false here for snapshot reads. However, this is based on the upconverted read concern, rather than the original readConcern. IIUC, this means we do not ignore prepare conflicts for transactions with readConcern local or majority.

Comment by Judah Schvimer [ 02/Jan/19 ]

we will need to fix the bug where transactions reading with local or majority readConcern set ignore_prepared=false

tess.avitabile, what bug are you referring to? This sounds like SERVER-36382 which was already fixed.

oplog readers can slice between the two operations with a read timestamp that's ahead of the imminent prepare time

daniel.gottlieb, why would oplog readers use snapshot isolation with a timestamp in the middle of the current batch? I would expect them to either read with local read concern or with snapshot isolation at "lastApplied" which would be at the end of the previous batch.

Comment by Daniel Gottlieb (Inactive) [ 02/Jan/19 ]

A clarification. After talking with tess.avitabile, we realized the hypothesis for where we had "violating readers" in the system was incomplete. There were test failures on secondaries which I believe have the following sequence:

| Oplog Applier        | Oplog Manager                 | Oplog Reader              |
|----------------------+-------------------------------+---------------------------|
| Begin                |                               |                           |
| Write Oplog 10       |                               |                           |
| Timestamp :commit 10 |                               |                           |
| Commit               |                               |                           |
|                      | Update oplog visibility to 10 |                           |
|                      |                               | Begin :isolation snapshot |
|                      |                               | Timestamp :readAt 10      |
| Begin                |                               |                           |
| Write A 1            |                               |                           |
| Prepare 10           |                               |                           |

Due to oplog application first writing out to the oplog followed by applying the entry in separate transactions, oplog readers can slice between the two operations with a read timestamp that's ahead of the imminent prepare time.

Comment by Tess Avitabile (Inactive) [ 02/Jan/19 ]

If we relax this invariant, we will need to fix the bug where transactions reading with local or majority readConcern set ignore_prepared=false. Transactions with local or majority readConcern read at the lastApplied, which may be ahead of the all-committed, so a transaction could prepare behind their read timestamp.

Comment by Daniel Gottlieb (Inactive) [ 02/Jan/19 ]

I don't think I have full perspective on the problem at this time to recommend that removing/adding precision to the WT error is the right course of action. Creating a precise error check, I believe, adds more computation to a hot critical section. Removing the error check results in difficult bugs to make manifest and diagnose. A third option would be to have the precise error check in WT "diagnostic" builds and no error check otherwise. Though, I'm not sure if that's what our build configuration does on debug builds today and changing things might discover some complications.

I accept your statement about why preparing behind some readers that set a read timestamp (with ignore_prepare) is fine, but I haven't been convinced of the following:

  1. That the "violating" readers in the test failures were in fact configured with ignore_prepare=true.
  2. Why readers using ignore_prepare=true, that want to read in front of a no-holes point, require a read timestamp at all. If repl has already done the appropriate waiting for any causal relationships, does assigning a read timestamp, possibly in front of the no-holes point, provide some necessary semantics?

Comment by Judah Schvimer [ 02/Jan/19 ]

Are we in agreement then that the work here is to relax the WT invariant? daniel.gottlieb, since this invariant is in WT, should I assign this ticket to the storage team to relax the invariant?

Comment by Daniel Gottlieb (Inactive) [ 28/Dec/18 ]

I've disabled tests that depend on this ticket being resolved in SERVER-38783. This ticket (or a follow-on ticket) should re-enable the tests in etc/evergreen.yml annotated with SERVER-38499.

Comment by Daniel Gottlieb (Inactive) [ 28/Dec/18 ]

My understanding of Judah's comments is we can keep the invariant but only for reads with ignore_prepared=false. This probably means we'd have to track another timestamp for active reads with ignore_prepared=false in WT.

Ah, thanks siyuan.zhou. Now I follow. At least from my testing, WT does not allow preparing behind any transactions with a read timestamp, regardless of their ignore_prepared setting. I believe you're right; the direct way to implement a precise amount of leniency would require a separately maintained data structure.
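
One way to picture that separately maintained data structure is a second collection that records only readers with ignore_prepared=false. This is a sketch of the idea under discussion, not WiredTiger code; `ReadTracker` and its methods are hypothetical.

```python
class ReadTracker:
    """Hypothetical bookkeeping: strict readers are tracked apart from lenient ones."""

    def __init__(self):
        self.all_reads = []      # every active read timestamp
        self.strict_reads = []   # only readers with ignore_prepared=false

    def begin_read(self, read_ts, ignore_prepared):
        self.all_reads.append(read_ts)
        if not ignore_prepared:
            self.strict_reads.append(read_ts)

    def can_prepare_current(self, prepare_ts):
        # Behaviour described in the comment: any timestamped reader blocks the prepare.
        return all(prepare_ts > ts for ts in self.all_reads)

    def can_prepare_relaxed(self, prepare_ts):
        # Relaxed rule: only ignore_prepared=false readers block the prepare.
        return all(prepare_ts > ts for ts in self.strict_reads)


tracker = ReadTracker()
tracker.begin_read(10, ignore_prepared=True)   # e.g. an oplog reader
tracker.begin_read(5, ignore_prepared=False)   # a snapshot reader behind the prepare
```

With these two readers, a prepare at 10 is rejected by the current rule but accepted by the relaxed one. Maintaining the second collection whenever a reader begins or ends is the additional bookkeeping this approach would need.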

Comment by Siyuan Zhou [ 27/Dec/18 ]

daniel.gottlieb, yes, I believe this is the issue described in this ticket, assuming "Timestamp :commit 10 (creates hole)" means reserving the OplogSlot and calling RecoveryUnit::setPrepareTimestamp().

If the invariant is removed, I'm still not sure WT would behave as required w.r.t the reader.

My understanding of Judah's comments is we can keep the invariant but only for reads with ignore_prepared=false. This probably means we'd have to track another timestamp for active reads with ignore_prepared=false in WT.

Comment by Daniel Gottlieb (Inactive) [ 27/Dec/18 ]

I understand judah.schvimer and tess.avitabile are on vacation. cc siyuan.zhou

The argument Judah gives about guarantees makes sense to me. To put those words into a concrete example, this would be an expected sequence of commands/outcomes:

| Client/Preparer                                     | Oplog Writer for Prepare            | Random Writer        | Reader                                        |
|-----------------------------------------------------+-------------------------------------+----------------------+-----------------------------------------------|
| Begin Txn                                           |                                     |                      |                                               |
| Insert A                                            |                                     |                      |                                               |
|                                                     | Begin Txn                           |                      |                                               |
|                                                     | Timestamp :commit 10 (creates hole) |                      |                                               |
|                                                     |                                     | Begin Txn            |                                               |
|                                                     |                                     | Timestamp :commit 20 |                                               |
|                                                     |                                     | Commit Txn           |                                               |
|                                                     |                                     |                      | Begin Txn                                     |
|                                                     |                                     |                      | Timestamp :readAt 20                          |
|                                                     |                                     |                      | <wait for all earlier to become visible (20)> |
| Prepare 10 (invariants, but Storage should succeed) |                                     |                      |                                               |
|                                                     | Insert oplog                        |                      |                                               |
|                                                     | Commit Txn (fills hole)             |                      |                                               |
|                                                     |                                     |                      | <Proceeds>                                    |
|                                                     |                                     |                      | Read A (blocks because it's prepared)         |

siyuan.zhou can you confirm this is the scenario this ticket is describing as a problem? If the invariant is removed, I'm still not sure WT would behave as required w.r.t the reader.
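
The reader's wait in the table depends on the no-holes point. A minimal model, assuming the no-holes point is the largest committed timestamp below every still-open (reserved but uncommitted) timestamp; `no_holes_point` and `reader_may_proceed` are hypothetical helpers, not server functions.

```python
def no_holes_point(committed, in_flight):
    """Largest committed timestamp with no open hole at or before it (0 if none)."""
    ceiling = min(in_flight) if in_flight else float("inf")
    eligible = [ts for ts in committed if ts < ceiling]
    return max(eligible, default=0)


def reader_may_proceed(read_ts, committed, in_flight):
    # The reader waits until everything at or before its timestamp is visible.
    return no_holes_point(committed, in_flight) >= read_ts


# Hole reserved at 10, unrelated write committed at 20: the reader at 20 must wait.
before_fill = reader_may_proceed(20, committed={20}, in_flight={10})
# The oplog writer commits at 10 and fills the hole: the reader proceeds.
after_fill = reader_may_proceed(20, committed={10, 20}, in_flight=set())
```

In this model the prepare at 10 happens while the reader is still waiting, which is why the reader only reaches document A after it has been prepared.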

Comment by Judah Schvimer [ 13/Dec/18 ]

we would read a version that may or may not have been correct at our time

For a local, available, or majority read without afterClusterTime we don't make any guarantees about what point in time you read at. If a storage-transaction for a regular update is in flight concurrently with the read, there are no guarantees about whether or not you see that update. The same applies for prepared transactions. Non-timestamped reads that occur concurrently with a prepared transaction should safely be able to read the pre- or post-image of the transaction.

At the same time we're trying to use the fact that all reads happen at a specific time to be able to correlate shard version info. So, I'm concerned that in some cases we assume snapshot semantics where everything happens at a specific timestamp, and other cases we assume that we don't actually care about snapshot semantics.

What shard version info are you referring to? And when would we assume snapshot semantics without providing snapshot read concern, or doing everything it does to ensure proper snapshot semantics?

Can we keep the invariant there only for reads with ignore_prepared=false? This should be set correctly after SERVER-36382.

The only alternative I can see to this is (from SERVER-35798):

One idea we think could work well is if we were able to call prepareTransaction() with no timestamp and block all reads on the affected documents, and later after writing the oplog entry and getting a prepareTimestamp we can set the prepareTimestamp and the blocking will begin to only happen on reads after the prepareTimestamp

Comment by Geert Bosch [ 12/Dec/18 ]

It seems to me that we still need the invariant. If we read before the prepare time with a normal read operation, say at the last applied time on a secondary, we would read a version that may or may not have been correct at our time. IIUC, you're saying that in most cases we don't really care about snapshot semantics, and this is OK. At the same time we're trying to use the fact that all reads happen at a specific time to be able to correlate shard version info. So, I'm concerned that in some cases we assume snapshot semantics where everything happens at a specific timestamp, and other cases we assume that we don't actually care about snapshot semantics.

Comment by Geert Bosch [ 12/Dec/18 ]

judah.schvimer, I'm not clear on all the intricacies here. Maybe we can discuss tomorrow?

Comment by Judah Schvimer [ 12/Dec/18 ]

This was first discussed in SERVER-35798:

There is a period of time when a read can come in after we choose the prepareTimestamp but before we've prepared the transaction. This will now be safe though, because any read that cares, which is any with readConcern: snapshot or readConcern: *, afterClusterTime: > prepareTimestamp will call waitForAllEarlierOplogWritesToBeVisible as of SERVER-35821, which would wait for the prepare oplog entry to be visible before trying to do the read, and at that point the transaction will already be prepared.

I think the outcome of this discussion that we didn't follow through on was relaxing the above invariant in WT. geert.bosch, do you agree? Should we move this to be a WT ticket to relax that invariant?
