[SERVER-44260] Transaction can conflict with previous transaction on the session if the all committed point is held back
Created: 25/Oct/19  Updated: 29/Oct/23  Resolved: 21/Jan/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.2.4, 4.3.3

Type: Bug
Priority: Major - P3
Reporter: Samyukta Lanka
Assignee: Pavithra Vetriselvan
Resolution: Fixed
Votes: 0
Labels: KP42
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-45430 Optimize the way a transaction waits ... Backlog
is related to SERVER-38028 Participant with prepared transaction... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.2
Sprint: Repl 2019-11-18, Repl 2019-12-16, Repl 2019-12-30, Repl 2020-01-13, Repl 2020-01-27
Participants:
Linked BF Score: 7

 Description   

After SERVER-38028, participants block requests for higher txn numbers instead of failing them. So, a transaction started with read concern snapshot following a prepared transaction could have its read timestamp held back due to oplog holes (since the read timestamp will be set to the all committed point).

The following scenario describes the bug:

  • Thread 1 prepares txn0 at time 5
  • Thread 2 starts new txn1 that blocks on txn0 since it is on the same session
  • Thread 3 opens oplog hole at time 8
  • Thread 1 commits txn0 at time 6, but the commit oplog entry (and txn table update) is written at time 9
  • On thread 2, txn1 opens storage transaction at all_durable, which would be time 7 since there is an oplog hole at time 8
  • txn1 gets a write conflict when writing to the txn table because it is reading from time 7 and doesn't see the write from time 9
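
The scenario above can be replayed with a small toy model. This is only a sketch under simplifying assumptions: timestamps are plain integers, all_durable is modelled as "the timestamp just before the earliest hole", and the names all_durable, update_txn_table, and run_scenario are hypothetical helpers, not server functions.

```python
class WriteConflict(Exception):
    """Raised when a snapshot writer finds a newer committed version of a record."""


def all_durable(reserved_slots, written_slots):
    """Toy model: the timestamp just before the earliest oplog hole.

    A hole is a reserved slot whose write has not been committed yet. With no
    holes, everything up to the newest written slot is visible.
    """
    holes = sorted(set(reserved_slots) - set(written_slots))
    if holes:
        return holes[0] - 1
    return max(written_slots, default=0)


def update_txn_table(txn_table_write_timestamps, read_ts):
    """Fail if the session's txn table record changed after read_ts."""
    newer = [ts for ts in txn_table_write_timestamps if ts > read_ts]
    if newer:
        raise WriteConflict(
            f"read at {read_ts}, record last written at {max(newer)}"
        )
    return read_ts


def run_scenario():
    reserved = {5, 8, 9}    # prepare, Thread 3's slot, commit oplog entry
    written = {5, 9}        # the slot at time 8 is still a hole
    txn_table_writes = [9]  # txn0's txn table update landed at time 9

    read_ts = all_durable(reserved, written)  # txn1's read timestamp
    try:
        update_txn_table(txn_table_writes, read_ts)
        return read_ts, "ok"
    except WriteConflict:
        return read_ts, "WriteConflict"
```

In this model txn1 reads at time 7 and conflicts with the write at time 9. Once the slot at time 8 is filled, all_durable moves to 9 and the same update would succeed.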


 Comments   
Comment by Githook User [ 21/Jan/20 ]

Author: Pavithra Vetriselvan (pvselvan) <pavithra.vetriselvan@mongodb.com>

Message: SERVER-44260 New transaction should wait for previous txn table update to be in snapshot
Branch: master
https://github.com/mongodb/mongo/commit/41b3d7da7c763e304bc8f4056d0d31d200742e0b

Comment by Pavithra Vetriselvan [ 13/Jan/20 ]

This is just waiting for SERVER-45500 to go in!

Comment by Pavithra Vetriselvan [ 12/Dec/19 ]

Yep, confirmed with sharding that the coordinator sends commitTransaction with w: majority.

judah.schvimer and I investigated this yesterday and figured out that the test is running these transactions in parallel shells. Participants of a cross-shard transaction will block requests for higher txnNumbers (not fail them). In this scenario, the second transaction starts when the first exits prepare, not when it majority commits. This transaction can start while there is a hole, caused by some arbitrary parallel writer, behind the commit oplog entry.

I've updated the ticket description with a detailed scenario. Thanks Judah for the help!

Comment by Judah Schvimer [ 11/Dec/19 ]

  • "...the prepared commit should be using w: majority, right?"

The commit decision has to be majority committed, but the coordinator does not need to majority commit the commitTransaction command on the participant. That said, the coordinator might choose to always do so, I am not sure.

Comment by Pavithra Vetriselvan [ 11/Dec/19 ]

siyuan.zhou lingzhi.deng That makes sense. Unfortunately, we don't have data files or a core dump from the BF where this behavior manifested. We do know that the first transaction was prepared though. Therefore, the prepared commit should be using w: majority, right? We also know that the second transaction, whose commit caused the conflict, was not prepared.

  • "...it should have waited for the holes before its commit time."

I'm not sure I understand this statement. If the update to the transactions table occurs at a different timestamp than the commit (since it is a separate write), isn't it still possible to reserve an oplog slot in between the two? If we made the transactions table write occur at the commitTS, then perhaps using w: majority would fix this.

Comment by Judah Schvimer [ 11/Dec/19 ]

Since we use speculative behavior for snapshot read concern, I don't think we need to wait for the previous transactions table update to be in the committed snapshot. I think we just need the previous transactions table update to be earlier than the all_durable timestamp. Calling waitForAllEarlierOplogWritesToBeVisible would ensure this, but would slow down back to back transactions since that waits for the all_durable timestamp to be greater than the latest timestamp in the oplog, which is a stronger condition than we need.

Can the TransactionParticipant keep track of the timestamp of the most recent transactions table write? We could then give waitForAllEarlierOplogWritesToBeVisible an optional timestamp to wait on instead of the latest timestamp in the oplog.
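
A rough sketch of what waiting on an optional timestamp could look like, assuming a toy visibility tracker. The class OplogVisibility and its methods are hypothetical illustrations, not the server's implementation of waitForAllEarlierOplogWritesToBeVisible.

```python
import threading


class OplogVisibility:
    """Toy stand-in for an oplog visibility tracker (hypothetical)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._all_durable = 0
        self._latest_oplog_ts = 0

    def advance(self, all_durable, latest_oplog_ts):
        """Record new values and wake up any waiters."""
        with self._cond:
            self._all_durable = all_durable
            self._latest_oplog_ts = latest_oplog_ts
            self._cond.notify_all()

    def wait_for_visible(self, target_ts=None, timeout=None):
        """Block until all_durable >= target_ts; return False on timeout.

        With target_ts=None the target is the latest oplog timestamp known
        when the call is made, which is the stronger condition.
        """
        with self._cond:
            if target_ts is None:
                target_ts = self._latest_oplog_ts
            return self._cond.wait_for(
                lambda: self._all_durable >= target_ts, timeout=timeout
            )
```

For example, with all_durable at 10 and the latest oplog entry at 12, a caller that only needs the txn table write at time 9 returns immediately, while a caller waiting on the latest oplog timestamp keeps blocking until the later hole closes.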

Comment by Jason Chan [ 11/Dec/19 ]

Note: We also have majority_writes_wait_for_all_durable_timestamp.js that tests that majority writes wait for the all committed (now called the all durable) timestamp on a single node replica set. This was initially added as part of SERVER-41769.

Comment by Lingzhi Deng [ 11/Dec/19 ]

I think even on a single node replset, w: majority also waits for holes. This is because we set the committed snapshot to stable timestamp (or min(lastCommitted, stableTimestamp) if eMRC = false) which shouldn't include holes.
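
The rule described above can be written as a small function. This is a sketch only, with timestamps as plain integers; committed_snapshot and its parameter names are hypothetical.

```python
def committed_snapshot(stable_timestamp, last_committed,
                       enable_majority_read_concern=True):
    """Pick the committed snapshot timestamp as described in the comment.

    With eMRC enabled the stable timestamp is used directly; with eMRC
    disabled the smaller of lastCommitted and the stable timestamp is used.
    """
    if enable_majority_read_concern:
        return stable_timestamp
    return min(last_committed, stable_timestamp)
```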

Comment by Siyuan Zhou [ 11/Dec/19 ]

To understand the problem more, I'm wondering whether the previous transaction had majority write concern. If so, it should have waited for the holes before its commit time. It's true for a multi-node replset, but I'm not sure if a primary of a single node replset waits for holes on majority write concern. ldeng may know more about the behavior of majority write concern.

I'm also wondering if afterClusterTime is used in the following transaction. "afterClusterTime" should wait for all concurrent writes to be visible.

Comment by Pavithra Vetriselvan [ 10/Dec/19 ]

Based on the linked failure, we can hit this scenario any time an operation reserves an oplog slot before we update the transactions table. If we're starting a transaction with snapshot readConcern, its snapshot will not include that update. It seems like incorrect behavior for a new transaction to start without waiting for the previous transaction on the same session to update the transactions table.

I think the most ideal option would be to wait for the transactions table update to be in the committed snapshot before starting a newer transaction with snapshot readConcern. We do something similar when calculating the stableOpTime and notifying waiters. I was looking into the work needed for the transactionParticipant to wait on the txn table write before setting the readTimestamp, but the problem is that we don't know when the update to the transactions table will occur. Furthermore, this write does not generate an oplog entry, so I'm not sure how easy it would be to get the timestamp.

I originally considered making the write to the transactions table use writeConcern majority so that it would make it into the current committed snapshot (which is updated by the stableOpTime), but this would likely have a more significant impact on performance.

Another option would be to keep track of the previous transaction's (per session) update to the transaction table. This could even be something as simple as a boolean field "updatedTxnTable" and explicitly wait for this to be true before starting a newer transaction.
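
As a minimal sketch of the boolean idea, assuming one state object per session: the class SessionTxnState and its method names are hypothetical, and only the "updatedTxnTable" field comes from the proposal above.

```python
import threading


class SessionTxnState:
    """Hypothetical per-session state tracking the txn table update."""

    def __init__(self):
        self._updated_txn_table = threading.Event()
        # No previous transaction on a fresh session, so nothing to wait for.
        self._updated_txn_table.set()

    def on_commit_started(self):
        """The previous transaction is committing; its txn table write is pending."""
        self._updated_txn_table.clear()

    def on_txn_table_updated(self):
        """The previous transaction's txn table write has been performed."""
        self._updated_txn_table.set()

    def begin_new_transaction(self, timeout=None):
        """Wait for the pending txn table update; return False on timeout."""
        return self._updated_txn_table.wait(timeout)
```

In this toy model the flag only records that the write happened. It does not say whether the write is at or below the all_durable timestamp, so a new transaction could still pick a read timestamp that excludes it.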

judah.schvimer siyuan.zhou any thoughts here? Is this the direction we'd like to take this ticket?

Comment by Jason Chan [ 28/Oct/19 ]

Yes, SERVER-42225 and SERVER-41769 addressed specific tests where this was happening. An audit was also performed on the tests that existed at the time, to make sure that any operations we expect to be in the snapshot are majority committed before starting the new transaction. Since this test was added after the audit, it wasn't covered by the aforementioned tickets.

Comment by Judah Schvimer [ 28/Oct/19 ]

This sounds like another example of SERVER-42225. jason.chan, is that right?
