[SERVER-40870] OpObserverImpl::onTransactionPrepare() reserves OplogSlots rather than using the one reserved by its caller Created: 26/Apr/19  Updated: 13/May/19  Resolved: 09/May/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Siyuan Zhou
Assignee: Siyuan Zhou
Resolution: Won't Fix
Votes: 0
Labels: bigtxns_packing
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-40797 Write each applyOps in its own WUOW f... Closed
Sprint: Repl 2019-05-06, Repl 2019-05-20
Participants:

 Description   

Large transactions will pack multiple operations into a single applyOps. The caller of OpObserver::onTransactionPrepare() won't know in advance how many oplog slots are needed. To avoid the bug in SERVER-35798, we still need to reserve an oplog slot in TransactionParticipant to prevent lastApplied from advancing and to prevent secondaries from reading the prepare oplog entry before the transaction is actually prepared. OpObserverImpl can reserve and use its own OplogSlots to write packed oplog entries in side transactions. The prepare timestamp can be set by OpObserver, or by TransactionParticipant using getLastOp() on ReplClientInfo.

An alternative solution is to reserve more oplog slots than needed, e.g. one for each operation, but the first proposal imposes fewer restrictions on the caller and simplifies the OpObserver interface.
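
The interaction between the caller's reserved slot and the slots the observer reserves for itself can be illustrated with a toy model. This is only a sketch: ToyOplog, reserveSlot(), fillSlot(), and visibleUpTo() are hypothetical names invented for illustration, not MongoDB's classes, and the "nothing is visible past the oldest unfilled slot" rule is a simplifying assumption about how a held slot keeps lastApplied from advancing.

```cpp
// Toy model only: none of these types are MongoDB's real classes.
#include <cassert>
#include <cstdint>
#include <set>

using Timestamp = std::uint64_t;

class ToyOplog {
public:
    // Hand out the next timestamp and remember it as an unfilled "hole".
    Timestamp reserveSlot() {
        Timestamp ts = ++_lastReserved;
        _holes.insert(ts);
        return ts;
    }

    // Fill a previously reserved slot, i.e. its entry is now written.
    void fillSlot(Timestamp ts) {
        auto erased = _holes.erase(ts);
        assert(erased == 1 && "slot must be reserved before it is filled");
        (void)erased;  // avoid an unused-variable warning when NDEBUG is set
    }

    // Highest timestamp with no unfilled slot at or before it. In this toy
    // model, readers and lastApplied may not advance past this point.
    Timestamp visibleUpTo() const {
        return _holes.empty() ? _lastReserved : *_holes.begin() - 1;
    }

private:
    Timestamp _lastReserved = 0;
    std::set<Timestamp> _holes;
};

// Scenario from the description: the caller holds one slot while the
// observer reserves and fills its own slots for the packed entries.
// Returns the visibility point observed while the caller's slot is held.
inline Timestamp visibilityWhileCallerHoldsSlot(int packedEntries) {
    ToyOplog oplog;
    Timestamp callerSlot = oplog.reserveSlot();  // stands in for TransactionParticipant
    for (int i = 0; i < packedEntries; ++i) {
        oplog.fillSlot(oplog.reserveSlot());     // stands in for OpObserverImpl
    }
    Timestamp whileHeld = oplog.visibleUpTo();   // still pinned below callerSlot
    oplog.fillSlot(callerSlot);
    return whileHeld;
}
```

In this model the caller never needs to know how many packed entries the observer writes: however many slots the observer reserves and fills, the visibility point stays below the caller's slot until the caller releases it.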



 Comments   
Comment by Siyuan Zhou [ 13/May/19 ]

I believe we are on the same page. In matthew.russotto's SERVER-40797, we write "partialTxn" oplog entries before calling prepare, as you mentioned, then reserve the oplog slot for the last applyOps oplog entry with the "prepare" field, set prepareTimestamp(), call prepare(), and write the last entry. Setting prepareTimestamp() and calling prepare() are currently done in the transaction participant and will be moved to OpObserver. That is a layer violation, but it is the simplest solution we can think of. An alternative would be adding a new OpObserver function to write "partialTxn" entries, which is a much bigger layer violation.
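
The ordering described above can be sketched as follows. This is a toy model under stated assumptions, not the SERVER-40797 implementation: ToyEntry, ToyTxn, prepareLargeTxn(), and the prepareWillFail flag are hypothetical names invented for illustration, and timestamps are plain counters.

```cpp
// Toy model only: none of these types are MongoDB's real classes.
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Timestamp = std::uint64_t;

struct ToyEntry {
    Timestamp ts;
    std::string kind;  // "partialTxn" or "prepare"
};

struct ToyTxn {
    bool prepareWillFail = false;  // lets the sketch simulate a failed prepare()
    Timestamp prepareTimestamp = 0;
    bool prepared = false;
};

// Returns the oplog entries written for a transaction that is split into
// `chunks` applyOps entries (chunks >= 1).
inline std::vector<ToyEntry> prepareLargeTxn(ToyTxn& txn, int chunks) {
    assert(chunks >= 1);
    std::vector<ToyEntry> oplog;
    Timestamp next = 1;

    // 1. Write every chunk except the last as a "partialTxn" entry.
    for (int i = 0; i + 1 < chunks; ++i) {
        oplog.push_back({next++, "partialTxn"});
    }

    // 2. Reserve the slot for the final applyOps entry.
    Timestamp prepareSlot = next++;

    // 3. The reserved slot determines the prepare timestamp.
    txn.prepareTimestamp = prepareSlot;

    // 4. Call prepare. If it fails, no "prepare" entry is ever written, so
    //    the oplog never exposes a prepared state that did not exist.
    if (txn.prepareWillFail) {
        return oplog;
    }
    txn.prepared = true;

    // 5. Only after a successful prepare, write the final entry.
    oplog.push_back({prepareSlot, "prepare"});
    return oplog;
}
```

With three chunks and a successful prepare, this sketch yields two "partialTxn" entries followed by one "prepare" entry whose timestamp equals the transaction's prepareTimestamp. With a failing prepare, only the two "partialTxn" entries are written.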

Comment by Judah Schvimer [ 13/May/19 ]

it exposes a "prepared" state in the oplog which didn't exist on the primary

I agree, this would be a problem. This, however, only means we cannot write the final "prepare" oplog entry before calling prepare. We could still write "partialTxn" oplog entries if that were helpful.

The "partialTxn" entries don't need to reserve oplog slots. Only the last one, for the implicit prepare, should, since it determines the prepareTimestamp.

How will we give the "partialTxn" entries optimes? Will they be written into the oplog before reserving the prepare oplog slot?

Comment by Siyuan Zhou [ 10/May/19 ]

Preparing a transaction will guarantee a subsequent commit will succeed. Only commit and rollback are allowed on a transaction after it has been prepared. The transaction prepare API is designed to support MongoDB exclusively, and guarantees update conflicts have been resolved, but does not guarantee durability.

According to the WT document, that seems to imply update conflicts will be returned. When prepare() fails, I'm wondering whether we can write the prepare entry. The contract of the oplog is that the corresponding operations should happen as if they occurred in oplog order. I'm afraid that even if an abort follows the prepare (perhaps after some concurrent ops), it exposes a "prepared" state in the oplog which didn't exist on the primary. That seems to be a violation of the contract.

The "partialTxn" entries don't need to reserve oplog slots. Only the last one, for the implicit prepare, should, since it determines the prepareTimestamp.

Comment by Judah Schvimer [ 10/May/19 ]

hit the same write conflict

While prepare can fail, I don't think it can get a write conflict.

only reserve the oplog slot for the last applyOps

Can you clarify how this will work in more depth?

Comment by Siyuan Zhou [ 09/May/19 ]

After discussing with matthew.russotto: we cannot write oplog entries before calling prepare() on the WUOW, since prepare() can fail. Once it fails, we cannot write the oplog entry, regardless of whether we write an abort immediately afterwards; otherwise, secondaries will pick up the prepare entry and hit the same write conflict.

As part of SERVER-40797, we will move the WUOW::prepare() call into OpObserver and only reserve the oplog slot for the last applyOps, which will have a "prepare" field in applyOps format.

Closing this as "Won't Fix".

Comment by Judah Schvimer [ 29/Apr/19 ]

I want to mention a third solution siyuan.zhou and I discussed and rejected: Construct the applyOps entries we will eventually log, reserve their optimes, and then log them with the correct optimes. We then do not need to reserve extra oplog slots. This exposes an undesirable amount of OpObserver behavior, however.
