[SERVER-39680] Maintain the oldest active transaction timestamp only with the transaction table Created: 19/Feb/19  Updated: 29/Oct/23  Resolved: 07/Mar/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.9

Type: Task Priority: Major - P3
Reporter: Siyuan Zhou Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-39792 Update the txn table for the first op... Closed
is depended on by SERVER-40013 upgrade downgrade support for config.... Closed
is depended on by SERVER-40018 Remove ServerTransactionsMetrics::get... Closed
Duplicate
is duplicated by SERVER-39828 Track the first timestamp of transact... Closed
Related
related to SERVER-39829 Consider in-progress transactions whe... Closed
related to SERVER-39989 Use a config.transactions find comman... Closed
is related to SERVER-36494 Prevent oplog truncation of oplog ent... Closed
is related to SERVER-39679 Add callback to replication when stor... Closed
is related to SERVER-39813 Add the oldest required timestamp int... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-03-11
Participants:

 Description   

The current plan for Prepare Support for Transactions uses an in-memory data structure to maintain the oldest active transaction timestamp. Instead, we can store the timestamp of a transaction's first oplog entry in the transaction table, so the transaction table has all the information necessary to calculate the oldest active transaction timestamp. With SERVER-39679, this calculation only needs to happen when a checkpoint is taken or an initial sync starts, so its performance isn't a big concern. The updates on the transaction table are timestamped, so calculating the oldest required timestamp is just a read at the checkpoint's timestamp.
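
For illustration, a rough Python sketch of that calculation against config.transactions, assuming the hypothetical "firstTimestamp" and "active" field names used in this description (the server would perform the equivalent read internally at the checkpoint's timestamp, not through a driver):

```python
from pymongo import ASCENDING, MongoClient

# Sketch only: find the smallest first-oplog-entry timestamp among active
# transactions. "firstTimestamp" and "active" are hypothetical field names
# taken from this ticket's description.
client = MongoClient()
txns = client["config"]["transactions"]

oldest = txns.find_one(
    {"active": True},                       # only in-progress transactions
    sort=[("firstTimestamp", ASCENDING)],   # oldest first
)
oldest_active_txn_ts = oldest["firstTimestamp"] if oldest else None
```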

If the table scan's performance isn't good enough, we can add an index on the "firstTimestamp" field. The index can be a partial index on "firstTimestamp" that only includes documents with a new "active: true" field, so that retryable writes and finished transactions don't affect performance.
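
A sketch of that suggested partial index, again assuming the "firstTimestamp" and "active: true" field names from this description:

```python
from pymongo import ASCENDING, MongoClient

# Sketch only: a partial index so that retryable writes and finished
# transactions (which would not carry "active: true") stay out of the index.
client = MongoClient()
client["config"]["transactions"].create_index(
    [("firstTimestamp", ASCENDING)],
    partialFilterExpression={"active": True},
)
```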



 Comments   
Comment by Siyuan Zhou [ 19/Apr/19 ]

david.golden, SERVER-39813 tracks the work to expose the oldest active transaction timestamp. We still plan to add it to serverStatus.

Comment by David Golden [ 19/Apr/19 ]

For mongomirror and mongodump, could someone please summarize how these tools should determine the oldest active transaction timestamp? The previous design discussions all assumed this would be available from serverStatus.

Comment by Judah Schvimer [ 07/Mar/19 ]

I filed SERVER-40013 for the upgrade-downgrade work.

Comment by Judah Schvimer [ 06/Mar/19 ]

The upgrade-downgrade logic is also still todo, but can be done in a follow-up ticket or later on.

Comment by Githook User [ 06/Mar/19 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-39680 Redundant test variable
Branch: master
https://github.com/mongodb/mongo/commit/8d09d4f70be8222f3e6818b19b5c678c18e9172e

Comment by Githook User [ 06/Mar/19 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-39680 Save start timestamp in config.transactions
Branch: master
https://github.com/mongodb/mongo/commit/74eb8f3c5ad4e9010bed2fb8f50044df2db67948

Comment by Judah Schvimer [ 25/Feb/19 ]

Thank you for the clarification. I may not have put that in the right implementation order, but I'll defer to you if you think this makes sense to do before finishing the parts required for "prepare". I don't expect the upgrade-downgrade logic here to be very difficult since SERVER-36498 did most of the heavy lifting, but I do think that will be the hardest part of this ticket.

Comment by A. Jesse Jiryu Davis [ 25/Feb/19 ]

I think that doing this ticket is part of the plan at the bottom of SERVER-36494, right?

> finding the OATT by doing a timestamped read on config.transactions will be significantly easier for “Larger Transactions”, and sidestep the open questions of the original OATT design in Appendix II of the Prepare Design. As a result, we are going to pivot to that solution to save net work.

I think this ticket has two parts: first, add and use a new field in the transactions table for tracking each transaction's oldest timestamp. Second, delete the in-memory data structure in ServerTransactionsMetrics and use the transactions table exclusively. We have some freedom about the order in which we do the parts of this ticket and the parts of SERVER-36494, but I think they're interdependent.
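
As a rough sketch of the first part, the transaction machinery could stamp the session's config.transactions document when the transaction's first oplog entry is written. The field names and the driver-level update below are illustrative assumptions, not the actual implementation:

```python
# Illustration only: record a transaction's first oplog timestamp in its
# config.transactions document. In the server this would happen internally
# when the transaction's first oplog entry is written; "firstTimestamp" and
# "active" are hypothetical field names taken from this ticket's description.
def record_first_op(txns, session_id, txn_number, first_op_ts):
    txns.update_one(
        {"_id": session_id},                 # config.transactions is keyed by session id
        {"$set": {
            "txnNum": txn_number,
            "firstTimestamp": first_op_ts,   # timestamp of the first oplog entry
            "active": True,                  # cleared when the transaction finishes
        }},
        upsert=True,
    )
```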

Comment by Judah Schvimer [ 25/Feb/19 ]

jesse, is this required for SERVER-36494, or can it be follow-on work for compatibility with "Larger Transactions"?

Comment by Judah Schvimer [ 22/Feb/19 ]

This will need upgrade-downgrade logic similar to SERVER-36498.

Comment by A. Jesse Jiryu Davis [ 22/Feb/19 ]

We can delete ServerTransactionsMetrics::getOldestActiveOpTime and related methods and data, I think, as part of this ticket.

Comment by Siyuan Zhou [ 21/Feb/19 ]

judah.schvimer and I discussed the priority of this ticket. If we don't start this work, we would have to do these four tickets instead to maintain the timestamps in memory for larger transactions.

  • Maintain the oldest active transaction timestamp and the oldest required timestamp on the primary.
  • Maintain the oldest active transaction timestamp and the oldest required timestamp on secondaries.
  • Reconstruct the in-memory data structures for active transactions on startup.
  • Reconstruct the in-memory data structures for active transactions on rollback (including rollback-via-refetch).

The first one might be simple, but updating the "first timestamp" on secondaries is more than just using a different timestamp; the work has to be done by SyncTail. Reconstructing the in-memory data structures is also extra work beyond the Prepare project. This ticket also saves a write whenever the oldest active transaction timestamp is updated, which matters since we want to ensure performance parity with 4.0 for unprepared transactions.

Thus, we want to prioritize this work and its dependency in the storage interface, SERVER-39679. daniel.gottlieb, it'd be great if the storage team could prioritize SERVER-39679. Alternatively, we can work on SERVER-39679 with your guidance and code review.

I agree with tess.avitabile that we can let the oplog roll over when majority read concern is off in 4.2.0. If that turns out to be a problem, we could add the "traverse the oplog" solution in later versions of 4.2.

Comment by Siyuan Zhou [ 21/Feb/19 ]

Larger Transactions will make the needed window of oplog larger than in earlier versions. We could limit the size of transactions and write a transaction's oplog entries together to lessen the impact.

Another solution: if, when majority reads are off, all oplog entries of a transaction are written together by reserving their OpTimes together, we can jump to the entry at the stable timestamp in the oplog and, if it is part of a transaction, traverse backwards to find that transaction's start, which is the oldest required timestamp. From the perspective of the oplog, there's at most one active transaction at any time. To solve the large oplog hole problem, we can limit transaction size.
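
A rough sketch of that backwards traversal, assuming contiguous transaction oplog entries and using the transaction's lsid/txnNumber to detect its boundary; this is illustrative only, not how the server would implement it:

```python
from pymongo import DESCENDING, MongoClient

# Sketch only: starting from the newest oplog entry at or before the stable
# timestamp, walk backwards while entries belong to the same transaction
# (same lsid and txnNumber). The first such entry's "ts" is the oldest
# required timestamp; if the starting entry is not part of a transaction,
# the stable timestamp itself suffices.
def oldest_required_ts(client, stable_ts):
    oplog = client["local"]["oplog.rs"]
    newest = oplog.find_one({"ts": {"$lte": stable_ts}}, sort=[("ts", DESCENDING)])
    if newest is None or "lsid" not in newest:
        return stable_ts
    lsid, txn_num = newest["lsid"], newest.get("txnNumber")
    first_ts = newest["ts"]
    for doc in oplog.find({"ts": {"$lte": stable_ts}}, sort=[("ts", DESCENDING)]):
        if doc.get("lsid") != lsid or doc.get("txnNumber") != txn_num:
            break
        first_ts = doc["ts"]
    return first_ts
```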

daniel.gottlieb, a follow-up question: what would happen when "majority reads" is turned on in terms of maintaining the stable timestamp?

Comment by Tess Avitabile (Inactive) [ 20/Feb/19 ]

We do not allow prepare when majority reads are disabled.

If MongoDB has not historically kept sufficient oplog for rollback-via-refetch, then I do not believe we need to start doing so for the Larger Transactions project.

Comment by Daniel Gottlieb (Inactive) [ 20/Feb/19 ]

Do you know if we always kept history back to the majority commit point in earlier versions?

I don't believe we've ever done that. Historically, MongoDB would only keep some amount of oplog measured in bytes. Only recently did we start keeping oplog back to the "stable timestamp" for replication recovery.

Thinking pedantically about the problem, though: if a "majority reads off" node is allowed to prepare and vote for transactions, I imagine there needs to be some guarantee about not throwing away the ledger of that transaction until it's majority committed.

Comment by Tess Avitabile (Inactive) [ 20/Feb/19 ]

daniel.gottlieb, as I recall, this wasn't an explicit decision. Rather, since we don't use RTT when _keepDataHistory=false, we assumed we only need oplog for crash recovery. Do you know if we always kept history back to the majority commit point in earlier versions?

Comment by Daniel Gottlieb (Inactive) [ 20/Feb/19 ]

Re Siyuan's concern with majority reads off:

tess.avitabile I can say how the code in master behaves (I think), but I don't remember how much we explicitly decided on and how much I slipped in to avoid answering hard questions. If we did explicitly decide on this, this is your opportunity to opt for something more robust! I think the API for passing in the "maximum truncation timestamp" for a given fake "stable" timestamp was not expressive enough. Moving to a callback strategy removes that barrier.

I believe with majority reads off, we only pin enough oplog to do replication recovery after a restart. If rollback via refetch discovers it needs some oplog entry that's been lopped off, the mongod will simply need to resync. If we thought that was good enough a couple months ago, perhaps it's still good enough now?
