[SERVER-39680] Maintain the oldest active transaction timestamp only with the transaction table Created: 19/Feb/19 Updated: 29/Oct/23 Resolved: 07/Mar/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.9 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Siyuan Zhou | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Repl 2019-03-11 |
| Participants: | |
| Description |
|
The current plan for Prepare Support for Transactions uses an in-memory data structure to maintain the oldest active transaction timestamp. Instead, we can store the timestamp of a transaction's first oplog entry in the transaction table, so the transaction table has all the information necessary to calculate the oldest active transaction timestamp. If a table scan's performance isn't good enough, we can add an index on the "firstTimestamp" field. The index can be a partial index on "firstTimestamp" that only covers documents with a new "active: true" field, so that retryable writes and finished transactions don't affect performance. |
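As a rough illustration of the idea (a sketch only: "firstTimestamp" and "active" are the hypothetical fields described above, not a final schema, and a driver is used as a stand-in for server-internal code):

```python
# Sketch only: the real work happens inside mongod; "firstTimestamp" and
# "active" are assumed field names.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
txns = client["config"]["transactions"]

# Optional partial index so retryable writes and finished transactions
# (which would lack "active: true") don't slow the scan.
txns.create_index(
    [("firstTimestamp", ASCENDING)],
    partialFilterExpression={"active": True},
    name="firstTimestamp_active_partial",
)

# The oldest active transaction timestamp is simply the minimum
# "firstTimestamp" among entries whose transaction is still active.
doc = txns.find_one(
    {"active": True},
    sort=[("firstTimestamp", ASCENDING)],
)
oldest_active_txn_ts = doc["firstTimestamp"] if doc else None
print("oldest active transaction timestamp:", oldest_active_txn_ts)
```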
| Comments |
| Comment by Siyuan Zhou [ 19/Apr/19 ] |
|
david.golden, |
| Comment by David Golden [ 19/Apr/19 ] |
|
For mongomirror and mongodump, could someone please summarize how these tools should determine the oldest active transaction timestamp? The previous design discussions all assumed this would be available from serverStatus. |
| Comment by Judah Schvimer [ 07/Mar/19 ] |
|
I filed |
| Comment by Judah Schvimer [ 06/Mar/19 ] |
|
The upgrade-downgrade logic is also still todo, but can be done in a follow-up ticket or later on. |
| Comment by Githook User [ 06/Mar/19 ] |
|
Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis) Message: |
| Comment by Githook User [ 06/Mar/19 ] |
|
Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis) Message: |
| Comment by Judah Schvimer [ 25/Feb/19 ] |
|
Thank you for the clarification. I may not have put that in the right implementation order, but I'll defer to you if you think this makes sense to do before finishing the parts required for "prepare". I don't expect the upgrade-downgrade logic here to be very difficult since |
| Comment by A. Jesse Jiryu Davis [ 25/Feb/19 ] |
|
I think that doing this ticket is part of the plan at the bottom of

> finding the OATT by doing a timestamped read on config.transactions will be significantly easier for “Larger Transactions”, and sidestep the open questions of the original OATT design in Appendix II of the Prepare Design. As a result, we are going to pivot to that solution to save net work.

I think this ticket has two parts: first, add and use a new field in the transactions table for tracking each transaction's oldest timestamp. Second, delete the in-memory data structure in ServerTransactionsMetrics and use the transactions table exclusively. We have some freedom about the order in which we do the parts of this ticket and the parts of |
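To make the two parts concrete, here is a minimal sketch (the field name "startOpTime" and the driver-level updates are illustrative assumptions; in the server this would live in the transaction machinery, not in a client):

```python
# Illustrative sketch only; "startOpTime" and "active" are assumed names.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
txn_table = client["config"]["transactions"]

# Part 1: when a transaction writes its first oplog entry, record that
# entry's timestamp on the session's transaction-table document.
def record_transaction_start(session_id, txn_number, first_oplog_ts):
    txn_table.update_one(
        {"_id": session_id},
        {"$set": {"txnNum": txn_number,
                  "startOpTime": first_oplog_ts,
                  "active": True}},
        upsert=True,
    )

# Part 2: answer "what is the oldest active transaction timestamp?" from the
# table alone, replacing the in-memory tracking in ServerTransactionsMetrics.
def oldest_active_transaction_ts():
    doc = txn_table.find_one({"active": True},
                             sort=[("startOpTime", ASCENDING)])
    return doc["startOpTime"] if doc else None
```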
| Comment by Judah Schvimer [ 25/Feb/19 ] |
|
jesse, is this required for |
| Comment by Judah Schvimer [ 22/Feb/19 ] |
|
This will need upgrade-downgrade logic similar to |
| Comment by A. Jesse Jiryu Davis [ 22/Feb/19 ] |
|
We can delete ServerTransactionsMetrics::getOldestActiveOpTime and related methods and data, I think, as part of this ticket. |
| Comment by Siyuan Zhou [ 21/Feb/19 ] |
|
judah.schvimer and I discussed the priority of this ticket. If we don't start this work, we would have to do these four tickets to maintain the timestamps in-memory for larger transactions.
The first one might be simple, but updating the "first timestamp" on secondaries is more than just using a different timestamp; the work has to be done by SyncTail. Reconstructing in-memory data structures is also extra work beyond the Prepare project. This ticket also saves a write whenever the oldest active transaction timestamp is updated, which matters since we want to ensure performance parity with 4.0 for unprepared transactions. Thus, we want to prioritize this work and its dependency in the storage interface. I agree with tess.avitabile that we can let the oplog roll over when majority read concern is off in 4.2.0. If that turns out to be a problem, we could add the "traversing the oplog" solution in later versions of 4.2. |
| Comment by Siyuan Zhou [ 21/Feb/19 ] |
|
Larger Transactions will make the needed window of oplog larger than in earlier versions. We could limit the size of transactions and write the oplog entries of a transaction together to alleviate its impact. Another solution: if all oplog entries of a transaction are written together by reserving their OpTimes together when majority reads are off, we can jump to the stable timestamp in the oplog and, if that entry is part of a transaction, traverse back to find the transaction's start, which is the oldest required timestamp. From the oplog's perspective, there's at most one active transaction at any time. To solve the large oplog hole problem, we can limit transaction size. daniel.gottlieb, a follow-up question: what would happen when "majority reads" is turned on in terms of maintaining the stable timestamp? |
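A rough sketch of the traverse-back idea (this assumes 4.2-style transaction oplog entries chained through a "prevOpTime" field whose null value marks the first entry of the chain; that format is an assumption here, and the real traversal would be server-internal):

```python
# Sketch only: walk backward from the entry at (or just before) the stable
# timestamp, following the assumed "prevOpTime" chain of a multi-entry
# transaction until the transaction's first oplog entry is reached.
from bson.timestamp import Timestamp
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
oplog = client["local"]["oplog.rs"]

NULL_TS = Timestamp(0, 0)  # a null prevOpTime marks the start of a chain

def oldest_required_ts(stable_ts):
    entry = oplog.find_one({"ts": {"$lte": stable_ts}}, sort=[("ts", -1)])
    while entry is not None:
        prev = entry.get("prevOpTime")
        if not prev or prev.get("ts") == NULL_TS:
            # Either not part of a transaction chain, or we reached its start.
            return entry["ts"]
        entry = oplog.find_one({"ts": prev["ts"]})
    return None
```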
| Comment by Tess Avitabile (Inactive) [ 20/Feb/19 ] |
|
We do not allow prepare when majority reads are disabled. If MongoDB has not historically kept sufficient oplog for rollback-via-refetch, then I do not believe we need to start doing so for the Larger Transactions project. |
| Comment by Daniel Gottlieb (Inactive) [ 20/Feb/19 ] |
I don't believe we've ever done that. Historically, MongoDB would only keep some amount of oplog measured in bytes. Only recently did we start keeping oplog back to the "stable timestamp" for replication recovery. Thinking pedantically about the problem, though: if a "majority reads off" node is allowed to prepare and vote for transactions, I imagine there needs to be some guarantee about not throwing away the ledger of that transaction until it's majority committed. |
| Comment by Tess Avitabile (Inactive) [ 20/Feb/19 ] |
|
daniel.gottlieb, as I recall, this wasn't an explicit decision. Rather, since we don't use RTT when _keepDataHistory=false, we assumed we only need oplog for crash recovery. Do you know if we always kept history back to the majority commit point in earlier versions? |
| Comment by Daniel Gottlieb (Inactive) [ 20/Feb/19 ] |
|
Re Siyuan's concern with majority reads off: tess.avitabile I can say how the code in master behaves (I think), but I don't remember how much we explicitly decided on and how much I slipped in to avoid answering hard questions. If we did explicitly decide on this, this is your opportunity to opt for something more robust! I think the API for passing in the "maximum truncation timestamp" for a given fake "stable" timestamp was not expressive enough. Moving to a callback strategy removes that barrier. I believe with majority reads off, we only pin enough oplog to do replication recovery after a restart. If rollback via refetch discovers it needs some oplog entry that's been lopped off, the mongod will simply need to resync. If we thought that was good enough a couple months ago, perhaps it's still good enough now? |
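For what it's worth, a toy sketch of the "fixed value vs. callback" distinction Dan describes (the names here are hypothetical, not the server's actual storage API): the oplog may be truncated up to min(stable timestamp, constraint); with a fixed value the constraint is frozen when the stable timestamp is handed over, while a callback lets the storage layer re-evaluate it at truncation time.

```python
# Toy sketch; names are hypothetical, not the server's storage interface.

def truncation_point_fixed(stable_ts, max_truncation_ts):
    # Constraint captured once, up front.
    return min(stable_ts, max_truncation_ts)

def truncation_point_callback(stable_ts, get_max_truncation_ts):
    # Constraint re-evaluated when truncation actually happens.
    return min(stable_ts, get_max_truncation_ts())

if __name__ == "__main__":
    oldest_active_txn_ts = 90          # stand-in for the real constraint

    def current_constraint():
        return oldest_active_txn_ts

    fixed = truncation_point_fixed(100, oldest_active_txn_ts)
    oldest_active_txn_ts = 95          # the constraint moved forward meanwhile
    via_callback = truncation_point_callback(100, current_constraint)
    print(fixed, via_callback)         # 90 95
```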