[SERVER-41861] stableTimestamp calculation makes incorrect assumptions about all_committed Created: 21/Jun/19  Updated: 29/Oct/23  Resolved: 26/Jul/19

Status: Closed
Project: Core Server
Component/s: Replication, Storage, WiredTiger
Affects Version/s: None
Fix Version/s: 4.2.0-rc5, 4.3.1

Type: Bug Priority: Critical - P2
Reporter: Judah Schvimer Assignee: Gregory Wlodarek
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to WT-4900 Implement all_durable timestamp Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2
Sprint: Execution Team 2019-07-15, Storage Engines 2019-07-01, Execution Team 2019-07-29
Participants:
Linked BF Score: 12

 Description   

This explanation is incorrect when prepared transactions are getting committed.

The all_committed timestamp is one less than the commit timestamp of the earliest uncommitted transaction that has a commit timestamp. A prepared transaction is not included in the all_committed calculation until commit time, because it has no commit timestamp before then. At commit time, the all_committed can briefly jump back to commitTimestamp - 1 between when we set the commitTimestamp on the transaction and when we actually commit the transaction.

This invalidates the assumption that the all_committed is always "in the same term" as the commitPoint on a primary.

This also invalidates any assumptions we've made about the all_committed always moving forward.
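
To make the backwards jump concrete, here is a minimal toy model in Python. It is only a sketch of the rule described above, not WiredTiger's implementation: the `Txn` class, the `all_committed` and `observe_prepared_commit` helpers, the integer timestamps, and the fallback to the newest committed timestamp when nothing is pending are all assumptions made for illustration.

```python
from typing import List, Optional


class Txn:
    """Toy transaction (hypothetical): commit_ts stays None until it is set."""

    def __init__(self) -> None:
        self.commit_ts: Optional[int] = None
        self.committed = False


def all_committed(txns: List[Txn], latest_committed_ts: int) -> int:
    """One less than the earliest commit timestamp among uncommitted,
    timestamped transactions, capped by the newest committed timestamp.
    Assumption: with no such transaction, the newest committed timestamp
    is reported."""
    pending = [t.commit_ts for t in txns
               if not t.committed and t.commit_ts is not None]
    if pending:
        return min(min(pending) - 1, latest_committed_ts)
    return latest_committed_ts


def observe_prepared_commit(latest_committed_ts: int,
                            prepared_commit_ts: int) -> List[int]:
    """Sample all_committed at three points while a prepared transaction
    commits: before the commit timestamp is set, after it is set but
    before the commit, and after the commit."""
    txn = Txn()  # prepared, so it has no commit timestamp yet
    txns = [txn]
    observed = [all_committed(txns, latest_committed_ts)]

    txn.commit_ts = prepared_commit_ts  # timestamp set, not yet committed
    observed.append(all_committed(txns, latest_committed_ts))

    txn.committed = True  # commit completes
    latest = max(latest_committed_ts, prepared_commit_ts)
    observed.append(all_committed(txns, latest))
    return observed


samples = observe_prepared_commit(latest_committed_ts=10, prepared_commit_ts=8)
print(samples)  # [10, 7, 10]
```

In this model the middle sample drops to commitTimestamp - 1 and then recovers, which is the window in which a reader of all_committed sees it move backwards.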

There are 3 options I can think of:

  1. Change the semantic meaning of all_committed to be all_durable and use the durable timestamp rather than the commit timestamp to calculate it. This is in line with the idea of all_committed really being used to determine when oplog holes are open. michael.cahill thinks this isn't too hard and is reasonable if needed, though it does require more thought since it's a significant API change.
  2. Add a mechanism for committing a transaction with a commitTimestamp such that it is never counted in calculating all_committed and use it for any storage-transactions (including prepared mongodb transactions) that timestamp their transactions only right before commit time.
  3. Try to work around the current all_committed behavior in stableTimestamp calculation. This doesn't fix the problem of all_committed moving backwards, if in fact that's a problem in other places where we just haven't seen it.
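
As a rough illustration of why option 1 could help, the sketch below applies the same watermark rule twice: once keyed on a commit timestamp that lies in the past, and once keyed on a durable timestamp. It assumes (this is not stated in the ticket) that the durable timestamp of a prepared transaction is newer than anything already durable, for example because it is taken from the next oplog slot. The `watermark`, `trace_commit`, and `is_monotonic` names and the integer timestamps are hypothetical.

```python
from typing import List


def watermark(pending_ts: List[int], latest_ts: int) -> int:
    """Toy rule shared by both calculations: one less than the earliest
    pending timestamp, capped by the newest finished timestamp."""
    if pending_ts:
        return min(min(pending_ts) - 1, latest_ts)
    return latest_ts


def trace_commit(latest_ts: int, stamp_ts: int) -> List[int]:
    """Watermark before the timestamp is set, while it is set but the
    transaction is uncommitted, and after the commit."""
    before = watermark([], latest_ts)
    during = watermark([stamp_ts], latest_ts)
    after = watermark([], max(latest_ts, stamp_ts))
    return [before, during, after]


def is_monotonic(values: List[int]) -> bool:
    """True when the sequence never decreases."""
    return all(a <= b for a, b in zip(values, values[1:]))


# Keyed on the commit timestamp (current behavior): 8 is in the past.
by_commit = trace_commit(latest_ts=10, stamp_ts=8)
# Keyed on the durable timestamp (option 1): 11 is assumed to be the
# next oplog slot, so it is newer than everything already durable.
by_durable = trace_commit(latest_ts=10, stamp_ts=11)

print(by_commit, is_monotonic(by_commit))    # [10, 7, 10] False
print(by_durable, is_monotonic(by_durable))  # [10, 10, 11] True
```

Under that assumption the durable-keyed value never decreases, which matches the idea of using it to track when oplog holes are open.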


 Comments   
Comment by Githook User [ 26/Jul/19 ]

Author: Gregory Wlodarek (GWlodarek) <gregory.wlodarek@mongodb.com>

Message: SERVER-41861 Simplify the concurrency between timestamp_transaction and commit_transaction in WiredTigerRecoveryUnit::_txnClose()

(cherry picked from commit 65f608a4b17440d75ece209e209401e1d74ad638)
Branch: v4.2
https://github.com/mongodb/mongo/commit/713831d52eff7169d58ae3bf1b0fff735fdae305

Comment by Githook User [ 26/Jul/19 ]

Author: Gregory Wlodarek (GWlodarek) <gregory.wlodarek@mongodb.com>

Message: SERVER-41861 Change existing jstests to mention the new 'all_durable' timestamp over the deprecated 'all_committed' timestamp

(cherry picked from commit e6b6a2231ae7f05c3c0f6fc1a0ce111792436e58)
Branch: v4.2
https://github.com/mongodb/mongo/commit/fc5b2b8fef78ab18d9560bce3015802e54dfb248

Comment by Githook User [ 26/Jul/19 ]

Author: Gregory Wlodarek (GWlodarek) <gregory.wlodarek@mongodb.com>

Message: SERVER-41861 Replace 'all_committed' with 'all_durable'

(cherry picked from commit 25d5f6a0b01f261e633587013e4ab8116ea2930a)
Branch: v4.2
https://github.com/mongodb/mongo/commit/daf8b271fb960d65a579bb3e10cc77ca5b16d4a7

Comment by Githook User [ 26/Jul/19 ]

Author: Gregory Wlodarek (GWlodarek) <gregory.wlodarek@mongodb.com>

Message: SERVER-41861 Simplify the concurrency between timestamp_transaction and commit_transaction in WiredTigerRecoveryUnit::_txnClose()
Branch: master
https://github.com/mongodb/mongo/commit/65f608a4b17440d75ece209e209401e1d74ad638

Comment by Githook User [ 26/Jul/19 ]

Author: Gregory Wlodarek (GWlodarek) <gregory.wlodarek@mongodb.com>

Message: SERVER-41861 Change existing jstests to mention the new 'all_durable' timestamp over the deprecated 'all_committed' timestamp
Branch: master
https://github.com/mongodb/mongo/commit/e6b6a2231ae7f05c3c0f6fc1a0ce111792436e58

Comment by Githook User [ 26/Jul/19 ]

Author: Gregory Wlodarek (GWlodarek) <gregory.wlodarek@mongodb.com>

Message: SERVER-41861 Replace 'all_committed' with 'all_durable'
Branch: master
https://github.com/mongodb/mongo/commit/25d5f6a0b01f261e633587013e4ab8116ea2930a

Comment by Jocelyn del Prado [ 19/Jul/19 ]

milkie, the storage engines work here is done (see WT-4900), pending a drop into master. Can you please make sure this gets attention as soon as possible?

Comment by Michael Cahill (Inactive) [ 18/Jul/19 ]

Repeating an offline conversation: for the functionality requested by the Replication team to be fully integrated, WiredTigerRecordStore::oplogDiskLocRegister needs to set the durable timestamp for prepared transactions, rather than the commit timestamp it currently sets.

Comment by Alex Cameron (Inactive) [ 18/Jul/19 ]

kelsey.schubert milkie
WT-4900 added an all_durable timestamp and a connection-level durable_timestamp. As it stands, the deprecated all_committed and connection-level commit_timestamp are just symlinks to the aforementioned new timestamps, so there shouldn't be any functional change happening in that MongoDB work.

There will be a WT drop happening shortly. Provided that there's no fallout from that, I'll assign this ticket back to Replication Backlog.

Comment by Eric Milkie [ 17/Jul/19 ]

There is a bit of work in the MongoDB code to start consuming the all_durable value out of WT.

Comment by Kelsey Schubert [ 17/Jul/19 ]

Is there work that needs to be done under this ticket or can it be closed as a dup of WT-4900?

Comment by Alexander Gorrod [ 27/Jun/19 ]

If we make a WiredTiger change to address this, it's possible that we'll need to stage delivery of it, i.e., add something new while retaining the old behavior, then remove the old behavior. Otherwise we'll need to carefully stage delivery with MongoDB changes.

Comment by Judah Schvimer [ 21/Jun/19 ]

I'll assign this to the storage engines team for investigation.

Comment by Eric Milkie [ 21/Jun/19 ]

I'd like to further explore option number 1, because it's the most elegant solution, as long as there aren't other issues with it that we haven't thought of yet.

Comment by Judah Schvimer [ 21/Jun/19 ]

I don't think it would be possible to work around this just in replication. The stableTimestamp needs to stay behind the all_committed, so any contract in which the all_committed can move backwards would make that impossible to guarantee.
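
A small sketch of that invariant, under the assumption that a stable timestamp is not moved back once it has been set: given a stable timestamp chosen while it was still behind the all_committed, the check below finds the first later all_committed sample that falls below it. The `first_violation` helper and the integer timestamps are hypothetical and used only for illustration.

```python
from typing import Iterable, Optional


def first_violation(stable_ts: int,
                    all_committed_samples: Iterable[int]) -> Optional[int]:
    """Return the index of the first sample for which the stable timestamp
    is no longer behind (or equal to) the all_committed, or None if the
    invariant holds for every sample."""
    for index, sample in enumerate(all_committed_samples):
        if stable_ts > sample:
            return index
    return None


# Stable timestamp 9 was valid while the all_committed was 10.
print(first_violation(9, [10, 7, 10]))   # 1: the all_committed dipped to 7
print(first_violation(9, [10, 10, 11]))  # None: it never moved backwards
```

No choice made on the replication side at the time of the first sample can rule out the dip in the second one, which is why the fix has to come from the storage layer's contract.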

What do people think of the two storage solutions (1 and 2 above)?

CC milkie geert.bosch michael.cahill agorrod suganthi.mani
