[SERVER-47844] Update _setStableTimestampForStorage to set the stable timestamp without using the stable optime candidates set when EMRC=true Created: 29/Apr/20  Updated: 29/Oct/23  Resolved: 30/Jun/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Task Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: William Schultz (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-49006 Only advance oldest timestamp when se... Closed
is depended on by SERVER-48518 Rollback via refetch (EMRC = false) c... Closed
is depended on by SERVER-47845 Remove obsolete code related to stori... Closed
Problem/Incident
causes SERVER-49472 Add back holdStableTimestampAtSpecifi... Closed
Related
related to SERVER-55305 Retryable write may execute more than... Closed
related to SERVER-49221 Complete TODO listed in SERVER-47844 Closed
related to SERVER-49355 Complete TODO listed in SERVER-47844 Closed
related to SERVER-49406 Complete TODO listed in SERVER-47844 Closed
is related to SERVER-33806 Oldest timestamp can move ahead of th... Closed
is related to SERVER-30577 Clear list of stable timestamp candid... Closed
is related to SERVER-33292 Have storage dictate where replicatio... Closed
is related to SERVER-49167 Update _setStableTimestampForStorage ... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2020-05-18, Repl 2020-06-01, Repl 2020-06-15, Repl 2020-06-29, Repl 2020-07-13
Participants:
Linked BF Score: 15

 Description   

Currently, we update the stable timestamp inside ReplicationCoordinatorImpl::_setStableTimestampForStorage by computing the stable optime from the set of stable optime candidates. We should remove the dependence on the stable optime candidates for setting the stable timestamp and updating the committed snapshot when enableMajorityReadConcern:true. We should be able to set the stable timestamp for storage directly as min(all_durable, lastCommittedOpTime). We will not remove any of the logic for computing and updating the stable optime candidates set as part of this ticket.
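As a rough illustration of the proposed change (a minimal Python sketch of the C++ logic; both functions are simplified caricatures, and the names are invented for illustration):

```python
# Old approach: maintain a set of candidate optimes and pick the newest
# candidate that does not exceed the commit point.
def stable_from_candidates(candidates, commit_point):
    eligible = [ts for ts in sorted(candidates) if ts <= commit_point]
    return eligible[-1] if eligible else None

# Proposed EMRC=true approach: no candidate set at all, just a direct
# computation from two values we already track.
def stable_direct(all_durable, last_committed):
    return min(all_durable, last_committed)

assert stable_from_candidates({3, 5, 8}, commit_point=6) == 5
assert stable_direct(all_durable=7, last_committed=6) == 6
```

The direct computation removes the bookkeeping of adding and pruning candidates, at the cost of the extra guard conditions discussed in the comments below.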



 Comments   
Comment by Githook User [ 07/Oct/20 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}

Message: SERVER-48518 Fix rollback via refetch anomaly, try 2

Includes the following partial backports:

SERVER-47844 Add the ability to get the initialDataTimestamp from the storage engine interface

(cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77)
(cherry picked from commit 329d8c517d8b3c3fb4bcb63eecf6064ac9a007cf)

SERVER-48518 Fix rollback via refetch anomaly

(cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
(cherry picked from commit 88c0265e057f0e5581306f294d1ca2bda19760e4)

SERVER-50183 Copy _awaitPrimaryAppliedSurpassesRollbackApplied function from RollbackTest to RollbackTestDeluxe

(cherry picked from commit 252251d38915b9e6722186b9742cc914a045d589)
(cherry picked from commit d4b960b5f3f4a7a2b18b48d7fb14251704a8bda8)
Branch: v4.0
https://github.com/mongodb/mongo/commit/ece51101e58dfaf7e455c8c96df6ade42b99515c

Comment by Githook User [ 08/Sep/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-48518 Fix rollback via refetch anomaly

Includes the following partial backports:

SERVER-47844 Add the ability to get the initialDataTimestamp from the storage engine interface

(cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77)
(cherry picked from commit 329d8c517d8b3c3fb4bcb63eecf6064ac9a007cf)

SERVER-48518 Fix rollback via refetch anomaly

(cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
(cherry picked from commit 88c0265e057f0e5581306f294d1ca2bda19760e4)

SERVER-50183 Copy _awaitPrimaryAppliedSurpassesRollbackApplied function from RollbackTest to RollbackTestDeluxe

(cherry picked from commit 252251d38915b9e6722186b9742cc914a045d589)
(cherry picked from commit d4b960b5f3f4a7a2b18b48d7fb14251704a8bda8)
Branch: v4.0
https://github.com/mongodb/mongo/commit/b07f80de5850c665e75dc259def6b8999d1077dd

Comment by Githook User [ 25/Aug/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-47844 Add the ability to get the initialDataTimestamp from the storage engine interface

(cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77)
(cherry picked from commit 329d8c517d8b3c3fb4bcb63eecf6064ac9a007cf)
Branch: v4.2
https://github.com/mongodb/mongo/commit/b4d587314e3f8bc9cfb800a9ede40349756dfd86

Comment by A. Jesse Jiryu Davis [ 20/Aug/20 ]

I cherry-picked the first half of the changes for this ticket to 4.4, since it was a prerequisite for backporting SERVER-48518 to 4.4. We're not going to backport the rest of the changes for this ticket.

Comment by Githook User [ 20/Aug/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-47844 Add the ability to get the initialDataTimestamp from the storage engine interface

(cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77)
Branch: v4.4
https://github.com/mongodb/mongo/commit/329d8c517d8b3c3fb4bcb63eecf6064ac9a007cf

Comment by William Schultz (Inactive) [ 09/Jul/20 ]

After the changes from this ticket, when enableMajorityReadConcern:true, we no longer use the stable optime candidates list at all to compute the stable timestamp or the committed snapshot. Instead, the ReplicationCoordinatorImpl::_recalculateStableOpTime method computes the stable optime as the minimum of the no-overlap point and the lastCommittedOpTime. The no-overlap point is a timestamp that can be guaranteed to be "consistent" on both primaries and secondaries, and the concept already has a precedent in the storage layer. The definition of "consistent" here is a bit vague, but, generally, we need a timestamp T with a guarantee that no future transactions will commit at timestamps less than T.

This is necessary because the stable timestamp and the committed snapshot must be safe for readers, since both are timestamps used for reading data. We want those readers to see a view of the data that is consistent with the oplog, i.e. if they read at timestamp T, the data returned should reflect all operations that have been applied in the oplog up to T. There are unsafe windows of timestamps on both primaries and secondaries where this property is violated. On primaries, it occurs because we assign timestamps to concurrent transactions in an order that may differ from their commit order, creating oplog "holes". On secondaries, parallel batch application means the data on disk during a batch may not reflect the oplog accurately, since ops are applied in parallel and in a nondeterministic order. So, the no-overlap timestamp provides a unified notion of which timestamps are safely "visible" for timestamped readers to look at. Beyond that timestamp lies an undefined frontier, where reads may return incorrect or inconsistent data. The no-overlap point is thus computed as the minimum of the allDurable, which provides a safe timestamp on primaries, and the lastApplied, which provides a safe timestamp on secondaries, since it prevents us from ever looking beyond the end of the last complete secondary batch.
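The computation described in this comment can be sketched roughly as follows (illustrative Python, not the actual C++ member names; timestamps are plain integers):

```python
def no_overlap(all_durable, last_applied):
    # allDurable guards against oplog holes on primaries;
    # lastApplied guards against reading past the last complete
    # batch boundary on secondaries.
    return min(all_durable, last_applied)

def recalculate_stable_optime(all_durable, last_applied, last_committed):
    # Stable optime = min(no-overlap point, commit point).
    return min(no_overlap(all_durable, last_applied), last_committed)

# A secondary mid-batch: commit point at 30, last complete batch
# boundary (lastApplied) at 20, allDurable at 25 -> stable stays at 20.
assert recalculate_stable_optime(all_durable=25,
                                 last_applied=20,
                                 last_committed=30) == 20
```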

In addition to the changes in how we compute the stable optime, we also added a few extra conditions on when we should avoid updating the stable timestamp, mainly related to initial sync. Previously, not adding optimes to the stable optime candidates list during initial sync prevented us from setting the stable timestamp or committed snapshot behind the oldest timestamp or the initialDataTimestamp. Without the candidates list, though, we need explicit checks that we do not set the stable timestamp behind the initialDataTimestamp after coming out of initial sync, and that we do not set it at all during initial sync, since doing so would risk setting it behind the oldest timestamp.

Note that we are stuck using optimes for now when computing the stable optime, because the value is also used to set the _currentCommittedSnapshot, which is an optime. Eventually, though, it would be ideal to convert all of these values to timestamps only, since timestamps are sufficient for storage engine reads (and local optime comparison). That will likely require a slightly larger refactor, though, since several other parts of the system are wired to use optimes, e.g. the write concern notification logic and the configOpTime in sharding. I don't think these are fundamental impediments, but they probably require some more careful thought about how to reorganize things to care only about timestamps.

Comment by Githook User [ 30/Jun/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-47844 Set the stable timestamp without using the stable optime candidates when enableMajorityReadConcern:true
Branch: master
https://github.com/mongodb/mongo/commit/02020fa91c62562cb08f30c8130baf0791cc0a67

Comment by Githook User [ 30/Jun/20 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-47844 Add the ability to get the initialDataTimestamp from the storage engine interface
Branch: master
https://github.com/mongodb/mongo/commit/1408e1b8a5392a9001ee598b5cec66afc4e1cf77

Comment by Judah Schvimer [ 11/Jun/20 ]

Avoiding stable optime updates during a non-maintenance RECOVERING state may suffice to avoid setting the stable timestamp at an inconsistent point.

I don't think this prevents setting the stable timestamp to an inconsistent timestamp immediately after leaving RECOVERING, once we have reached minValid but before minValid is majority committed.

Comment by William Schultz (Inactive) [ 09/Jun/20 ]

Adding some notes on behavior of the stable timestamp in different states that I encountered while testing out some changes for this ticket. This discusses details of the enableMajorityReadConcern:true case.

Initial Sync

During initial sync, we take unstable checkpoints, so the stable timestamp isn't really functionally important. Once we complete initial sync, we set our initialDataTimestamp to our lastApplied optime after finishing oplog application. Note that, throughout initial sync, we also continuously update our oldest_timestamp so we don't pin too much data in memory. Thus, we must impose certain restrictions on how we set the stable timestamp during and after initial sync. In the current system, we avoid setting our stable timestamp earlier than the initialDataTimestamp after initial sync by not adding optimes to the candidates list during initial sync. Without a stable candidates list, however, we still need to ensure that we don't set our stable timestamp behind the initialDataTimestamp. Similarly, we must also not set the stable timestamp behind the oldest_timestamp. Both invariants are enforced at the storage layer.

To prevent setting stableTimestamp < initialDataTimestamp after initial sync, we can explicitly disallow stable timestamp updates that are earlier than our current initialDataTimestamp. To prevent setting stableTimestamp < oldest_timestamp during initial sync, we can prevent stable timestamp updates entirely while in initial sync, i.e. the STARTUP2 state. Note that we might be updating our lastCommittedOpTime during initial sync (since other nodes might be committing writes), so if we didn't explicitly disallow these updates, we might incorrectly update our stable timestamp during initial sync or after coming out of it. To summarize this case, these are the two important invariants we must uphold during initial sync and after leaving it:

  • stable_timestamp >= oldest_timestamp
  • stable_timestamp >= initialDataTimestamp
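The guards described in this comment might look roughly like this (a hedged Python sketch; the real checks live in the C++ replication coordinator, and the function and parameter names here are invented for illustration):

```python
def should_set_stable_timestamp(state, candidate, initial_data_ts):
    # Never move the stable timestamp during initial sync (STARTUP2):
    # the oldest_timestamp keeps advancing, and the stable timestamp
    # must never fall behind it.
    if state == "STARTUP2":
        return False
    # After initial sync, never set the stable timestamp behind the
    # initialDataTimestamp.
    if candidate < initial_data_ts:
        return False
    return True

assert should_set_stable_timestamp("STARTUP2", 50, 40) is False
assert should_set_stable_timestamp("SECONDARY", 30, 40) is False
assert should_set_stable_timestamp("SECONDARY", 50, 40) is True
```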

Secondary

One of the original motivations for the stable optime candidates list was to ensure that we didn't set the stable timestamp to a timestamp that falls in the middle of a secondary oplog application batch. We enforced this by only adding optimes at secondary batch boundaries to the candidate set. It is now safe to set the stable timestamp to a timestamp in the middle of a batch, but we still need to be careful not to set it in the middle of a batch while that batch is being applied. For example, consider a secondary applying a batch containing ops at times [1,2,3], whose lastCommittedOpTime has already advanced to time 3. Since we apply oplog entries in parallel, we might apply the op at time 3, advance our stable timestamp to 3 (since it is <= our commit point), and then try to write the oplog entry at time 2, which would violate the invariant that we never commit a storage transaction at a timestamp behind the stable timestamp. To avoid this, we can constrain the stable timestamp to not surpass our current lastApplied optime, which is set at the previous fully completed batch boundary. This prevents us from advancing the stable timestamp ahead of secondary batch writes that are at an earlier timestamp and have not yet completed. We can roughly summarize this case with one of the important invariants we need to uphold during secondary batch application:

  • commit_timestamp >= stable_timestamp
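The batch scenario described in this comment can be replayed in a small sketch (illustrative Python; timestamps are plain integers and the function name is invented):

```python
def clamped_stable(last_committed, last_applied):
    # Clamp the stable timestamp to the last completed batch boundary so
    # that in-flight parallel appliers never commit behind it.
    return min(last_committed, last_applied)

# Batch [1, 2, 3] is mid-application: lastApplied still sits at the
# previous boundary (0) while the commit point has already advanced to 3.
stable = clamped_stable(last_committed=3, last_applied=0)
assert stable == 0
# The applier that later commits the op at time 2 therefore still
# satisfies commit_timestamp >= stable_timestamp.
assert 2 >= stable
```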

Primary

The behavior of the stable timestamp on primaries is fairly straightforward, i.e. in most cases we can just set it to the lastCommittedOpTime directly. In replica sets with more than one voting node, an op cannot be majority replicated before it is durable, so the lastCommittedOpTime is always constrained to stay behind the allDurable timestamp. In single voting node replica sets, however, the lastCommittedOpTime will be updated to whatever our lastApplied is, so it may be ahead of the allDurable timestamp. We do not want to set our stable timestamp to an "inconsistent" timestamp > allDurable, so we need to constrain the stable timestamp to be no greater than allDurable on primaries, i.e. the minimum of lastCommitted and allDurable. We can roughly summarize this case with these invariants (the second of which must hold true in any state):

  • stable_timestamp <= all_durable
  • stable_timestamp <= lastCommitted
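The single-voting-node case described in this comment is where the clamp matters; a tiny illustrative sketch (Python, invented names):

```python
def primary_stable(last_committed, all_durable):
    # Enforce stable_timestamp <= all_durable and
    # stable_timestamp <= lastCommitted on primaries.
    return min(last_committed, all_durable)

# Single voting node: lastCommitted tracks lastApplied, so it can run
# ahead of allDurable while concurrent transactions leave oplog holes.
assert primary_stable(last_committed=12, all_durable=9) == 9

# Replica set with >1 voting node: lastCommitted already trails
# allDurable, so it wins the min.
assert primary_stable(last_committed=7, all_durable=9) == 7
```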

Recovering

When we have not yet reached minValid, we do not add optimes to the stable candidates list, because they may be at "inconsistent" points. We want to address this issue when we remove the candidates list. Avoiding stable optime updates during a non-maintenance RECOVERING state may suffice to avoid setting the stable timestamp at an inconsistent point. However, it is possible to be in RECOVERING, not yet at minValid, and also in maintenance mode, so we may need a way to explicitly detect whether we have reached a consistent optime.

Startup Recovery

There was an invariant, dating back to 4.0, which verified that if we have an appliedThrough value at startup, it should equal the stable timestamp, since appliedThrough is set to the last optime of a batch after it is applied, and we only set the stable timestamp on secondaries at batch boundaries. With the removal of the stable optime candidates list this will no longer be true, so it should be reasonable to remove this invariant. We can roughly summarize this case with the following invariant, which no longer needs to hold:

  • stable_timestamp ∈ batch_boundaries (no longer true)
Comment by William Schultz (Inactive) [ 30/Apr/20 ]

To make commits and reviews smaller, we can likely do this separately for EMRC=true and EMRC=false. Since we won't yet be removing the supporting logic for updating the stable optime candidates, it should be fine to temporarily have EMRC=true bypass the stable optime candidates set while EMRC=false still uses it.

Generated at Thu Feb 08 05:15:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.