[SERVER-47844] Update _setStableTimestampForStorage to set the stable timestamp without using the stable optime candidates set when EMRC=true
Created: 29/Apr/20 | Updated: 29/Oct/23 | Resolved: 30/Jun/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.7.0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | William Schultz (Inactive) | Assignee: | William Schultz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Sprint: | Repl 2020-05-18, Repl 2020-06-01, Repl 2020-06-15, Repl 2020-06-29, Repl 2020-07-13 |
| Participants: |
| Linked BF Score: | 15 |
| Description |
|
Currently, we update the stable timestamp inside ReplicationCoordinatorImpl::_setStableTimestampForStorage by computing the stable optime from the set of stable optime candidates. We should remove the dependence on the stable optime candidates for setting the stable timestamp and updating the committed snapshot when enableMajorityReadConcern:true. We should be able to set the stable timestamp for storage directly as min(all_durable, lastCommittedOpTime). We will not remove any of the logic for computing and updating the stable optime candidates set as a part of this ticket. |
| Comments |
| Comment by Githook User [ 07/Oct/20 ] |
|
Author: {'name': 'A. Jesse Jiryu Davis', 'email': 'jesse@mongodb.com', 'username': 'ajdavis'}
Message: Includes the following partial backports:
(cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77)
(cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
(cherry picked from commit 252251d38915b9e6722186b9742cc914a045d589) |
| Comment by Githook User [ 08/Sep/20 ] |
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}
Message: Includes the following partial backports:
(cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77)
(cherry picked from commit eee49c64cdeb8fa95704b9a316b779eb5eb9800c)
(cherry picked from commit 252251d38915b9e6722186b9742cc914a045d589) |
| Comment by Githook User [ 25/Aug/20 ] |
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}
Message: (cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77) |
| Comment by A. Jesse Jiryu Davis [ 20/Aug/20 ] |
|
I cherry-picked the first half of the changes for this ticket to 4.4, since it was a prerequisite for backporting |
| Comment by Githook User [ 20/Aug/20 ] |
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}
Message: (cherry picked from commit 1408e1b8a5392a9001ee598b5cec66afc4e1cf77) |
| Comment by William Schultz (Inactive) [ 09/Jul/20 ] |
|
After the changes from this ticket, when enableMajorityReadConcern:true, we no longer use the stable optime candidates list at all to compute the stable timestamp or the committed snapshot. Instead, the ReplicationCoordinatorImpl::_recalculateStableOpTime method computes the stable optime as the minimum of the no-overlap point and the lastCommittedOpTime.
The no-overlap point is a timestamp that can be guaranteed to be "consistent" on both primaries and secondaries, and the concept already has a precedent in the storage layer. The definition of "consistent" here is a bit vague, but, generally, we need a timestamp T with a guarantee that no future transactions will commit at timestamps less than T. This is necessary because the stable timestamp and the committed snapshot are both used for reading data, so they must be safe for readers. We want those readers to see a view of data that is consistent with the oplog, i.e. if they read at timestamp T, the data returned should reflect all operations that have been applied in the oplog up to T.
There are unsafe windows of timestamps on both primary and secondary where this property is violated. On primaries, it occurs because we assign timestamps to concurrent transactions in an order that may differ from their commit order, which creates oplog "holes". On secondaries, ops within a batch are applied in parallel and in a nondeterministic order, so the data on disk during batch application may not reflect the oplog accurately. The no-overlap timestamp therefore provides a unified notion of which timestamps are safely "visible" to timestamp readers. Beyond that timestamp, we may be in an undefined frontier, where reads may return incorrect or inconsistent data.
The no-overlap point is thus computed as the minimum of allDurable, which provides a safe timestamp on primaries, and lastApplied, which provides a safe timestamp on secondaries, since it prevents us from ever looking beyond the end of the last complete secondary batch.
In addition to the changes in how we compute the stable optime, we also added a few extra conditions around when we should avoid updating the stable timestamp, mainly related to initial sync. Previously, we would not add optimes to the stable optime candidates list during initial sync, which would prevent us from setting the stable timestamp or committed snapshot behind the oldest timestamp or the initialDataTimestamp. After removal of the candidates list, though, we need explicit checks that we don't set the stable timestamp behind the initialDataTimestamp after coming out of initial sync, and that we don't set it at all during initial sync, since that would run the risk of setting it behind the oldest timestamp.
Note that we are stuck using optimes for now when computing the stable optime, because the value is also used to set the _currentCommittedSnapshot, which is an optime. Eventually, though, it would be ideal to convert all of these values to timestamps only, since timestamps are sufficient for storage engine reads (and for local optime comparison). That will likely require a slightly larger refactor, since several other parts of the system are wired to use optimes, e.g. the write concern notification logic and the configOpTime in sharding. I don't think these are fundamental impediments, but they probably require some more careful thought on how to re-organize things to care only about timestamps. |
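The computation described in this comment can be sketched as follows. This is only an illustration, not the server code: timestamps are modeled as plain integers rather than optimes, and the function names `no_overlap_point` and `recalculate_stable` are hypothetical stand-ins for the logic attributed to `_recalculateStableOpTime`.

```python
def no_overlap_point(all_durable: int, last_applied: int) -> int:
    """Smallest of the two 'safe' frontiers.

    all_durable bounds what is safe on a primary (no oplog holes below it);
    last_applied bounds what is safe on a secondary (end of the last
    fully applied batch).
    """
    return min(all_durable, last_applied)


def recalculate_stable(all_durable: int, last_applied: int, last_committed: int) -> int:
    """Stable value = min(no-overlap point, lastCommittedOpTime).

    Assumption: timestamps are simple integers here; the real code
    compares optimes.
    """
    return min(no_overlap_point(all_durable, last_applied), last_committed)


# Commit point is ahead of what is safely visible, so the no-overlap point wins.
print(recalculate_stable(all_durable=10, last_applied=8, last_committed=9))
# Commit point lags behind, so it is the limiting value.
print(recalculate_stable(all_durable=10, last_applied=10, last_committed=7))
```

In the first call the secondary-side bound (lastApplied) is the limiting factor; in the second it is the commit point.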
| Comment by Githook User [ 30/Jun/20 ] |
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}
Message: |
| Comment by Githook User [ 30/Jun/20 ] |
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}
Message: |
| Comment by Judah Schvimer [ 11/Jun/20 ] |
I don't think this prevents setting the stable timestamp to an inconsistent timestamp immediately following leaving RECOVERING, after reaching minValid, but before minValid is majority committed. |
| Comment by William Schultz (Inactive) [ 09/Jun/20 ] |
|
Adding some notes on the behavior of the stable timestamp in different states that I encountered while testing out some changes for this ticket. This discusses details of the enableMajorityReadConcern:true case.
Initial Sync
During initial sync, we take unstable checkpoints, so the stable timestamp isn't really functionally important. Once we complete initial sync, we set our initialDataTimestamp to our lastApplied optime after finishing oplog application. Note that, throughout initial sync, we also continuously update our oldest_timestamp so we don't pin too much data in memory. Thus, we must impose certain restrictions on how we set the stable timestamp during and after leaving initial sync.
In the current system, we are able to avoid setting our stable timestamp earlier than the initialDataTimestamp post initial sync by not adding optimes to the candidates list during initial sync. Without a stable candidates list, however, we still need to ensure that we don't set our stable timestamp behind the initialDataTimestamp. Similarly, we must also not set the stable timestamp behind the oldest_timestamp. These are invariants enforced at the storage layer.
To prevent setting stableTimestamp < initialDataTimestamp after initial sync, we can explicitly disallow stable timestamp updates that are earlier than our current initialDataTimestamp. To prevent setting stableTimestamp < oldest_timestamp during initial sync, we can prevent stable timestamp updates entirely during initial sync, i.e. in STARTUP2 state. Note that we might be updating our lastCommittedOpTime during initial sync (since other nodes might be committing writes), so if we didn't explicitly disallow these updates, we might go on updating our stable timestamp incorrectly during or after coming out of initial sync.
To summarize this case, these are the two important invariants we need to ensure are satisfied during initial sync and after leaving it:
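A minimal sketch of the two guards discussed for initial sync, under simplifying assumptions: timestamps are plain integers, member state is a string, and the function name `choose_stable_update` is hypothetical rather than anything in the server.

```python
from typing import Optional

STARTUP2 = "STARTUP2"  # member state during initial sync


def choose_stable_update(candidate: int,
                         member_state: str,
                         initial_data_timestamp: int) -> Optional[int]:
    """Return the stable timestamp to set, or None to skip the update.

    Guard 1: never update during initial sync (STARTUP2), which avoids
             setting stableTimestamp < oldest_timestamp.
    Guard 2: never set stableTimestamp < initialDataTimestamp.
    """
    if member_state == STARTUP2:
        return None
    if candidate < initial_data_timestamp:
        return None
    return candidate


# During initial sync the commit point may advance, but no update is made.
print(choose_stable_update(50, STARTUP2, initial_data_timestamp=40))
# After initial sync, a candidate behind initialDataTimestamp is rejected.
print(choose_stable_update(30, "SECONDARY", initial_data_timestamp=40))
# A candidate at or beyond initialDataTimestamp is accepted.
print(choose_stable_update(45, "SECONDARY", initial_data_timestamp=40))
```

Returning `None` rather than clamping mirrors the text's wording of "disallow" and "prevent" updates; whether the real code skips or clamps is not something this sketch claims.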
Secondary
One of the original motivations for the stable optime candidates list was to ensure that we didn't set the stable timestamp to a timestamp that falls in the middle of a secondary oplog application batch. We enforce this by only adding optimes at secondary batch boundaries to the candidate set. It is now safe to set the stable timestamp to a timestamp in the middle of a batch, but we still need to be careful to not set the stable timestamp in the middle of a batch while it is being applied.
For example, consider a secondary that is applying a batch containing ops with times [1,2,3], and a node's lastCommittedOpTime has already advanced to time 3. Since we apply oplog entries in parallel, we might apply the op at time 3, advance our stable timestamp to 3 (since it is <= our commit point), then try to write the oplog entry at time 2, which would violate the invariant that we do not commit a storage transaction at a timestamp behind the stable timestamp.
To avoid this, we can constrain the stable timestamp to not surpass our current lastApplied optime, which will be set at the previous fully completed batch boundary. This prevents us from advancing the stable timestamp ahead of secondary batch writes that are at an earlier timestamp and have not yet completed. We can roughly summarize this case with one of the important invariants we need to uphold during secondary batch application:
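The [1,2,3] batch example can be replayed with a toy model. Everything here is an assumption made for illustration: `ToyStorage` and `apply_batch` are hypothetical names, timestamps are integers, and the storage layer is reduced to a single check that rejects commits behind the stable timestamp.

```python
class ToyStorage:
    """Toy storage engine: rejects commits behind the stable timestamp."""

    def __init__(self, stable: int = 0):
        self.stable = stable
        self.committed = []

    def set_stable(self, ts: int) -> None:
        # The stable timestamp only moves forward in this model.
        self.stable = max(self.stable, ts)

    def commit(self, ts: int) -> None:
        if ts < self.stable:
            raise RuntimeError(f"commit at {ts} is behind stable {self.stable}")
        self.committed.append(ts)


def apply_batch(apply_order, last_committed, last_applied, cap_at_last_applied):
    """Apply one batch in the given (out-of-order) sequence.

    After each op we try to advance the stable timestamp. With the cap,
    the candidate is limited by last_applied, which stays at the previous
    batch boundary until the whole batch has finished.
    """
    storage = ToyStorage(stable=last_applied)
    for ts in apply_order:
        storage.commit(ts)
        candidate = min(ts, last_committed)
        if cap_at_last_applied:
            candidate = min(candidate, last_applied)
        storage.set_stable(candidate)
    # Batch complete: lastApplied moves to the end of the batch.
    last_applied = max(apply_order)
    storage.set_stable(min(last_applied, last_committed))
    return storage


# Capped: the stable timestamp stays at 0 until the batch finishes.
ok = apply_batch([3, 2, 1], last_committed=3, last_applied=0, cap_at_last_applied=True)
print(ok.stable, sorted(ok.committed))

# Uncapped: applying op 3 first moves stable to 3, so op 2 is rejected.
try:
    apply_batch([3, 2, 1], last_committed=3, last_applied=0, cap_at_last_applied=False)
except RuntimeError as exc:
    print("violation:", exc)
```

The sequential loop stands in for parallel application; only the order in which ops reach storage matters for the invariant being illustrated.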
Primary
The behavior of the stable timestamp on primary is fairly straightforward, i.e. in most cases we can just set it to the lastCommittedOpTime directly. In replica sets with more than one voting node, ops cannot be replicated until they are behind the allDurable timestamp, so the lastCommittedOpTime is always constrained to be behind allDurable. In single voting node replica sets, however, the lastCommittedOpTime will be updated to whatever our lastApplied is, so it may be ahead of the allDurable timestamp. We do not want to set our stable timestamp to an "inconsistent" timestamp > allDurable, however, so we need to constrain the stable timestamp to be no greater than allDurable on primaries, i.e. the minimum of lastCommitted and allDurable. We can roughly summarize this case with these invariants (the second of which must hold true in any state):
Recovering
When we have not yet reached minValid, we do not add optimes to the stable candidates list, because they may be at "inconsistent" points. We want to address this issue when we remove the candidates list. Avoiding stable optime updates during a non-maintenance RECOVERING state may suffice to avoid setting the stable timestamp at an inconsistent point. However, it may be possible that we are in RECOVERING, have not reached minValid, and are also in maintenance mode, so we may need a way to detect explicitly if we have reached a consistent optime.
Startup Recovery
There was an invariant in place that dates back to 4.0, which verified that, if we have an appliedThrough value at startup, then it should be equal to the stable timestamp, since the appliedThrough is set to the last optime of a batch after it is applied, and we only set the stable timestamp on secondaries at batch boundaries. With removal of the stable optime candidates list, this will no longer be true, so it should be reasonable to remove this invariant. We can roughly summarize this case with the following invariant, which no longer must hold true:
|
| Comment by William Schultz (Inactive) [ 30/Apr/20 ] |
|
To make commits and reviews smaller, we can likely do this separately for EMRC=true and EMRC=false. Since we won't be removing the supporting logic for updating the stable optime candidates yet, it should be fine to temporarily have EMRC=true not use the stable optime candidates set while EMRC=false is still using it. |