[SERVER-39356] Change stream updateLookup queries with speculative majority may return uncommitted data
Created: 01/Feb/19
Updated: 29/Oct/23
Resolved: 15/Mar/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.1.7
Fix Version/s: 4.1.10

Type: Bug
Priority: Major - P3
Reporter: William Schultz (Inactive)
Assignee: William Schultz (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-39753 getMore commands on aggregate cursors... Closed
Related
related to SERVER-39383 Speculative majority change stream up... Closed
related to SERVER-57197 [ephemeralForTest] improve error mess... Closed
related to SERVER-39364 Audit uses of setLastOpToSystemLastOp... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2019-02-25, Repl 2019-03-11, Repl 2019-03-25
Participants:
Linked BF Score: 57

 Description   

Speculative majority change streams that perform an update lookup query wait for a replica set node's most recent lastApplied optime to majority commit before returning results to the client. This is intended to guarantee that the data the client receives is majority committed. The guarantee can be violated, however, when a node's lastApplied optime lags behind the optime of the newest "storage committed" oplog entry. That is, there may be an oplog entry (and corresponding data operation) written to storage and visible to readers that the node's lastApplied optime does not yet reflect. This is possible because a primary advances its lastApplied optime inside the onCommit handler of an operation's transaction, so there is a nonzero window between the commit of the WriteUnitOfWork at the storage layer and the advance of the optime for that operation. If a concurrent reader observes the effects of such a transaction and reads lastApplied before the onCommit handler has fired, it may wait on an optime older than the data it read and return data that is not, in fact, majority committed. This issue affects primaries. On secondaries, lastApplied is updated only at the end of batch application, so the same problem does not manifest.
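
The window described above can be illustrated with a small, deterministic toy model. This is only a sketch: `ToyPrimary`, `storage_commit`, `on_commit`, and `unsafe_update_lookup` are hypothetical names invented for illustration, timestamps are plain integers, and none of it reflects the server's actual classes.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPrimary:
    """Hypothetical stand-in for a primary node (not a MongoDB class)."""
    storage: dict = field(default_factory=dict)  # doc id -> (value, timestamp)
    last_applied: int = 0
    majority_committed: int = 0

    def storage_commit(self, doc_id, value, ts):
        # Step 1: the storage-layer commit makes the write visible to readers.
        self.storage[doc_id] = (value, ts)

    def on_commit(self, ts):
        # Step 2: the commit handler runs later and advances lastApplied.
        self.last_applied = max(self.last_applied, ts)


def unsafe_update_lookup(node, doc_id):
    """Read local data, then choose lastApplied as the timestamp to wait on."""
    value, data_ts = node.storage[doc_id]
    wait_ts = node.last_applied
    return value, data_ts, wait_ts


node = ToyPrimary()
node.storage_commit("a", "v1", ts=1)
node.on_commit(1)
node.majority_committed = 1  # timestamp 1 is majority committed

# The write at timestamp 2 is visible, but its commit handler has not fired.
node.storage_commit("a", "v2", ts=2)
value, data_ts, wait_ts = unsafe_update_lookup(node, "a")
node.on_commit(2)

# The wait on wait_ts is already satisfied, yet the data came from timestamp 2.
wait_satisfied = wait_ts <= node.majority_committed
data_committed = data_ts <= node.majority_committed
print(value, data_ts, wait_ts, wait_satisfied, data_committed)
```

In this model the reader returns the value written at timestamp 2 after waiting only on timestamp 1, which is the contract violation the description reports.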



 Comments   
Comment by Githook User [ 18/Mar/19 ]

Author:

{'email': 'william.schultz@mongodb.com', 'name': 'William Schultz', 'username': 'will62794'}

Message: SERVER-39383 Add a test for speculative majority change stream secondary reads during batch application

This commit adds an integration test to verify that speculative majority change stream reads do not return incorrect results when reading concurrently with secondary batch application. The goal is to ensure that, due to the changes from SERVER-39356, these reads will read from the most recent lastApplied timestamp on secondaries.
Branch: master
https://github.com/mongodb/mongo/commit/c8120ddaf8a8bd9da9c8095165a4df485d5a58c9

Comment by Githook User [ 15/Mar/19 ]

Author:

{'email': 'william.schultz@mongodb.com', 'name': 'William Schultz', 'username': 'will62794'}

Message: SERVER-39356 Make speculative majority change stream reads utilize the `kNoOverlap` timestamp read source

Speculative majority change streams provide "majority" read guarantees by reading from a local snapshot of data and then waiting for that data to become majority committed, instead of reading directly from a majority committed snapshot. To satisfy this guarantee, a speculative majority read must wait for the proper timestamp to become majority committed after reading data. If the newest data it read reflects a timestamp T, then it must wait for a timestamp >= T to become majority committed. In general, waiting on replication's lastApplied timestamp is not safe, since writes can be visible to readers before they have advanced the in-memory value of lastApplied. To work around this issue for speculative majority reads, we instead read from an explicitly chosen timestamp in the storage engine, and then wait on that timestamp to majority commit. This gives us a more direct way to know what timestamp the data we read reflects. We utilize the `kNoOverlap` read source, which reads at min(lastApplied, all_committed); this is a convenient way to make these reads work correctly on both primaries and secondaries.
Branch: master
https://github.com/mongodb/mongo/commit/c83b8d8aab53f7545851a76425b2f2cd7c598cbd
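
The read protocol in the commit message above (pick an explicit read timestamp, read at it, then wait on that same timestamp) can be sketched roughly as follows. This is a toy model under stated assumptions: the function names are hypothetical, timestamps are integers, the document history is a plain list, and `all_committed` is passed in as a parameter instead of being obtained from a storage engine.

```python
def no_overlap_timestamp(last_applied, all_committed):
    # The read timestamp described for kNoOverlap: min(lastApplied, all_committed).
    return min(last_applied, all_committed)


def read_at(history, read_ts):
    """Return the newest value with timestamp <= read_ts.

    history is a list of (timestamp, value) pairs in ascending timestamp order.
    """
    result = None
    for ts, value in history:
        if ts <= read_ts:
            result = value
    return result


def speculative_majority_read(history, last_applied, all_committed):
    """Read at an explicit timestamp and return that timestamp to wait on."""
    read_ts = no_overlap_timestamp(last_applied, all_committed)
    return read_at(history, read_ts), read_ts


history = [(1, "v1"), (2, "v2")]

# The write at timestamp 2 is in storage, but lastApplied is still 1.
early_value, early_ts = speculative_majority_read(
    history, last_applied=1, all_committed=2)

# Once lastApplied has advanced, the read sees the newer write.
late_value, late_ts = speculative_majority_read(
    history, last_applied=2, all_committed=2)

print(early_value, early_ts, late_value, late_ts)
```

Because the value returned is never newer than the timestamp handed back to the caller, waiting for that timestamp to majority commit covers all of the data read, at least within this simplified model.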

Comment by Githook User [ 06/Mar/19 ]

Author:

{'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}

Message: SERVER-39356 Refactor speculative majority read data structures and methods to use timestamps instead of optimes

This patch refactors the SpeculativeMajorityReadInfo class and the awaitOpTimeCommitted method to accept timestamps as input instead of optimes. When waiting for an operation to majority commit, term information, which is included in optimes, isn't necessary, since timestamps are totally ordered within a local oplog and so are safely comparable. It is, for example, safe to determine whether a local oplog entry is majority committed by checking whether its timestamp is less than or equal to that node's local view of the majority commit point. This patch should not introduce any observable functional changes.
Branch: master
https://github.com/mongodb/mongo/commit/fcbc0c9c936c83612545ee5873b649854e4b5e57
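
The timestamp-only check described in the commit message above might look like the sketch below. `OpTime` here is a hypothetical two-field tuple with integer timestamps, not the server's type, and the comparison is inclusive on the assumption that the commit point itself names a committed entry.

```python
from typing import NamedTuple


class OpTime(NamedTuple):
    """Hypothetical optime: a timestamp paired with an election term."""
    timestamp: int
    term: int


def is_majority_committed(entry_ts, commit_point):
    # Only the timestamp is compared. The term is ignored because timestamps
    # are totally ordered within a single node's local oplog.
    return entry_ts <= commit_point.timestamp


local_oplog_timestamps = [1, 2, 3]
commit_point = OpTime(timestamp=2, term=5)

committed = [ts for ts in local_oplog_timestamps
             if is_majority_committed(ts, commit_point)]
print(committed)
```

The check is only meaningful for entries in the same node's oplog as the commit point view; comparing timestamps across nodes would still need term information.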

Comment by William Schultz (Inactive) [ 05/Feb/19 ]

This problem and its intended fix should be specific to primaries. The proposed solution is to have update lookup queries read from all_committed on a primary and wait on that timestamp to commit, so that the data read is guaranteed to have become committed. On secondaries, the issue is not quite the same, since lastApplied is only updated at the end of each batch. On secondaries, update lookup queries can read in the middle of batch application, which causes a different problem, referenced in SERVER-39383.
