[SERVER-39383] Speculative majority change stream update lookup on secondary may return pre-image Created: 05/Feb/19  Updated: 29/Oct/23  Resolved: 18/Mar/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.10

Type: Bug Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: William Schultz (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-33846 Alternative for setting oplog read ti... Closed
is related to SERVER-39356 Change stream updateLookup queries wi... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2019-02-25, Repl 2019-03-11, Repl 2019-03-25
Participants:

 Description   

Currently, speculative majority change streams on secondaries read from a local snapshot and then, after executing, wait for the most recent lastApplied optime to become majority committed. For update lookup queries, reading at a local snapshot means that nothing prevents them from reading in the middle of secondary batch application, since majority read concern reads do not, by default, read from the current lastApplied timestamp. Because we write oplog entries before we apply them on secondaries, an update lookup query may return a document version earlier than its associated change event. To fix this, speculative change stream oplog queries and update lookup queries should both read from the lastApplied timestamp; after executing, they can wait for that read timestamp to become majority committed. This ensures that update lookup queries always read data at least as new as their associated change event.
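
The following is a minimal, standalone C++ sketch of the race described above and of why capping reads at lastApplied avoids it. It is a toy model only: the types and names (ToySecondary, changeEvents, lookupPostImage, and the integer timestamps) are illustrative stand-ins, not actual server classes:

    // Toy model of the pre-image race: oplog entries are written before they
    // are applied, so a lookup at a local snapshot can lag a returned event.
    #include <cassert>
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    using Timestamp = uint64_t;

    struct OplogEntry {
        Timestamp ts;
        int docId;
        int newValue;  // value the update sets
    };

    struct ToySecondary {
        std::vector<OplogEntry> oplog;                    // entries are *written* here first
        std::map<int, std::pair<Timestamp, int>> docs;    // docId -> (last applied ts, value)
        Timestamp lastApplied = 0;                        // newest fully *applied* batch boundary

        // Step 1 of batch application: write the oplog entries.
        void writeOplogBatch(const std::vector<OplogEntry>& batch) {
            for (const auto& e : batch) oplog.push_back(e);
        }
        // Step 2: apply them and advance lastApplied.
        void applyOplogBatch(const std::vector<OplogEntry>& batch) {
            for (const auto& e : batch) docs[e.docId] = {e.ts, e.newValue};
            lastApplied = batch.back().ts;
        }

        // Change stream oplog scan, capped at 'readTs'.
        std::vector<OplogEntry> changeEvents(Timestamp readTs) const {
            std::vector<OplogEntry> out;
            for (const auto& e : oplog)
                if (e.ts <= readTs) out.push_back(e);
            return out;
        }
        // Update lookup: fetch the current version of the document.
        int lookupPostImage(int docId) const { return docs.at(docId).second; }
    };

    int main() {
        ToySecondary node;
        node.writeOplogBatch({{1, 7, 100}});
        node.applyOplogBatch({{1, 7, 100}});

        // A new batch arrives: the entry is written but not yet applied.
        std::vector<OplogEntry> batch = {{2, 7, 200}};
        node.writeOplogBatch(batch);

        // A read that is not capped below the written-but-unapplied entry returns
        // the ts=2 change event, but the lookup still sees the ts=1 value: a pre-image.
        auto localEvents = node.changeEvents(/*readTs=*/2);
        assert(localEvents.back().ts == 2 && node.lookupPostImage(7) == 100);

        // Reading at lastApplied instead hides the unapplied entry, so an event is
        // never returned before its effect is visible to the lookup.
        auto cappedEvents = node.changeEvents(node.lastApplied);
        assert(cappedEvents.back().ts == 1);

        node.applyOplogBatch(batch);
        assert(node.changeEvents(node.lastApplied).back().ts == 2 &&
               node.lookupPostImage(7) == 200);
        return 0;
    }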



 Comments   
Comment by Githook User [ 18/Mar/19 ]

Author: William Schultz <william.schultz@mongodb.com> (GitHub: will62794)

Message: SERVER-39383 Add a test for speculative majority change stream secondary reads during batch application

This commit adds an integration test to verify that speculative majority change stream reads do not return incorrect results when reading concurrently with secondary batch application. The goal is to verify that, with the changes from SERVER-39356, these reads read from the most recent lastApplied timestamp on secondaries.
Branch: master
https://github.com/mongodb/mongo/commit/c8120ddaf8a8bd9da9c8095165a4df485d5a58c9

Comment by William Schultz (Inactive) [ 13/Mar/19 ]

Code review URL:

https://mongodbcr.appspot.com/453960002/

Comment by Tess Avitabile (Inactive) [ 11/Mar/19 ]

Based on the above, I would agree that this bug does not exist.

Comment by William Schultz (Inactive) [ 11/Mar/19 ]

Ok, so I have one additional relevant finding. When we are not doing document post-image lookups, change streams only read the oplog. When doing update lookups, though, we call into MongoInterfaceStandalone to find a document's post-image. The MongoInterfaceStandalone::lookupSingleDocument method calls MongoInterfaceStandalone::_getCollectionDefaultCollator, which, as it turns out, takes a MODE_IS collection lock using the AutoGetCollection interface. AutoGetCollection doesn't include any special logic to let operations avoid conflicting with secondary batch application, i.e. they will block against the PBWM (parallel batch writer mode) lock. This should imply that update lookup queries done locally can never run concurrently with batch application, i.e. they can only read strictly before or after application of a batch. The change stream oplog read can still run while the batch is being applied, but once it gets to the update lookup query, it will block until the batch is complete and the PBWM lock is released. So, at least with the current implementation of update lookup queries, it shouldn't be possible for them to return a pre-image of their associated change event.
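
As a rough illustration of the blocking behaviour described above, here is a standalone C++ sketch that uses std::shared_mutex as a stand-in for the PBWM lock: the batch applier holds it exclusively for the duration of a batch, so a reader that takes even a shared lock (loosely analogous to AutoGetCollection acquiring MODE_IS) can only observe pre- or post-batch state. All names here are illustrative, not the server's lock manager:

    // Toy model: a reader that must take the "PBWM" lock cannot run in the
    // middle of batch application; it blocks until the batch is finished.
    #include <chrono>
    #include <iostream>
    #include <shared_mutex>
    #include <thread>

    std::shared_mutex pbwmLock;  // stand-in for the parallel batch writer mode lock
    int docValue = 100;          // the document the update lookup would fetch

    void applyBatch() {
        std::unique_lock<std::shared_mutex> exclusive(pbwmLock);  // held for the whole batch
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
        docValue = 200;  // the batch applies the update before releasing the lock
    }

    int updateLookup() {
        // Blocks while the batch applier holds the lock exclusively, so it can
        // only read before or after a batch, never in the middle of one.
        std::shared_lock<std::shared_mutex> shared(pbwmLock);
        return docValue;
    }

    int main() {
        std::thread applier(applyBatch);
        std::this_thread::sleep_for(std::chrono::milliseconds(50));  // let the batch start
        int seen = updateLookup();  // blocks until applyBatch releases the lock
        applier.join();
        std::cout << "update lookup saw value " << seen << " (post-batch)\n";
        return 0;
    }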

Comment by William Schultz (Inactive) [ 08/Mar/19 ]

Your point is valid; I think I need to revise my view a bit. Above I said that the visibility point on secondaries can never be advanced in the middle of batch application, but as you said, that is not accurate. The condition variable in the oplog journal thread loop is only signalled by WiredTigerOplogManager::triggerJournalFlush, which is only called in the two places referenced above. It should certainly be possible, though, for that condition variable to be woken up at any time, or to be signalled from inside WiredTigerRecoveryUnit::_txnClose. I had trouble reproducing this case, perhaps because nothing happened to call triggerJournalFlush() from within _txnClose, but it should be possible for that to happen (or for the CV to be woken up spuriously). I will need to do more work to come up with a clean repro of this.
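
A small standalone C++ sketch of the concern raised here, under simplified assumptions (the names oplogVisibilityThread, newestDurableTs, and visibleTs are illustrative, not the server's implementation): because a condition-variable wait can return on a timeout or a spurious wakeup, the visibility thread may publish a newly written timestamp even though nothing explicitly signalled it at a batch boundary:

    // Toy model: the visibility thread advances the visible timestamp on *every*
    // wakeup, so it does not need an explicit signal to make written-but-unapplied
    // entries visible.
    #include <atomic>
    #include <chrono>
    #include <condition_variable>
    #include <cstdint>
    #include <iostream>
    #include <mutex>
    #include <thread>

    std::mutex visibilityMutex;
    std::condition_variable visibilityCV;
    std::atomic<uint64_t> newestDurableTs{0};  // newest oplog entry written so far
    std::atomic<uint64_t> visibleTs{0};        // what oplog readers are allowed to see
    std::atomic<bool> done{false};

    // Stand-in for the oplog visibility thread: on every wakeup -- explicit
    // signal, timeout, or spurious wakeup -- it publishes whatever is durable
    // at that moment.
    void oplogVisibilityThread() {
        std::unique_lock<std::mutex> lk(visibilityMutex);
        while (!done) {
            visibilityCV.wait_for(lk, std::chrono::milliseconds(10));
            visibleTs = newestDurableTs.load();
        }
    }

    int main() {
        std::thread vis(oplogVisibilityThread);

        // Mid-batch: entries up to ts=7 have been written but not applied, and no
        // triggerJournalFlush()-style signal has been sent. A timed or spurious
        // wakeup is still enough for the thread to publish ts=7.
        newestDurableTs = 7;
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        std::cout << "visible timestamp advanced to " << visibleTs.load()
                  << " without an explicit signal\n";

        done = true;
        visibilityCV.notify_one();
        vis.join();
        return 0;
    }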

Comment by Tess Avitabile (Inactive) [ 08/Mar/19 ]

"So, this means that the oplog visibility point on secondaries should never advance in the middle of batch application."

I'm confused, since this was not my understanding of the system. The fact that oplog readers can read a batch that has not yet been applied is what caused the invariant failure on secondaries in SERVER-38499. We considered fixing SERVER-38499 by not updating the all-committed timestamp until the end of batch application, but decided not to, since it would slow down majority writes.

Comment by William Schultz (Inactive) [ 07/Mar/19 ]

After investigating the code to see whether this bug is possible, here are some of my findings. My goal was to exercise the case where a change stream read executes in the middle of secondary batch application, after oplog entries have been written but before they have been applied. The theory was that this could cause the query to return a batch of oplog entries that haven't been applied yet, so that an update lookup query would read data earlier than the change events returned. It turns out that the oplog visibility point on secondaries is updated synchronously with the completion of batch application, i.e. only once the oplog entries have been both written and applied. This is done inside the replicationBatchIsComplete() method, which calls WiredTigerOplogManager::triggerJournalFlush(), which signals the oplog visibility thread to wake up and advance the oplog read timestamp. The other place we trigger journal flushes like this is inside WiredTigerRecoveryUnit::_txnClose(), but there are special rules for when to actually trigger a journal flush there. The comment there indicates this:

// We only need to update oplog visibility where commits can be out-of-order with
// respect to their assigned optime and such commits might otherwise be visible.
// This should happen only on primary nodes.
_oplogManager->triggerJournalFlush();

So, this means that the oplog visibility point on secondaries should never advance in the middle of batch application. This implies that we will never read oplog entries past the newest timestamp of the most recent successful oplog batch, since change stream oplog scans are restricted from reading past the visibility timestamp. This should prevent the case of update lookup queries reading the pre-image of a returned change event.
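
As a simplified model of the argument in the preceding paragraph (which the later comments above revisit), the following standalone C++ sketch shows visibility advancing only when a batch completes, with the _txnClose()-style path skipping the trigger on secondaries. All names here are illustrative and greatly simplified; this is not the server code:

    // Toy model of the signalling scheme: on a secondary, only batch completion
    // advances the oplog visibility point, so oplog scans capped at that point
    // never see entries from an in-progress batch.
    #include <cassert>
    #include <cstdint>

    struct ToyOplogManager {
        uint64_t visibleTs = 0;        // oplog read/visibility timestamp
        uint64_t newestWrittenTs = 0;  // newest oplog entry written so far

        void triggerJournalFlush() { visibleTs = newestWrittenTs; }
    };

    struct ToySecondary {
        ToyOplogManager oplogManager;
        bool isPrimary = false;

        // Step 1 of batch application: write the entries. The _txnClose()-style
        // path only triggers a flush where commits can be out of order with their
        // assigned optimes, i.e. on primaries, so visibility stays put here.
        void writeOplogEntries(uint64_t newestTs) {
            oplogManager.newestWrittenTs = newestTs;
            if (isPrimary)
                oplogManager.triggerJournalFlush();
        }

        // Step 2: apply the batch; the replicationBatchIsComplete()-style path
        // always triggers the flush, so visibility advances only here.
        void replicationBatchIsComplete() { oplogManager.triggerJournalFlush(); }

        // Change stream oplog scans never read past the visibility timestamp.
        uint64_t newestVisibleEntry() const { return oplogManager.visibleTs; }
    };

    int main() {
        ToySecondary node;
        node.writeOplogEntries(5);
        node.replicationBatchIsComplete();
        assert(node.newestVisibleEntry() == 5);  // visible once the batch completes

        node.writeOplogEntries(9);               // next batch written, not yet applied
        assert(node.newestVisibleEntry() == 5);  // ...and not yet visible to oplog scans

        node.replicationBatchIsComplete();
        assert(node.newestVisibleEntry() == 9);
        return 0;
    }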

Since speculative majority change stream queries do not read at the lastApplied timestamp, though, update lookup queries may still be at risk of reading concurrently with secondary batch application, so I may still need to investigate that issue. My understanding is that one of the reasons we do not allow reads during secondary batch application is that we may temporarily violate unique index constraints. I wonder whether, since update lookup queries are always on a single document, they would ever be able to observe an inconsistent state. Even though writes are applied in parallel on secondaries, every document should presumably go through the same series of states as it did on the primary, even if the global database state is not consistent. It's still probably not a great idea to allow update lookup reads during batch application, though.

Comment by William Schultz (Inactive) [ 07/Feb/19 ]

This is a hypothesized bug but has not been reproduced yet.
