[SERVER-44529] Re-acquiring locks after a yield and step down results in wrong parameters on recoveryUnit Created: 08/Nov/19  Updated: 29/Oct/23  Resolved: 15/May/20

Status: Closed
Project: Core Server
Component/s: Querying, Replication, Storage
Affects Version/s: None
Fix Version/s: 4.4.0-rc7, 4.2.9, 4.7.0

Type: Bug Priority: Critical - P2
Reporter: Matthew Russotto Assignee: Louis Williams
Resolution: Fixed Votes: 0
Labels: bkp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-46721 Step up may cause reads at PIT with h... Closed
Related
related to SERVER-49781 Setting lastApplied before startup re... Closed
is related to SERVER-47084 Support AutoGetCollectionForRead-like... Closed
is related to SERVER-48475 Secondary reads always calculate all_... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2019-12-02, Execution Team 2019-12-16, Execution Team 2020-01-13, Execution Team 2020-01-27, Execution Team 2020-02-10, Execution Team 2019-12-30, Execution Team 2020-02-24, Execution Team 2020-03-09, Execution Team 2020-03-23, Execution Team 2020-04-06, Execution Team 2020-05-04, Execution Team 2020-05-18
Participants:
Linked BF Score: 13

 Description   

When we acquire collection locks with AutoGetCollectionForRead on a primary, we by default set the TimestampReadSource on the recovery unit to kUnset. However, when we re-acquire locks after a yield in PlanExecutor, we still use that same TimestampReadSource even if we have stepped down. Since reads survive replication state changes, that can result in us reading from within the middle of a replication batch, which means reading possibly causally-related data out of order.

Note: This ticket initially described both step up and step down, but step up is now separated out into SERVER-46721.

Potential fix: After recovery from yield, change the read source to kNoOverlap if it was kUnset or kNoTimestamp.



 Comments   
Comment by Githook User [ 06/Aug/20 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-46721 Secondary readers should read at the no-overlap time instead of lastApplied

The no-overlap time, ReadSource::kNoOverlap, is the minimum of replication's lastApplied timestamp
and WiredTiger's all_durable time. This time is independent of replication state and ensures
queries do not see oplog holes after state transitions from secondary to primary.

(cherry picked from commit 25c694f365db0f07a445bd17b6cd5cbf32f5f2f9)
(cherry picked from commit 5ddf9db5635e4af2d1295bd2f1007e86d517c6c2)

SERVER-44529 Query yield recovery after a stepdown should switch to read at the no-overlap time

After yielding and reacquiring locks in a query, the preconditions that were used to select our
ReadSource initially need to be checked again. Queries hold an AutoGetCollectionForRead RAII
lock for their lifetime, which may select a ReadSource based on state (e.g. replication
state). After a query yields its locks, this state may have changed, invalidating our current
choice of ReadSource.

(cherry picked from commit b3a5b5258a487006f0487b9e7e0a1d0d4f1119ff)
(cherry picked from commit ce57ddce19e642401ec1b56871b3a9402116ccfc)

SERVER-48475 Reimplement lastApplied for secondary reads

This partially reverts work to use the kNoOverlap ReadSource on secondaries since the all_durable
calculation is unnecessary and expensive.

(cherry picked from commit ff92d4435fa75c2b947c49cffaf48805b320a5ae)
(cherry picked from commit 4e3f3b96826772baa777b7563b38574c1e11c2e4)
Branch: v4.2
https://github.com/mongodb/mongo/commit/caef6619556ddeb20adf8e2bad651b96d2b3ea70

Comment by Louis Williams [ 09/Jun/20 ]

Due to a small performance regression, I have partially reverted this work in SERVER-48475. Secondary reads no longer use kNoOverlap, but use kLastApplied as they did before SERVER-46721. Since the no-overlap point on secondaries is always equal to lastApplied, we were performing unnecessary all_durable calculations for every read.

Comment by Githook User [ 19/May/20 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-44529 Query yield recovery after a stepdown should switch to read at the no-overlap time

After yielding and reacquiring locks in a query, the preconditions that were used to select our
ReadSource initially need to be checked again. Queries hold an AutoGetCollectionForRead RAII
lock for their lifetime, which may select a ReadSource based on state (e.g. replication
state). After a query yields its locks, this state may have changed, invalidating our current
choice of ReadSource.

(cherry picked from commit b3a5b5258a487006f0487b9e7e0a1d0d4f1119ff)
Branch: v4.4
https://github.com/mongodb/mongo/commit/ce57ddce19e642401ec1b56871b3a9402116ccfc

Comment by Githook User [ 15/May/20 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-44529 Query yield recovery after a stepdown should switch to read at the no-overlap time

After yielding and reacquiring locks in a query, the preconditions that were used to select our
ReadSource initially need to be checked again. Queries hold an AutoGetCollectionForRead RAII
lock for their lifetime, which may select a ReadSource based on state (e.g. replication
state). After a query yields its locks, this state may have changed, invalidating our current
choice of ReadSource.
Branch: master
https://github.com/mongodb/mongo/commit/b3a5b5258a487006f0487b9e7e0a1d0d4f1119ff

Comment by Louis Williams [ 28/Apr/20 ]

Before SERVER-38511, reads did not survive the transition from primary to secondary, so I am declining the 4.0 backport.

Comment by Louis Williams [ 27/Apr/20 ]

I've spent some time rationalizing why the proposed solution is correct. The main problem that we want to avoid is allowing an operation to read without a timestamp and then switch to reading with a timestamp. Unaddressed, this visibility problem will cause will once-visible writes to vanish (see SERVER-41361), and can lead to crashes.

The proposed solution in the event of a stepdown after recovering from a yield is the following: if we were reading without a timestamp before stepdown, read at kNoOverlap, which is the minimum of lastApplied and all_durable, falling-back to the solution in SERVER-46721.

This is a scenario I believe demonstrates that it is not possible for a snapshot to "backtrack".  For example, say the all_durable is T1 because of a hole at T2. A client reads without a timestamp, seeing writes at T3, which is equal to lastApplied. If a stepdown occurs, all holes are closed because active transactions will either commit just-in-time or be aborted by stepdown. This means that immediately after stepdown, all_durable becomes equal to lastApplied, or T3 (assuming no other writes). More importantly, if we started a read without a timestamp just before stepdown, after stepdown it would be impossible for kNoOverlap (the minimum of all_durable and lastApplied) to move backward and read at a timestamp before T3. This should provide an example of why this solution is safe for replicated collections.

For the oplog: reading without a timestamp is always safe, regardless of the replication state. We track an oplog visibility, "no holes", timestamp. Our solution should take care to ensure the oplog opts-out of this new behavior.

Update: my last statement about reading from the oplog on secondaries being safe is not confirmed

Update: we need a different solution on 4.0 where the kNoOverlap ReadSource doesn't exist. We will probably need to partially backport SERVER-38986, where it was introduced.

Comment by Matthew Russotto [ 26/Feb/20 ]

It looks like this bug exists on 4.0, but the 3.6 AutoGetCollectionForRead is completely different so it shouldn't be affected.

Comment by Tess Avitabile (Inactive) [ 26/Feb/20 ]

Does this need to be backported further back than 4.2? I think it's the case on earlier versions that reads survive the transition from secondary to primary.

Comment by Steven Vannelli [ 11/Nov/19 ]

Assigning this to the Execution. Please reach out to Replication team if support is needed.

Generated at Thu Feb 08 05:06:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.