[SERVER-80280] Consider introducing concept of draining internal readers after stepdown and before starting secondary oplog application Created: 21/Aug/23  Updated: 25/Sep/23  Resolved: 25/Sep/23

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-79955 Need a more complete mechanism for in... Closed
Related
is related to SERVER-59775 ReshardingDonorOplogIterator triggers... Closed
is related to SERVER-80200 Temporarily do not enforce constraint... Closed
is related to SERVER-48452 Internal readers should default to re... Closed
Assigned Teams:
Replication
Participants:

 Description   

The changes from SERVER-48452 enforce that internal readers on mongod must not read at ReadSource::kNoTimestamp while the mongod is in replica set member state SECONDARY. Instead, internal readers must relax their consistency model and read at the earlier ReadSource::kLastApplied; otherwise an fassert() is triggered. This is because reads at ReadSource::kNoTimestamp on a secondary would see partial effects of secondary oplog application as new snapshots are acquired, which could lead to other anomalous behavior.

However, this poses a problem because internal readers survive stepdown and would have been reading at ReadSource::kNoTimestamp while the mongod was in replica set member state PRIMARY. Services such as resharding and chunk migration have therefore triggered this fassert() in practice (SERVER-59775, SERVER-80200) despite there being no meaningful harm if they were to read at ReadSource::kNoTimestamp. Resharding and chunk migration would eventually fail with ErrorCodes::NotWritablePrimary in replica set member state SECONDARY because their anomalous read is always followed later by a write, which requires that the node still be the primary.

The shouldReadAtLastApplied() function consults the replica set member state to decide whether or not to trigger the fassert(). Reads at ReadSource::kNoTimestamp while the mongod is in replica set member state SECONDARY are still valid so long as secondary oplog application has not yet begun. But the shouldReadAtLastApplied() function cannot express this condition precisely enough because the ReplicationCoordinatorImpl doesn't offer a drain mode after stepdown in which services that only run in replica set member state PRIMARY are guaranteed to have quiesced. This may be for good reason: services acknowledge interruption only on a best-effort basis, and delaying the start of secondary oplog application could be worse for the application and for majority-commit latency.

bool shouldReadAtLastApplied(OperationContext* opCtx,
                             boost::optional<const NamespaceString&> nss,
                             std::string* reason) {
    ...
 
    // If this node can accept writes (i.e. primary), then no conflicting replication batches are
    // being applied and we can read from the default snapshot. If we are in a replication state
    // (like secondary or primary catch-up) where we are not accepting writes, we should read at
    // lastApplied.
    if (repl::ReplicationCoordinator::get(opCtx)->canAcceptWritesForDatabase(
            opCtx, DatabaseName::kAdmin)) {
        if (reason) {
            *reason = "primary";
        }
        return false;
    }
 
    // If we are not secondary, then we should not attempt to read at lastApplied because it may not
    // be available or valid. Any operations reading outside of the primary or secondary states must
    // be internal. We give these operations the benefit of the doubt rather than attempting to read
    // at a lastApplied timestamp that is not valid.
    if (!repl::ReplicationCoordinator::get(opCtx)->isInPrimaryOrSecondaryState(opCtx)) {
        if (reason) {
            *reason = "not primary or secondary";
        }
        return false;
    }
 
    ...
}
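
To make the gap concrete, here is a minimal standalone sketch of the two decisions being contrasted: one that looks only at the member state, and one that also knows whether secondary oplog application has begun. Everything in it (MemberState, NodeModel, the secondaryOplogApplicationStarted flag, both function names) is hypothetical and is not part of the ReplicationCoordinator API. It also ignores states such as primary catch-up, so it is only a simplified model of the idea.

```cpp
// Hypothetical model types; none of these exist in the MongoDB code base.
enum class MemberState { kPrimary, kSecondary, kOther };
enum class ReadDecision { kNoTimestampOk, kMustReadAtLastApplied };

struct NodeModel {
    MemberState state;
    // Assumed signal that a drain mode after stepdown would have to provide.
    bool secondaryOplogApplicationStarted;
};

// Coarse decision: only the member state is visible, so every read in
// SECONDARY is treated as unsafe at kNoTimestamp.
ReadDecision decideFromStateOnly(const NodeModel& node) {
    return node.state == MemberState::kSecondary
        ? ReadDecision::kMustReadAtLastApplied
        : ReadDecision::kNoTimestampOk;
}

// Finer decision: a read in SECONDARY is only unsafe at kNoTimestamp once
// oplog application has actually begun.
ReadDecision decideWithDrainWindow(const NodeModel& node) {
    if (node.state == MemberState::kSecondary && node.secondaryOplogApplicationStarted) {
        return ReadDecision::kMustReadAtLastApplied;
    }
    return ReadDecision::kNoTimestampOk;
}
```

The two functions differ only for a node that has stepped down but has not yet begun applying oplog entries, which is the window in which resharding and chunk migration readers are still draining.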



 Comments   
Comment by Opal Hoyt [ 25/Sep/23 ]

Closing this as Won't Fix based on the resolution for SERVER-79955 

Comment by Samyukta Lanka [ 21/Aug/23 ]

Lock free reads currently retry if the term from before the read does not match the term from after. max.hirschhorn@mongodb.com and I briefly discussed that another alternative might be providing a mode of lock free reads that kills the operation if the term changes during the course of a read.
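
A rough standalone sketch of the two behaviors mentioned in this comment, retrying versus killing the operation when the term changes across a read. The names readWithTermCheck, OnTermChange, and TermChangedError are invented for illustration, and the term is modeled as a plain atomic counter. The actual lock-free read path may check more than the term, so treat this only as a model of the idea.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical policy switch: retry (modeled on the current behavior
// described above) or kill (the alternative being discussed).
enum class OnTermChange { kRetry, kKill };

// Hypothetical error type standing in for an interruption of the operation.
struct TermChangedError : std::runtime_error {
    TermChangedError() : std::runtime_error("term changed during read") {}
};

// Runs readFn and compares the term observed before and after it.
// In kRetry mode the read is repeated until the term is stable;
// in kKill mode a term change aborts the read with TermChangedError.
template <typename ReadFn>
auto readWithTermCheck(const std::atomic<long long>& currentTerm,
                       OnTermChange mode,
                       ReadFn&& readFn) -> decltype(readFn()) {
    for (;;) {
        const long long termBefore = currentTerm.load();
        auto result = readFn();
        if (currentTerm.load() == termBefore) {
            return result;
        }
        if (mode == OnTermChange::kKill) {
            throw TermChangedError();
        }
        // kRetry: fall through and repeat the read under the new term.
    }
}
```

In the kill mode, an internal reader that started as primary would be stopped by the term change instead of silently continuing its read on a node that has since become secondary.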

Comment by Max Hirschhorn [ 21/Aug/23 ]

One alternative to adding this form of synchronization on stepdown could be to guarantee that all internal readers acquire the RSTL lock and are formally labeled with OperationContext::_alwaysInterruptAtStepDownOrUp. There isn't a mode of lock-free reads which expresses this particular combination, so some collaboration with the Storage Execution team would be needed.
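
A toy model of the labeling idea, assuming stepdown interrupts every operation that carries the flag. OpModel and interruptFlaggedOps are hypothetical stand-ins; the real flag is a private member of OperationContext, and the real kill path also involves the RSTL, which this sketch does not model.

```cpp
#include <vector>

// Hypothetical stand-in for an operation; not the real OperationContext.
struct OpModel {
    bool alwaysInterruptAtStepDownOrUp = false;  // mirrors the flag named above
    bool killed = false;
};

// Models a state transition: every flagged operation that is still running
// is marked as killed. Returns how many operations were interrupted.
int interruptFlaggedOps(std::vector<OpModel>& ops) {
    int interrupted = 0;
    for (auto& op : ops) {
        if (op.alwaysInterruptAtStepDownOrUp && !op.killed) {
            op.killed = true;
            ++interrupted;
        }
    }
    return interrupted;
}
```

Under this model, an internal reader without the label survives the transition, which is the situation the description is concerned about.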
