Core Server / SERVER-80280

Consider introducing concept of draining internal readers after stepdown and before starting secondary oplog application

    • Type: New Feature
    • Resolution: Won't Fix
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Replication
    • Labels:
    • Replication

      The changes from SERVER-48452 enforce that internal readers on mongod must not read at ReadSource::kNoTimestamp while the mongod is in replica set member state SECONDARY. Instead, the internal readers must support relaxing their consistency model and read at the earlier ReadSource::kLastApplied or an fassert() is triggered. This is because reads at ReadSource::kNoTimestamp on a secondary would otherwise see partial effects of secondary oplog application as new snapshots are acquired and would potentially lead to other anomalous behavior.
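
      The rule above can be modelled in a few lines. This is a minimal standalone sketch, not the server's code: MemberState, chooseReadSource(), the canRelaxToLastApplied flag, and the UnsafeReadSource exception (standing in for the fassert()) are all hypothetical names introduced for illustration.

```cpp
#include <stdexcept>

// Hypothetical stand-ins for illustration; these are not the real MongoDB types.
enum class MemberState { kPrimary, kSecondary };
enum class ReadSource { kNoTimestamp, kLastApplied };

// Thrown in place of the server's fassert() so the sketch can be exercised safely.
struct UnsafeReadSource : std::logic_error {
    using std::logic_error::logic_error;
};

// Picks the read source for an internal reader under the rule described above.
// 'canRelaxToLastApplied' is an assumed flag saying whether the reader tolerates
// the earlier kLastApplied snapshot.
ReadSource chooseReadSource(MemberState state, bool canRelaxToLastApplied) {
    if (state == MemberState::kPrimary) {
        // No oplog batches are being applied, so an untimestamped read is safe.
        return ReadSource::kNoTimestamp;
    }
    if (!canRelaxToLastApplied) {
        // An untimestamped read on a secondary could observe a partially applied batch.
        throw UnsafeReadSource("kNoTimestamp read while in state SECONDARY");
    }
    return ReadSource::kLastApplied;
}
```

      In this model, a reader that started on a primary and cannot relax its read source hits the failure path as soon as the state flips to SECONDARY.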

      However, this poses a problem because internal readers survive stepdown and would have been reading at ReadSource::kNoTimestamp while the mongod was in replica set member state PRIMARY. Services such as resharding and chunk migration have therefore triggered this fassert() in practice (SERVER-59775, SERVER-80200) despite there being no meaningful harm if they were to read at ReadSource::kNoTimestamp. Resharding and chunk migration would eventually fail with ErrorCodes::NotWritablePrimary in replica set member state SECONDARY because their anomalous read is always followed by a write, which requires the node to still be the primary.

      The shouldReadAtLastApplied() function consults the replica set member state to decide whether or not to trigger the fassert(). Reads at ReadSource::kNoTimestamp while the mongod is in replica set member state SECONDARY are still valid so long as secondary oplog application has not yet begun. But the shouldReadAtLastApplied() function cannot express this condition precisely enough because the ReplicationCoordinatorImpl doesn't offer a drain mode after stepdown in which services that only run in replica set member state PRIMARY are guaranteed to have quiesced. And it may be for good reason: services acknowledge interruption only on a best-effort basis, and delaying the start of secondary oplog application could be worse for the application and for majority-commit latency.

      bool shouldReadAtLastApplied(OperationContext* opCtx,
                                   boost::optional<const NamespaceString&> nss,
                                   std::string* reason) {
          // If this node can accept writes (i.e. primary), then no conflicting replication batches are
          // being applied and we can read from the default snapshot. If we are in a replication state
          // (like secondary or primary catch-up) where we are not accepting writes, we should read at
          // lastApplied.
          if (repl::ReplicationCoordinator::get(opCtx)->canAcceptWritesForDatabase(
                  opCtx, DatabaseName::kAdmin)) {
              if (reason) {
                  *reason = "primary";
              }
              return false;
          }

          // If we are not secondary, then we should not attempt to read at lastApplied because it may not
          // be available or valid. Any operations reading outside of the primary or secondary states must
          // be internal. We give these operations the benefit of the doubt rather than attempting to read
          // at a lastApplied timestamp that is not valid.
          if (!repl::ReplicationCoordinator::get(opCtx)->isInPrimaryOrSecondaryState(opCtx)) {
              if (reason) {
                  *reason = "not primary or secondary";
              }
              return false;
          }

          // ...
      }
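
      For discussion, the drain mode considered in this ticket might look roughly like the sketch below. It is only an illustration under stated assumptions: PrimaryOnlyReaderGate and its methods are hypothetical, nothing like it exists in ReplicationCoordinatorImpl, and the timeout reflects the concern that acknowledging interruption is best-effort.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical helper for illustration only; not part of the MongoDB code base.
// Tracks readers that only run while the node is primary so that stepdown can
// wait for them to quiesce before secondary oplog application begins.
class PrimaryOnlyReaderGate {
public:
    // Called by a primary-only service before it reads at kNoTimestamp.
    // Returns false once draining has begun, so no new readers are admitted.
    bool tryEnter() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_draining) {
            return false;
        }
        ++_active;
        return true;
    }

    // Called when a reader admitted by tryEnter() has finished.
    void exit() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (--_active == 0) {
            _cv.notify_all();
        }
    }

    // Called after stepdown, before secondary oplog application would start.
    // Returns true if every admitted reader finished within 'timeout'.
    bool drain(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lk(_mutex);
        _draining = true;
        return _cv.wait_for(lk, timeout, [this] { return _active == 0; });
    }

    // Called on step-up so that primary-only readers are admitted again.
    void reopen() {
        std::lock_guard<std::mutex> lk(_mutex);
        _draining = false;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    int _active = 0;
    bool _draining = false;
};
```

      If drain() returns false, the caller still has to choose between delaying secondary oplog application further and proceeding with readers outstanding, which is the trade-off described above.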

            backlog-server-repl [DO NOT USE] Backlog - Replication Team
            max.hirschhorn@mongodb.com Max Hirschhorn