Core Server / SERVER-27772

processing afterClusterTime > clusterTime on secondary

    • Type: Task
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 3.5.6
    • Affects Version/s: None
    • Component/s: Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Sprint: Sharding 2017-03-27, Sharding 2017-04-17

      In a multi-shard environment, a mongos that sends read requests to multiple secondaries may run into a performance degradation if the readAfterClusterTime specifies a time that is ahead of the primary's clusterTime. In this case a secondary, which learns the new time the moment the command arrives, will communicate with the primary via a heartbeat and then wait for the oplog to replicate far enough to satisfy the readConcern. This can add up to a significant delay: 2 sec heartbeat wait + 10 sec noop writer wait + replication wait, i.e. > 10 seconds.

      To solve this, the secondary's LogicalTime_LOG needs to be advanced. For that the secondary will need to:

      • communicate the new clusterTime to the primary so the primary can set its LogicalTime_MEM; once the primary receives the message it will advance its time as described in the “Primary operation” subsection;
      • wait until the oplog entry with the aforementioned clusterTime is replicated.

      1. Add a global function noopWrite. Note: it is global because there is no good place for it; I plan to put it into read_concern.cpp.
      (alternatively it can be a method on ReplicationCoordinator)

      Status ReplicationCoordinatorImpl::noopWrite(OperationContext* opCtx, BSONObj msgObj, StringData note) {
          Lock::GlobalLock lock(opCtx, MODE_IX, 1);
          if (!lock.isLocked()) {
              return {ErrorCodes::LockFailed, "Global lock is not available"};
          }
          opCtx->lockState()->lockMMAPV1Flush();

          // canAcceptWritesForDatabase("admin") is a proxy for being the primary;
          // note that passing "local" would cause it to return true even on a secondary.
          auto replCoord = repl::ReplicationCoordinator::get(opCtx);
          if (!replCoord->canAcceptWritesForDatabase(opCtx, "admin")) {
              return {ErrorCodes::NotMaster, "Not a primary"};
          }

          MONGO_WRITE_CONFLICT_RETRY_LOOP_BEGIN {
              WriteUnitOfWork uow(opCtx);
              opCtx->getClient()->getServiceContext()->getOpObserver()->onOpMessage(opCtx, msgObj);
              uow.commit();
          }
          MONGO_WRITE_CONFLICT_RETRY_LOOP_END(opCtx, note, repl::rsOplogName);
          return Status::OK();
      }
      

      2. To catch the oplog up to the cluster time, do a noop write to the local oplog if on the primary node, or send a request to the primary if on a secondary.

      Status makeNoopWriteIfNeeded(OperationContext* opCtx, LogicalTime clusterTime) {
          auto replCoord = repl::ReplicationCoordinator::get(opCtx);
          auto lastAppliedTime = LogicalTime(replCoord->getMyLastAppliedOpTime().getTimestamp());
          if (clusterTime > lastAppliedTime) {
              // use Shard::runCommand with a PrimaryOnly readPreference and idempotent retries
              auto shardingState = ShardingState::get(opCtx);
              invariant(shardingState);
              auto myShard =
                  Grid::get(opCtx)->shardRegistry()->getShard(opCtx, shardingState->getShardName());
              if (!myShard.isOK()) {
                  return myShard.getStatus();
              }
              // TODO: add jira to return CannotTargetItself if it becomes primary;
              // catch it in the status and issue a direct noopWrite
              auto swRes = myShard.getValue()->runCommand(
                  opCtx,
                  ReadPreferenceSetting(ReadPreference::PrimaryOnly),
                  "admin",
                  BSON("appendOplogNote" << 1 << "data" << BSON("append noop write" << 1)),
                  Shard::RetryPolicy::kIdempotent);
              return swRes.getStatus();
          }
          return Status::OK();
      }
      

      3. Call makeNoopWriteIfNeeded from waitForReadConcern so that it attempts to catch up the oplog.

      4. Rewrite appendOplogNote

      a. use MONGO_INITIALIZER instead of static initialization
      b. in the run method, call ReplicationCoordinatorImpl::noopWrite unless the cmdObj has a clusterTime <= lastAppliedOpTime

          virtual bool run(OperationContext* opCtx,
                           const string& dbname,
                           BSONObj& cmdObj,
                           int,
                           string& errmsg,
                           BSONObjBuilder& result) {
              BSONElement dataElement;
              auto dataStatus = bsonExtractTypedField(cmdObj, "data", Object, &dataElement);
              if (!dataStatus.isOK()) {
                  return appendCommandStatus(result, dataStatus);
              }
              auto replCoord = repl::ReplicationCoordinator::get(opCtx);
              if (!replCoord->isReplEnabled()) {
                  return appendCommandStatus(
                      result,
                      {ErrorCodes::NoReplicationEnabled,
                       "Must have replication set up to run \"appendOplogNote\""});
              }
              return appendCommandStatus(result, noopWrite(opCtx, dataElement.Obj(), "appendOplogNote"));
          }
      };
      
      

            Assignee:
            Misha Tyulenev (misha.tyulenev@mongodb.com)
            Reporter:
            Misha Tyulenev (misha.tyulenev@mongodb.com)
            Votes: 0
            Watchers: 5
