[SERVER-27772] processing afterClusterTime > clusterTime on secondary Created: 20/Jan/17  Updated: 06/Dec/17  Resolved: 07/Apr/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.5.6

Type: Task Priority: Major - P3
Reporter: Misha Tyulenev Assignee: Misha Tyulenev
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-28559 appendOplogNote command needs to ensu... Closed
Related
related to SERVER-28559 appendOplogNote command needs to ensu... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2017-03-27, Sharding 2017-04-17
Participants:

 Description   

In a multi-shard environment, a mongos that sends read requests to multiple secondaries may run into performance degradation if readAfterClusterTime specifies a time that is ahead of the primary's clusterTime. In this case a secondary, which learns the time only at the moment the command arrives, will communicate with the primary via a heartbeat and then wait for the oplog to replicate far enough to satisfy the readConcern. This can add up to a significant delay: a 2 sec heartbeat wait + a 10 sec noop writer wait + the replication wait, i.e. > 10 seconds.
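For illustration, a read of roughly this shape can hit the slow path (a minimal sketch; the collection name and timestamp values are made up):

// Hypothetical secondary read whose afterClusterTime may be ahead of the
// primary's clusterTime; all concrete values are illustrative only.
BSONObj cmd = BSON("find" << "coll"
                          << "readConcern"
                          << BSON("level" << "local"
                                          << "afterClusterTime" << Timestamp(1492000000, 1)));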

To solve this, the secondary's LogicalTime_LOG needs to be advanced. For that the secondary will need to:

  • communicate the new clusterTime to the primary so the primary can set its LogicalTime_MEM; once the primary receives the message it will advance its time as described in the “Primary operation” subsection;
  • wait until the oplog entry with the aforementioned clusterTime is replicated.

1. Add a global function noopWrite. Note: it is global because there is no good place for it; I plan to put it into read_concern.cpp.
(Alternatively it can be a method on ReplicationCoordinator.)

Status ReplicationCoordinatorImpl::noopWrite(OperationContext* opCtx, BSONObj msgObj, StringData note) {
    // Take the global lock with a short timeout so a no-op write never blocks
    // for long behind an exclusive global lock holder.
    Lock::GlobalLock lock(opCtx, MODE_IX, 1);
    if (!lock.isLocked()) {
        return {ErrorCodes::LockFailed, "Global lock is not available"};
    }
    opCtx->lockState()->lockMMAPV1Flush();

    // canAcceptWritesForDatabase is a proxy for being primary; passing "local"
    // would make it return true even on a secondary, so check "admin" instead.
    auto replCoord = repl::ReplicationCoordinator::get(opCtx);
    if (!replCoord->canAcceptWritesForDatabase(opCtx, "admin")) {
        return {ErrorCodes::NotMaster, "Not a primary"};
    }

    // Write the no-op message to the oplog, retrying on write conflicts.
    MONGO_WRITE_CONFLICT_RETRY_LOOP_BEGIN {
        WriteUnitOfWork uow(opCtx);
        opCtx->getClient()->getServiceContext()->getOpObserver()->onOpMessage(opCtx, msgObj);
        uow.commit();
    }
    MONGO_WRITE_CONFLICT_RETRY_LOOP_END(opCtx, note, repl::rsOplogName);
    return Status::OK();
}

2. To catch the oplog up to the cluster time, do a noop write to the local oplog if on the primary node, or send the command to the primary if on a secondary.

Status makeNoopWriteIfNeeded(OperationContext* opCtx, LogicalTime clusterTime) {
    auto replCoord = repl::ReplicationCoordinator::get(opCtx);

    // Compare against the opTime of the latest local write, which is the
    // highest opTime this node has.
    auto lastAppliedTime = LogicalTime(replCoord->getMyLastAppliedOpTime().getTimestamp());
    if (clusterTime > lastAppliedTime) {
        // Use Shard::runCommand with a "PrimaryOnly" readPreference and
        // idempotent retries, so NotMaster and transient network errors are
        // retried automatically.
        auto shardingState = ShardingState::get(opCtx);
        invariant(shardingState);
        auto myShard =
            Grid::get(opCtx)->shardRegistry()->getShard(opCtx, shardingState->getShardName());
        if (!myShard.isOK()) {
            return myShard.getStatus();
        }

        // TODO: file a JIRA ticket to return CannotTargetItself if this node
        // becomes primary; catch that status and issue a direct noopWrite.
        auto swRes = myShard.getValue()->runCommand(
            opCtx,
            ReadPreferenceSetting(ReadPreference::PrimaryOnly),
            "admin",
            BSON("appendOplogNote" << 1 << "data" << BSON("append noop write" << 1)),
            Shard::RetryPolicy::kIdempotent);
        return swRes.getStatus();
    }
    return Status::OK();
}

3. Call makeNoopWriteIfNeeded from waitForReadConcern so that it attempts to catch up the oplog, as in the sketch below.
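A minimal sketch of the hook, assuming waitForReadConcern has access to the parsed ReadConcernArgs (the accessor name getArgsClusterTime and the surrounding signature are assumptions, not the final code):

Status waitForReadConcern(OperationContext* opCtx, const repl::ReadConcernArgs& readConcernArgs) {
    // If the request carries afterClusterTime, try to catch the oplog up
    // before falling into the usual (possibly long) wait path.
    if (auto afterClusterTime = readConcernArgs.getArgsClusterTime()) {  // assumed accessor
        auto status = makeNoopWriteIfNeeded(opCtx, *afterClusterTime);
        if (!status.isOK()) {
            return status;
        }
    }
    // ... existing waitForReadConcern logic continues here ...
    return Status::OK();
}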

4. Rewrite appendOplogNote:

a. use a MONGO_INITIALIZER instead of static initialization (see the registration sketch after the run method below)
b. in the run method, call the ReplicationCoordinatorImpl::noopWrite method unless the cmdObj has a clusterTime <= lastAppliedOpTime

    virtual bool run(OperationContext* opCtx,
                     const string& dbname,
                     BSONObj& cmdObj,
                     int,
                     string& errmsg,
                     BSONObjBuilder& result) {
        // Extract the mandatory "data" object, which becomes the payload of
        // the no-op oplog entry.
        BSONElement dataElement;
        auto dataStatus = bsonExtractTypedField(cmdObj, "data", Object, &dataElement);
        if (!dataStatus.isOK()) {
            return appendCommandStatus(result, dataStatus);
        }
        auto replCoord = repl::ReplicationCoordinator::get(opCtx);
        if (!replCoord->isReplEnabled()) {
            return appendCommandStatus(
                result,
                {ErrorCodes::NoReplicationEnabled,
                 "Must have replication set up to run \"appendOplogNote\""});
        }
        return appendCommandStatus(result, noopWrite(opCtx, dataElement.Obj(), "appendOplogNote"));
    }
};
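A sketch of the registration change from item (a), following the usual command self-registration pattern in the codebase (the class name AppendOpLogNoteCmd is assumed):

namespace {

// Commands register themselves in the Command base-class constructor, so
// constructing the instance from a MONGO_INITIALIZER (instead of a static
// object) gives deterministic initialization order.
MONGO_INITIALIZER(RegisterAppendOpLogNoteCmd)(InitializerContext* context) {
    new AppendOpLogNoteCmd();  // intentionally leaked; self-registers
    return Status::OK();
}

}  // namespace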



 Comments   
Comment by Githook User [ 07/Apr/17 ]

Author: Misha Tyulenev (mikety) <misha@mongodb.com>

Message: SERVER-27772 Force noop write on primary when readConcern:afterClusterTime > clusterTime on secondaries
Branch: master
https://github.com/mongodb/mongo/commit/fed0e2cc62a65067c7f2991554206a95d2619172

Comment by Kaloian Manassiev [ 05/Apr/17 ]

LGTM, but please file a ticket to consider what the effects can be of having multiple concurrent requests potentially doing wasteful no-op writes. Also, the read preference for the runCommand cannot be "secondary", because you want it to find the primary.

Comment by Misha Tyulenev [ 05/Apr/17 ]

Thanks renctan, yes and yes.
Note the changes in step 4: since it no longer uses the clusterTime from the command, it makes an unconditional write (because the logical time is always >= lastAppliedOpTime). That will exacerbate the problem Kal mentioned if concurrent threads are writing here.

Comment by Randolph Tan [ 05/Apr/17 ]

I think the overall structure looks good. I'll have to defer to someone else on the proper locking inside ReplicationCoordinatorImpl::noopWrite. Here are some minor comments:

  • makeNoopWriteIfNeeded should be called outside the mutex in ReplicationCoordinatorImpl::_waitUntilClusterTimeForRead
  • I thought we agreed that appendOplogNote doesn't need the extra clusterTime parameter?
Comment by Misha Tyulenev [ 04/Apr/17 ]

kaloian.manassiev, renctan, spencer thanks for the feedback. Summary of updates:
a. Move noopWrite and makeNoopWriteIfNeeded to read_concern.cpp as global functions. I'm personally OK putting them anywhere (i.e. making them methods of ReplicationCoordinator to avoid decoration lookups), so please express your preference if you disagree.
b. Use myLastAppliedOpTime in makeNoopWriteIfNeeded().
c. Use Shard::runCommand with a "secondary" readPreference and an idempotent retry policy. This version may miss the cases of transition from/to primary, but that should be OK since in those cases the oplog will catch up quickly with other writes.
I will defer addressing Kal's concerns about:

  • the race with many threads writing the noop
  • refactoring readConcern
Comment by Kaloian Manassiev [ 04/Apr/17 ]

How about we try to not involve the replication coordinator at all except perhaps for a fast check of whether we are primary or not?

If you hook it into waitForReadConcern, you can do a quick (unsafe) check for two things:

  • Is the afterClusterTime > clusterTime (using the logical clock directly)
  • Is the node primary

If the node is primary, opportunistically call the noopWrite function directly. This might fail because the node is no longer master, but you can just loop around and execute the same code and checks again. If it is not, use ShardRegistry::runCommand to call into the primary - this call will ensure that the primary is found and will do the necessary retries on NotMaster and other transient network errors.
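A minimal sketch of the loop described above (the function name and command payload are assumptions for illustration, not reviewed code):

// Hypothetical retry loop: opportunistically write locally while primary,
// otherwise route the write to the primary via the shard registry.
Status advanceClusterTime(OperationContext* opCtx, LogicalTime clusterTime) {
    // A quick, unsafe check of afterClusterTime against the logical clock
    // would go here to short-circuit when no catch-up is needed.
    while (true) {
        auto replCoord = repl::ReplicationCoordinator::get(opCtx);
        if (replCoord->canAcceptWritesForDatabase(opCtx, "admin")) {
            // Primary: opportunistically write the no-op entry directly.
            auto status = noopWrite(opCtx, BSON("noop write" << 1), "appendOplogNote");
            if (status != ErrorCodes::NotMaster) {
                return status;  // success, or a non-retryable error
            }
            // Stepped down between the check and the write; loop and retry.
        } else {
            // Secondary: route the write to the primary. Shard::runCommand
            // retries NotMaster and transient network errors internally.
            return makeNoopWriteIfNeeded(opCtx, clusterTime);
        }
    }
}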

Another thing which I think is important to keep in mind is to not have multiple threads do the no-op write for the same value, so it might be useful to spawn a thread which will write the largest value found.

I also noticed that ShardLocal is using waitUntilOpTimeForRead from the ReplicationCoordinator directly and duplicates the functionality from waitForReadConcern. Not sure why this is necessary, but also because this code only runs on the primary and is only called by internal threads (not threads receiving user requests) I think it is safe to leave it as is.

Comment by Randolph Tan [ 04/Apr/17 ]

Comments for _makeNoopWriteIfNeeded_inlock:

Why does it call _getCurrentCommittedLogicalTime_inlock? I think it should be comparing against the opTime of the latest write, since that is the highest opTime the current node has.

Comment by Misha Tyulenev [ 31/Mar/17 ]

Back to the drawing board :)
spencer, kaloian.manassiev, renctan please let me know your feedback
