[SERVER-28107]  Augment command result with operationTime in mongos Created: 24/Feb/17  Updated: 05/Apr/17  Resolved: 22/Mar/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.5.5

Type: Task Priority: Major - P3
Reporter: Misha Tyulenev Assignee: Misha Tyulenev
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: Sharding 2017-03-27
Participants:

 Description   

While operationTime is returned by mongod, it is lost in translation in mongos. Several sharding passthrough tests fail because of this (the failures are in write ops, but it may affect other command results as well).
The test suites that currently exclude the failing files are:
sharding_jscore_passthrough
sharded_collections_jscore_passthrough

For 3.6, mongos will execute Strategy::commandOp and Strategy::writeOp with the new causal clients, so the responses need to include the latest operationTime received from the mongods.
Because commands in mongos are processed via ASIO, an arbitrary thread handles each response, so we need to augment the txn with a decoration that keeps the current operation's time.

Hence the task can be split into 4 parts:

1. Augment the txn with a decoration that keeps the current operation's time

class OperationTimeTracker {
public:
    // Decorate OperationContext with an OperationTimeTracker instance.
    static OperationTimeTracker* get(OperationContext* opCtx);

    LogicalTime getMaxOperationTime() const;

    // Updates _maxOperationTime to max(newTime, current).
    void updateOperationTime(LogicalTime newTime);

private:
    // Protects _maxOperationTime; mutable so the const getter can lock it.
    mutable stdx::mutex _mutex;
    LogicalTime _maxOperationTime;
};
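
A minimal sketch of how this could be implemented, assuming the standard Decorable pattern on OperationContext (the exact decoration wiring may differ in the final code). Taking the mutex on both paths keeps the max-fold safe when multiple ASIO threads report response times concurrently:

#include "mongo/db/logical_time.h"
#include "mongo/db/operation_context.h"
#include "mongo/stdx/mutex.h"

namespace {
// Attach one tracker to every OperationContext via the decoration mechanism.
const auto getTracker = OperationContext::declareDecoration<OperationTimeTracker>();
}  // namespace

OperationTimeTracker* OperationTimeTracker::get(OperationContext* opCtx) {
    return &getTracker(opCtx);
}

LogicalTime OperationTimeTracker::getMaxOperationTime() const {
    stdx::lock_guard<stdx::mutex> lk(_mutex);
    return _maxOperationTime;
}

void OperationTimeTracker::updateOperationTime(LogicalTime newTime) {
    stdx::lock_guard<stdx::mutex> lk(_mutex);
    if (_maxOperationTime < newTime) {
        _maxOperationTime = newTime;
    }
}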

2.
Accepted design:
Subclass TaskExecutor as ShardingTaskExecutor, forward all calls to the wrapped _executor member, and override scheduleRemoteCommand as follows:

class ShardingTaskExecutor : public TaskExecutor {
public:
    explicit ShardingTaskExecutor(std::unique_ptr<ThreadPoolTaskExecutor> executor);

    StatusWith<TaskExecutor::CallbackHandle> scheduleRemoteCommand(
        const RemoteCommandRequest& request, const RemoteCommandCallbackFn& cb) override {

        auto shardingCb = [opCtx = request.opCtx,
                           cb](const executor::TaskExecutor::RemoteCommandCallbackArgs& args) {
            cb(args);
            if (!args.response.isOK()) {
                return;
            }
            // Extract operationTime from args.response (elided here), then
            // fold it into the decoration on the client's txn.
            OperationTimeTracker::get(opCtx)->updateOperationTime(operationTime);
        };

        return _executor->scheduleRemoteCommand(request, shardingCb);
    }

private:
    std::unique_ptr<ThreadPoolTaskExecutor> _executor;
};

Create the ShardingTaskExecutor instance in the makeTaskExecutor method in sharding_initialization.cpp, as sketched below.
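
For illustration, the wrapping could look roughly like this (the ThreadPoolTaskExecutor construction is paraphrased from sharding_initialization.cpp, not verbatim):

std::unique_ptr<executor::TaskExecutor> makeTaskExecutor(
    std::unique_ptr<executor::NetworkInterface> net) {
    auto netPtr = net.get();
    auto inner = stdx::make_unique<executor::ThreadPoolTaskExecutor>(
        stdx::make_unique<executor::NetworkInterfaceThreadPool>(netPtr), std::move(net));
    // Wrap the real executor so every remote response passes through the
    // operationTime-tracking callback before it reaches the caller.
    return stdx::make_unique<executor::ShardingTaskExecutor>(std::move(inner));
}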

Alternative design (rejected):
Refactor EgressMetadataHook to process the command reply as well as the metadata.

  • Add a method readCommandReply(OperationContext* txn, const BSONObj& commandReply) to the EgressHook API
  • Implement readCommandReply in ShardingEgressMetadataHookForMongos; it should be invariant(false) in all other hook implementations (a rough sketch follows the list):
    a) parse the operationTime out of the commandReply
    b) get the tracker, i.e. OperationTimeTracker::get(txn)
    c) call updateOperationTime(operationTime)
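
For concreteness, the rejected hook variant might have looked something like the sketch below (the reply field name and the skip-on-absence behavior are assumptions, not taken from a real implementation):

Status ShardingEgressMetadataHookForMongos::readCommandReply(OperationContext* txn,
                                                             const BSONObj& commandReply) {
    // a) Parse the operationTime out of the command reply, if present.
    auto timeElem = commandReply["operationTime"];
    if (timeElem.type() == BSONType::bsonTimestamp) {
        // b) + c) Fetch the tracker decoration and fold in the new time.
        OperationTimeTracker::get(txn)->updateOperationTime(LogicalTime(timeElem.timestamp()));
    }
    return Status::OK();
}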

3. Access the tracker when the results from the shards arrive at mongos:
a) In AsyncCommand::response, extract the OperationContext from op->request() and make sure it is not null, as it potentially can be.
b) Pass the OperationContext to decodeRPC, so that decodeRPC has both the commandReply and the txn. The txn must exist if metadata is attached. (Question to resolve: should this be enforced with an invariant, return an error status, or log an error and skip processing the commandReply?)
c) Call the EgressMetadataHook::readCommandReply method, as sketched below.
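
Putting a) through c) together, the hand-off might look roughly like the following (names are simplified from the actual NetworkInterfaceASIO code; the open question above is resolved here by skipping reply processing when no txn is attached):

void onCommandReply(OperationContext* txn,
                    const BSONObj& commandReply,
                    rpc::EgressMetadataHook* hook) {
    if (!txn) {
        // No client operation to attribute this reply to; skip processing.
        return;
    }
    uassertStatusOK(hook->readCommandReply(txn, commandReply));
}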

4. The commandOp results are finalized in mongo::execCommandClient, so in that function the current value of operationTime stored in the decoration should be used in the reply to the client:

    auto commandOperationTime = OperationTimeTracker::get(txn)->getMaxOperationTime();
    // Append the tracked time to the result document returned to the client.
    result.append("operationTime", commandOperationTime.asTimestamp());



 Comments   
Comment by Githook User [ 22/Mar/17 ]

Author: Misha Tyulenev (mikety) <misha@mongodb.com>

Message: SERVER-28107 augment command result with operationTime in mongos
Branch: master
https://github.com/mongodb/mongo/commit/4d364a4c951bb05639335d5989c1f85e79af78fa

Comment by Kaloian Manassiev [ 14/Mar/17 ]

My only comment would be to not call the argument you capture in the lambda opCtx, so it is not confused with the operation context of the thread that is running the lambda. Maybe call it clientCommandOpCtx instead.

Another thing which could be considered is to make the OperationTimeTracker decoration a shared pointer instead and only capture that in the lambda. That way, even if the original client operation returns and destroys the operation context, the task executor's thread will not access an invalid pointer.
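
A sketch of what that could look like, assuming OperationTimeTracker::get is changed to return a std::shared_ptr and that getOperationTimeFrom is a hypothetical helper that parses the time out of the response:

// Capture the shared tracker, not the OperationContext itself.
std::shared_ptr<OperationTimeTracker> tracker = OperationTimeTracker::get(clientCommandOpCtx);
auto shardingCb = [tracker, cb](const executor::TaskExecutor::RemoteCommandCallbackArgs& args) {
    cb(args);
    if (!args.response.isOK()) {
        return;
    }
    // Safe even if the client operation has already finished and its
    // OperationContext has been destroyed: the capture keeps the tracker alive.
    tracker->updateOperationTime(getOperationTimeFrom(args.response));
};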

Comment by Githook User [ 10/Mar/17 ]

Author: Jack Mulrow (jsmulrow) <jack.mulrow@mongodb.com>

Message: SERVER-28107 Implement OperationTimeTracker
Branch: master
https://github.com/mongodb/mongo/commit/76b8e0443d4b1e0a0533d04768c7c0339bc2ad0d

Comment by Misha Tyulenev [ 09/Mar/17 ]

We can follow up and remove operationTime processing from the hooks once the project to introduce scatter/gather shard processing (esha.maharishi is working on it) is committed. Until then I feel the hooks approach is acceptable, as it allows us to make progress faster.
Agreed NOT to change the class names; removed from the spec.

Comment by Samantha Ritter (Inactive) [ 09/Mar/17 ]

I also don't feel a hook is the right place for this. The existing hooks are meant to allow you to interject things into the NetworkInterfaceASIO state machine, which is otherwise a black box. For example, metadata is parsed in the middle of the state machine, so if you needed to access that particular point in the connection you couldn't do it without the provided hook. The task you are describing involves parsing the command reply, which happens at the end of the state machine, not in the middle of the state machine. In other words, you do not need to be inside the black box of NIA to do this work. I understand why a hook that runs right before NIA returns your response would be a convenient place to put this, but logically it feels like a mismatch. I also think the performance counter-argument is worth investigating.

If we/you do decide to go with the hooks approach anyway, I feel very strongly against changing the name of the hook to "EgressHook," that name doesn't mean anything to me. The hooks' names should tell the user the point at which they are interrupting the NIA state machine (for example EgressMetadataHook implies that we are hooking into the machine at the point where we receive and process metadata).

Comment by Kaloian Manassiev [ 09/Mar/17 ]

renctan - if we don't do it through the hooks, we risk someone at some point introducing a new network command execution path which doesn't do the cluster time aggregation, causing incorrect values to be returned to the client. By doing it in the hooks and at the time when the command response is generated, we capture the beginning of the request and the end of it, and nobody has to think about it.

The argument about picking up the time of the config server is not really a problem, because if we read some chunks metadata with a later cluster time, chances are that whichever shard receives a read with that cluster time will also need that up-to-date metadata as well.

Comment by Misha Tyulenev [ 09/Mar/17 ]

Per offline discussion, the choice between hooks and ARM/parallel is the choice between simplicity and performance, as the hooks approach may introduce extra wait on the client. However, "hooks" offer a uniform approach that will take care of all the code paths. The consensus is to go with the proposed "hooks" approach.

Comment by Randolph Tan [ 09/Mar/17 ]

Not sure the hooks are the right place to do it. One argument against it is that it's indiscriminate - it will pick up other responses that are not directly related to the current operation. For example, a chunk manager refresh causes it to store the operation time from the config server while executing a write command to the shards. I feel like the operation time can be better handled by the one aggregating the results, like the AsyncResultsMerger/ParallelSortClusteredCursor or whatever the replacement will be called.

If we are going for the hook approach, maybe we should just modify the existing method signature to accept txn and rename it to readReply instead of creating a new one since the new OP_COMMAND protocol will no longer have a separate metadata section.

Comment by Misha Tyulenev [ 09/Mar/17 ]

kaloian.manassiev renctan samantha.ritter could you please review the proposal and submit feedback via whichever channel is convenient for you?
