[SERVER-45008] Make sure allDatabaseCloner completion callback runs on correct executor Created: 06/Dec/19  Updated: 29/Oct/23  Resolved: 10/Dec/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.3.3

Type: Bug Priority: Major - P3
Reporter: Matthew Russotto Assignee: Matthew Russotto
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Sprint: Repl 2019-12-16
Participants:
Linked BF Score: 0

 Description   

The fix for SERVER-44809 was wrong, and this ticket reverts it. What actually appears to happen is that the lambda for the allDatabaseCloner is destroyed asynchronously, on the cloner executor, after the future is made ready. As a result, the finishCallback for the onCompletion guard can run on the cloner executor after we enter net->runUntil(), so the next attempt is scheduled too late.

Destroying the onCompletion shared pointer in the lambda while holding the initial syncer mutex ensures that the final destruction happens somewhere else, since at that point we know there are other references to the shared pointer (except in the shutdown case).
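
The idea can be illustrated with a minimal standalone sketch. It is not the server code: CompletionGuard, runGuardDemo, and the event strings are hypothetical names made up for the illustration, and a plain std::mutex stands in for the initial syncer mutex. The lambda drops its own copy of the shared pointer while other references still exist, so the guard's destructor (and the finish callback it triggers) cannot run as part of the lambda's later destruction.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the completion guard: runs a callback when destroyed.
class CompletionGuard {
public:
    explicit CompletionGuard(std::function<void()> onDestroy)
        : _onDestroy(std::move(onDestroy)) {}
    ~CompletionGuard() { _onDestroy(); }

private:
    std::function<void()> _onDestroy;
};

// Returns the order in which events happened.
std::vector<std::string> runGuardDemo() {
    std::vector<std::string> events;
    std::mutex mutex;  // stand-in for the initial syncer mutex (assumption)

    auto guard = std::make_shared<CompletionGuard>(
        [&events] { events.push_back("finish callback"); });

    // The lambda holds its own copy of the shared pointer.
    auto onCompletion = [&mutex, &events, guard]() mutable {
        std::lock_guard<std::mutex> lk(mutex);
        events.push_back("lambda body");
        // Another reference still exists, so this only decrements the count;
        // the guard's destructor does not run here.
        guard.reset();
    };
    assert(guard.use_count() == 2);  // the owner's reference plus the lambda's copy

    onCompletion();
    events.push_back("lambda returned");

    // Whenever the lambda object is destroyed later, it no longer owns the
    // guard. The final destruction happens where the last reference is dropped.
    guard.reset();
    events.push_back("owner released");
    return events;
}
```

In this sketch the finish callback runs at the owner's reset(), after the lambda has returned, regardless of when or where the lambda object itself is destroyed.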



 Comments   
Comment by Githook User [ 10/Dec/19 ]

Author: Matthew Russotto <matthew.russotto@mongodb.com> (mtrussotto)

Message: SERVER-45008 Make sure allDatabaseCloner completion callback runs on correct executor
Branch: master
https://github.com/mongodb/mongo/commit/782899c72e4af6440e8e6369066a14eee26ed348

Comment by Matthew Russotto [ 06/Dec/19 ]

Tested by creating a deliberately slow-to-destroy object to be move-captured in the lambda:

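    class slow_t {
    public:
        slow_t() {}
        // Log when destruction starts and ends, and sleep to make it slow.
        ~slow_t() {
            log() << "HELLO";
            sleep(1);
            log() << "GOODBYE";
        }
    } slow;
...
        // Move-capture the slow object so that destroying the lambda is slow too.
        .onCompletion([this, onCompletionGuard, so = std::move(slow)](Status status) mutable {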

Confirmed that this reproduced the hang. After adding the explicit destruction of the completion guard, the failure no longer occurred.
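
The test relies on the fact that an object move-captured by a lambda is destroyed when the lambda object is destroyed, not when the call returns. A small standalone sketch of that behavior follows. Tracer and runCaptureDemo are hypothetical names used only for this illustration, and everything runs on one thread, so it shows the ordering only, not the executor involved.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical tracer: records an event when a live (not moved-from) instance dies.
class Tracer {
public:
    explicit Tracer(std::vector<std::string>* events) : _events(events) {}
    Tracer(Tracer&& other) noexcept : _events(std::exchange(other._events, nullptr)) {}
    Tracer(const Tracer&) = delete;
    Tracer& operator=(const Tracer&) = delete;
    ~Tracer() {
        if (_events) {
            _events->push_back("capture destroyed");
        }
    }

private:
    std::vector<std::string>* _events;
};

// Returns the order in which events happened.
std::vector<std::string> runCaptureDemo() {
    std::vector<std::string> events;
    {
        Tracer tracer(&events);
        // Move-capture the tracer, as the test above does with the slow object.
        auto callback = [&events, t = std::move(tracer)]() {
            (void)t;  // the capture is held only for its lifetime
            events.push_back("callback ran");
        };
        callback();
        events.push_back("callback returned");
        // The capture is still alive here; it dies with the lambda object at
        // the end of this scope.
    }
    events.push_back("scope exited");
    return events;
}
```

Here "capture destroyed" is recorded after "callback returned", which is the window in which the slow destructor in the test delays the guard's finish callback.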

Generated at Thu Feb 08 05:07:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.