[SERVER-29382] Fix cancel race related to the timeout in ReplicationCoordinatorImpl::CatchupState Created: 26/May/17  Updated: 30/Oct/23  Resolved: 12/Jun/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.4.6, 3.5.9

Type: Bug Priority: Major - P3
Reporter: Matthew Russotto Assignee: Siyuan Zhou
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4
Sprint: Repl 2017-06-19
Participants:
Linked BF Score: 0

 Description   

If we call ReplicationCoordinatorImpl::CatchupState::abort_inlock() at a time when the cancel callback lambda is in progress, we may delete the CatchupState out from under the cancel callback.

We can fix this by:
1) Capturing the mutex reference directly in the callback lambda rather than accessing it via this->_repl. This mutex belongs to the ReplicationCoordinatorImpl, not the CatchupState, so it survives deletion of the CatchupState.

2) Checking for callback cancellation only after acquiring the mutex. This lets a late cancel take effect, provided it is issued while the lock is held, which is always the case because CatchupState may only be accessed under the lock.
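
The two steps above can be sketched in a small, single-threaded model. This is not the actual patch: Coordinator, CallbackHandle, timeoutsHandled and makeTimeoutCallback are hypothetical stand-ins for ReplicationCoordinatorImpl, the executor's callback handle and the real timeout lambda, and the "executor" is simply the caller invoking the stored callback late.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the executor's callback handle.
struct CallbackHandle {
    bool canceled = false;
};

class CatchupState;

// Hypothetical stand-in for ReplicationCoordinatorImpl: owns the mutex and the state.
class Coordinator {
public:
    std::mutex mutex;
    std::unique_ptr<CatchupState> catchupState;
    int timeoutsHandled = 0;
};

class CatchupState {
public:
    CatchupState(Coordinator* repl, std::shared_ptr<CallbackHandle> handle)
        : _repl(repl), _handle(std::move(handle)) {}

    std::function<void()> makeTimeoutCallback() {
        // Step 1: capture the coordinator's mutex by reference, so the lambda
        // never has to read this->_repl just to find the lock.
        std::mutex& mutex = _repl->mutex;
        std::shared_ptr<CallbackHandle> handle = _handle;
        return [this, &mutex, handle] {
            std::lock_guard<std::mutex> lk(mutex);
            // Step 2: check cancellation only once the lock is held. If we
            // were canceled, `this` may already be destroyed, so do not touch it.
            if (handle->canceled) {
                return;
            }
            ++_repl->timeoutsHandled;
            abort_inlock();  // destroys *this; nothing below may use members
        };
    }

    // Must be called with the coordinator's mutex held.
    void abort_inlock() {
        _handle->canceled = true;
        _repl->catchupState.reset();  // self-destruct, as in the original code
    }

private:
    Coordinator* _repl;
    std::shared_ptr<CallbackHandle> _handle;
};

// The abort wins the race and the callback runs afterwards.
int timeoutsAfterLateCancel() {
    Coordinator repl;
    auto handle = std::make_shared<CallbackHandle>();
    std::function<void()> cb;
    {
        std::lock_guard<std::mutex> lk(repl.mutex);
        repl.catchupState.reset(new CatchupState(&repl, handle));
        cb = repl.catchupState->makeTimeoutCallback();
        repl.catchupState->abort_inlock();  // cancel and delete under the lock
    }
    cb();  // late callback: locks the surviving mutex, sees the cancel, returns
    return repl.timeoutsHandled;
}

// The timeout fires while the state is still alive.
int timeoutsAfterNormalFire() {
    Coordinator repl;
    auto handle = std::make_shared<CallbackHandle>();
    repl.catchupState.reset(new CatchupState(&repl, handle));
    std::function<void()> cb = repl.catchupState->makeTimeoutCallback();
    cb();
    return repl.timeoutsHandled;
}
```

In this model the late callback only touches the captured mutex and its own copy of the handle before it learns that it was canceled, which is the property the two steps are meant to guarantee.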



 Comments   
Comment by Githook User [ 16/Jun/17 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-29382 Fix cancel race related to the timeout in ReplicationCoordinatorImpl::CatchupState

(cherry picked from commit a66d41d71a4b7215ad6ba37adc392352da6ec477)
Branch: v3.4
https://github.com/mongodb/mongo/commit/c4ede1aaa722758584fe8436e8ffc2ead62d4d94

Comment by Githook User [ 12/Jun/17 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-29382 Fix cancel race related to the timeout in ReplicationCoordinatorImpl::CatchupState
Branch: master
https://github.com/mongodb/mongo/commit/a66d41d71a4b7215ad6ba37adc392352da6ec477

Comment by Benety Goh [ 26/May/17 ]

It's probably a bad idea to let CatchupState delete itself, especially when it does not wait for the canceled callback to complete.

https://github.com/mongodb/mongo/blob/b035e46ec65088885d8b934af235481f294af77f/src/mongo/db/repl/replication_coordinator_impl.cpp#L2542

replication_coordinator_impl.cpp

void ReplicationCoordinatorImpl::CatchupState::abort_inlock() {
    invariant(_repl->_getMemberState_inlock().primary());

    log() << "Exited primary catch-up mode.";
    // Clean up its own members.
    if (_timeoutCbh) {
        _repl->_replExecutor->cancel(_timeoutCbh);
    }
    if (_waiter) {
        _repl->_opTimeWaiterList.remove_inlock(_waiter.get());
    }

    // Enter primary drain mode.
    _repl->_enterDrainMode_inlock();
    // Destruct the state itself.
    _repl->_catchupState.reset(nullptr);
}
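
As an illustration of the lifetime concern raised in this comment, here is a minimal sketch of one alternative ownership scheme. It is not what the committed fix does, and it is a different technique from the one in the description: the callback holds a std::weak_ptr to the state and simply finds it expired if the owner dropped it first. Owner, State, timeoutsHandled and abortCatchup_inlock are hypothetical names, not MongoDB APIs.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>

struct Owner;  // hypothetical stand-in for ReplicationCoordinatorImpl

// Hypothetical stand-in for CatchupState, owned through a shared_ptr.
struct State : std::enable_shared_from_this<State> {
    explicit State(Owner* o) : owner(o) {}
    std::function<void()> makeTimeoutCallback();
    Owner* owner;
};

struct Owner {
    std::mutex mutex;
    std::shared_ptr<State> state;
    int timeoutsHandled = 0;

    // The owner, not the state, drops the reference. Must hold `mutex`.
    void abortCatchup_inlock() { state.reset(); }
};

std::function<void()> State::makeTimeoutCallback() {
    std::mutex& mutex = owner->mutex;
    std::weak_ptr<State> weak = shared_from_this();
    return [&mutex, weak] {
        std::lock_guard<std::mutex> lk(mutex);
        std::shared_ptr<State> self = weak.lock();
        if (!self) {
            return;  // state was already dropped; nothing to do
        }
        ++self->owner->timeoutsHandled;
        // `self` keeps the state alive until the end of this lambda.
        self->owner->abortCatchup_inlock();
    };
}

// The owner aborts first, then the callback runs late.
int lateCallbackIsNoOp() {
    Owner owner;
    owner.state = std::make_shared<State>(&owner);
    std::function<void()> cb = owner.state->makeTimeoutCallback();
    {
        std::lock_guard<std::mutex> lk(owner.mutex);
        owner.abortCatchup_inlock();
    }
    cb();
    return owner.timeoutsHandled;
}

// The callback fires while the state is still alive.
int normalFireHandlesTimeout() {
    Owner owner;
    owner.state = std::make_shared<State>(&owner);
    std::function<void()> cb = owner.state->makeTimeoutCallback();
    cb();
    return owner.timeoutsHandled;
}
```

The trade-off in this sketch is shared ownership and an extra reference-count check per callback, in exchange for never running a member function on an object that has been freed.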

Generated at Thu Feb 08 04:20:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.