  1. Core Server
  2. SERVER-29382

Fix cancel race related to the timeout in ReplicationCoordinatorImpl::CatchupState

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 3.4.6, 3.5.9
    • Affects Version/s: None
    • Component/s: Replication
    • None
    • Fully Compatible
    • ALL
    • v3.4
    • Repl 2017-06-19
    • 0

      If we call ReplicationCoordinatorImpl::CatchupState::abort_inlock() while the cancel callback lambda is in progress, we may delete the CatchupState out from under the cancel callback.

      We can fix this by:
      1) Capturing the mutex reference directly in the callback lambda rather than accessing it via this->_repl. This mutex belongs to the ReplicationCoordinatorImpl, not the CatchupState, so it survives deletion of the CatchupState.

      2) Checking for callback cancellation only after acquiring the mutex. This lets a late cancel take effect as long as it is issued while the lock is held, which is always the case because CatchupState may only be accessed under the lock.
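
The two steps above can be sketched in standalone C++. This is a minimal illustration, not the actual server code: Coordinator, CallbackHandle, makeTimeoutCallback and the two demo functions are hypothetical stand-ins, std::mutex replaces the server's mutex type, and the cancellation flag is modeled as a shared handle that outlives the CatchupState.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>

// Hypothetical stand-in for an executor callback handle. It is shared with
// the callback, so it outlives the CatchupState.
struct CallbackHandle {
    bool canceled = false;  // guarded by Coordinator::mutex
};

struct Coordinator;

struct CatchupState {
    Coordinator* repl;
    std::shared_ptr<CallbackHandle> handle = std::make_shared<CallbackHandle>();

    explicit CatchupState(Coordinator* r) : repl(r) {}
    std::function<void()> makeTimeoutCallback();
    void abort_inlock();  // caller must hold Coordinator::mutex
};

// Hypothetical stand-in for ReplicationCoordinatorImpl.
struct Coordinator {
    std::mutex mutex;
    std::unique_ptr<CatchupState> catchupState;
    bool timedOut = false;
};

std::function<void()> CatchupState::makeTimeoutCallback() {
    // Step 1: capture the coordinator's mutex by reference now, instead of
    // reading it through this->repl when the callback runs.
    std::mutex& mutex = repl->mutex;
    auto h = handle;
    return [this, &mutex, h] {
        std::lock_guard<std::mutex> lk(mutex);
        // Step 2: check cancellation only after taking the lock. If the flag
        // is set, `this` may already be deleted and must not be touched.
        if (h->canceled) {
            return;
        }
        repl->timedOut = true;
        abort_inlock();
    };
}

void CatchupState::abort_inlock() {
    handle->canceled = true;
    repl->catchupState.reset();  // destroys *this; no member access after this
}

// A timeout that fires after the abort must be a harmless no-op.
bool lateTimeoutIsIgnored() {
    Coordinator coord;
    coord.catchupState = std::make_unique<CatchupState>(&coord);
    auto timeoutCb = coord.catchupState->makeTimeoutCallback();
    {
        std::lock_guard<std::mutex> lk(coord.mutex);
        coord.catchupState->abort_inlock();
    }
    timeoutCb();  // the CatchupState is already gone here
    return !coord.timedOut && coord.catchupState == nullptr;
}

// A timeout that fires first aborts the catch-up itself.
bool timeoutAbortsCatchup() {
    Coordinator coord;
    coord.catchupState = std::make_unique<CatchupState>(&coord);
    auto timeoutCb = coord.catchupState->makeTimeoutCallback();
    timeoutCb();
    return coord.timedOut && coord.catchupState == nullptr;
}
```

In this sketch the callback touches only the captured mutex and the shared handle before it knows whether it was canceled, so it never dereferences a deleted CatchupState.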

            siyuan.zhou@mongodb.com Siyuan Zhou
            matthew.russotto@mongodb.com Matthew Russotto