Core Server / SERVER-29382

Fix cancel race related to the timeout in ReplicationCoordinatorImpl::CatchupState

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4.6, 3.5.9
    • Component/s: Replication
    • Labels:
      None

      Description

      If we call ReplicationCoordinatorImpl::CatchupState::abort_inlock() while the timeout callback lambda is in progress, we may delete the CatchupState out from under the callback, which then touches a destroyed object when it reaches the mutex through this->_repl.

      We can fix this by:
      1) Capturing the mutex reference directly in the callback lambda rather than accessing it via this->_repl. This mutex belongs to the ReplicationCoordinatorImpl, not the CatchupState, so it survives deletion of the CatchupState.

      2) Checking for callback cancellation only after acquiring the mutex. This allows a late cancel to work, provided the cancel is issued while the lock is held, which is always the case because CatchupState must only be accessed under the lock.
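
      The two steps above can be sketched roughly as follows. This is a minimal standalone illustration, not the actual server patch: Coordinator, CallbackHandle, makeTimeoutCallback(), timeoutsHandled and the two demo functions are hypothetical stand-ins, and the real code uses the server's executor and mutex types rather than the standard library ones shown here.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>

// Hypothetical stand-in for an executor callback handle; not the real API.
struct CallbackHandle {
    bool canceled = false;  // read and written only while the owner's mutex is held
};

struct Coordinator;  // simplified stand-in for ReplicationCoordinatorImpl

class CatchupState {
public:
    explicit CatchupState(Coordinator* repl);

    // Builds the timeout callback that an executor would run later.
    std::function<void()> makeTimeoutCallback();

    // Caller must hold the coordinator's mutex.
    void abort_inlock() { _handle->canceled = true; }

private:
    Coordinator* _repl;
    std::shared_ptr<CallbackHandle> _handle = std::make_shared<CallbackHandle>();
};

struct Coordinator {
    std::mutex mutex;                            // outlives any CatchupState
    std::unique_ptr<CatchupState> catchupState;  // guarded by mutex
    int timeoutsHandled = 0;                     // guarded by mutex
};

CatchupState::CatchupState(Coordinator* repl) : _repl(repl) {}

std::function<void()> CatchupState::makeTimeoutCallback() {
    // (1) Capture the mutex by reference here, while `this` is known to be
    //     alive, so the callback never needs this->_repl to find the mutex.
    return [this, &mutex = _repl->mutex, handle = _handle]() {
        std::lock_guard<std::mutex> lk(mutex);
        // (2) Check cancellation only after taking the lock. If the state was
        //     aborted and deleted under the lock, return without using `this`.
        if (handle->canceled) {
            return;
        }
        ++_repl->timeoutsHandled;  // not canceled, so `this` is still alive
    };
}

// The callback runs after the state was aborted and deleted under the lock.
int runLateCancelDemo() {
    Coordinator repl;
    repl.catchupState = std::make_unique<CatchupState>(&repl);
    auto callback = repl.catchupState->makeTimeoutCallback();
    {
        std::lock_guard<std::mutex> lk(repl.mutex);
        repl.catchupState->abort_inlock();
        repl.catchupState.reset();  // CatchupState is deleted here
    }
    assert(!repl.catchupState);
    callback();  // late callback: locks the surviving mutex, sees the cancel
    return repl.timeoutsHandled;
}

// The callback runs while the state is still alive and not canceled.
int runNormalTimeoutDemo() {
    Coordinator repl;
    repl.catchupState = std::make_unique<CatchupState>(&repl);
    auto callback = repl.catchupState->makeTimeoutCallback();
    callback();
    return repl.timeoutsHandled;
}
```

      In this sketch the cancellation flag lives in a shared handle rather than in the CatchupState, so the callback can still read it after the state is gone. That is an assumption of the illustration; the ticket only requires that the check happen under the lock.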
