  1. Core Server
  2. SERVER-29382

Fix cancel race related to the timeout in ReplicationCoordinatorImpl::CatchupState

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.4.6, 3.5.9
    • Component/s: Replication
    • Labels:
      None

      Description

      If ReplicationCoordinatorImpl::CatchupState::abort_inlock() is called while the cancel callback lambda is in progress, the CatchupState may be deleted out from under the callback.

      We can fix this by making two changes:
      1) Capture the mutex reference directly in the callback lambda rather than accessing it via this->_repl. The mutex belongs to the ReplicationCoordinatorImpl, not the CatchupState, so it survives deletion of the CatchupState.

      2) Check for callback cancellation only after acquiring the mutex. A late cancel then still takes effect, provided it is issued while the lock is held, which is always the case because CatchupState may only be accessed under the lock.
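
The two changes can be illustrated with a minimal, single-threaded sketch. It is not the actual server code: Coordinator, CallbackArgs, abortCatchup_inlock, timeoutsHandled, and runScenario are hypothetical stand-ins chosen for illustration, and the real executor callback types are not modeled. The lambda captures the coordinator's mutex by reference instead of reaching it through this, and it inspects the canceled flag only after taking the lock.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the executor's callback arguments.
struct CallbackArgs {
    bool canceled = false;  // set when the scheduled callback is canceled
};

struct Coordinator;

// Hypothetical stand-in for CatchupState.
class CatchupState {
public:
    explicit CatchupState(Coordinator* repl) : _repl(repl) {}
    std::function<void(const CallbackArgs&)> makeTimeoutCallback();

private:
    Coordinator* _repl;
};

// Hypothetical stand-in for ReplicationCoordinatorImpl.
struct Coordinator {
    std::mutex mutex;  // owned here, so it outlives any CatchupState
    std::unique_ptr<CatchupState> catchupState;
    CallbackArgs timeoutArgs;
    int timeoutsHandled = 0;

    // Caller must hold `mutex`. Cancels the timeout and deletes the state.
    void abortCatchup_inlock() {
        timeoutArgs.canceled = true;
        catchupState.reset();
    }
};

std::function<void(const CallbackArgs&)> CatchupState::makeTimeoutCallback() {
    // Change 1: bind the mutex reference when the lambda is created, so
    // locking never dereferences `this` (which may already be deleted).
    return [this, &mutex = _repl->mutex](const CallbackArgs& args) {
        std::lock_guard<std::mutex> lk(mutex);
        // Change 2: check cancellation only after acquiring the lock.
        if (args.canceled) {
            return;  // `this` may be dangling here; do not touch it
        }
        // Not canceled under the lock, so the state is still alive.
        assert(_repl->catchupState.get() == this);
        ++_repl->timeoutsHandled;
    };
}

// Runs the callback once, optionally aborting (cancel + delete) first.
// Returns how many timeouts were actually handled.
int runScenario(bool abortBeforeCallback) {
    Coordinator repl;
    repl.catchupState = std::make_unique<CatchupState>(&repl);
    auto callback = repl.catchupState->makeTimeoutCallback();
    if (abortBeforeCallback) {
        std::lock_guard<std::mutex> lk(repl.mutex);
        repl.abortCatchup_inlock();
    }
    callback(repl.timeoutArgs);
    return repl.timeoutsHandled;
}
```

In this sketch, runScenario(true) runs the callback after the state has been deleted: the callback locks the captured mutex, sees the cancellation, and returns without dereferencing this. runScenario(false) takes the normal timeout path. The sketch relies on the same invariant the description states, namely that cancellation and deletion happen only while the mutex is held.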

            People

            Assignee:
            siyuan.zhou Siyuan Zhou
            Reporter:
            matthew.russotto Matthew Russotto
