Core Server / SERVER-39092

ReplicationStateTransitionLockGuard should be resilient to exceptions thrown before waitForLockUntil()


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1.9
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Sprint:
      Repl 2019-02-25, Repl 2019-03-11
    • Linked BF Score:
      58

      Description

In the ReplicationStateTransitionLockGuard destructor, we have an invariant that checks that the lock result is not LOCK_WAITING before unlocking the RSTL. But for the valid event sequence below, the destructor can be called with _result still set to LOCK_WAITING when it goes to unlock the RSTL.

1) Thread A issues a stepdown command (triggered either by a heartbeat or by a user).

2) Thread B issues a conditional stepdown triggered by the user.

3) Thread A marks thread B as killed. (EDIT: One stepdown thread cannot mark another stepdown thread as killed, because during stepdown we currently kill only user operations that have taken the global lock in X, IX, or IS mode. A stepdown thread only tries to take the RSTL in X mode, not the global lock.)

4) Thread A acquires the RSTL in X mode.

5) Thread B enqueues its RSTL request and sets _result to LOCK_WAITING.

6) Thread B calls ReplicationStateTransitionLockGuard::waitForLockUntil with a non-zero timeout.

7) Thread B's wait for the RSTL is interrupted (or times out), so the ReplicationStateTransitionLockGuard destructor runs with _result still LOCK_WAITING, which trips the invariant. (EDIT: Thread B can time out waiting for the lock.)

Note: There is no need to worry that the RSTL lock state won't be cleaned up, because the unlockOnErrorGuard in LockerImpl::lockComplete cleans up the state in the lock manager and in the locker on any failed lock attempt. Effectively, by the time we reach the ReplicationStateTransitionLockGuard destructor in the scenario above, there is nothing left to clean up.
