Core Server / SERVER-39092

ReplicationStateTransitionLockGuard should be resilient to exceptions thrown before waitForLockUntil()


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.1.9
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Sprint:
      Repl 2019-02-25, Repl 2019-03-11
    • Linked BF Score:
      58

      Description

In the ReplicationStateTransitionLockGuard destructor, we have an invariant that checks that the lock result is not LOCK_WAITING before unlocking the RSTL. But for the valid event sequence below, the destructor can be called with _result still set to LOCK_WAITING when it goes to unlock the RSTL.

1) Thread A issues a stepdown command (triggered either by a heartbeat or by a user).

2) Thread B issues a conditional stepdown triggered by the user.

3) Thread A marks thread B as killed. (EDIT: One stepdown thread cannot mark another stepdown thread as killed, because during stepdown we currently kill only user operations that have taken the global lock in X, IX, or IS mode. A stepdown thread only tries to take the RSTL in X mode, not the global lock.)

4) Thread A acquires the RSTL in X mode.

5) Thread B enqueues its RSTL request and sets _result to LOCK_WAITING.

6) Thread B calls ReplicationStateTransitionLockGuard::waitForLockUntil with a non-zero timeout.

7) Thread B's wait for the RSTL is interrupted (or times out), so the ReplicationStateTransitionLockGuard destructor runs with _result still LOCK_WAITING, which trips the invariant. (EDIT: Thread B can time out waiting for the lock.)

Note: There is no need to worry that the RSTL lock state won't be cleaned up, because the unlockOnErrorGuard in LockerImpl::lockComplete cleans up the state in the lock manager and in the locker on any failed lock attempt. Effectively, by the time we reach the ReplicationStateTransitionLockGuard destructor in the scenario above, there is nothing left to clean up.
