Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Works as Designed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Replication
Labels:
None

Assigned Teams:

Replication
Operating System:
ALL
Linked BF Score:
0
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

There is a race condition that exists when checking for callback cancellation in ReplicationCoordinator (see _doMemberHeartbeat, for example). We check if the callback status is equal to CallbackCancelled, but this is not necessarily sufficient to see if 'cancel' was actually called on the callback handle already.

Let’s say a callback, with CallbackState C, gets scheduled on the TaskExecutor, to be run at some time in the future. When it is time for that task to run, the ThreadPoolTaskExecutor::runCallback function will be run for C. The ‘runCallback’ function includes the following steps:

      1. If C.canceled==1, then set the status of CallbackArgs, ‘args’, for C to ErrorCodes::CallbackCancelled.
      2. Run task C with CallbackArgs ‘args’

The problem arises when there is a separate thread that tries to call C.cancel() after step 1 has completed and before step 2 has started. The ThreadPoolTaskExecutor::cancel() method atomically changes C.canceled to a value of 1. But, if at the time of step 1’s execution, C.canceled==0, then the CallbackArgs status will not be set to ErrorCodes::CallbackCancelled. For example, consider the following scenario, with two separate threads:

      Initial State: C.canceled=0
      1. [runCallbackThread] If C.canceled==1, then set the status of CallbackArgs, ‘args’, for C to ErrorCodes::CallbackCancelled.
      2. [otherThread] Set C.canceled to 1
      3. [runCallbackThread] Run C with ‘args’

This means that in the body of the callback C, checking if cbData.status==ErrorCodes::CallbackCanceled (as we do in ReplicationCoordinatorImpl::_doMemberHeartbeat, for example) is not sufficient, since it might indicate an OK status even if C.cancel() had already been called on the callback. The C.isCanceled() method is what should be checked to see if cancel() was actually called on C, since it checks the value of C.canceled directly.

Cancellation is a tricky issue, but I think we could, generally, simplify how we check for cancellation to make our lives easier. The definition of a callback being “cancelled” should simply be whether or not the “cancel” method has been called on it some time in the past. To be more precise, it should be defined as the state of the “cancelled” field of CallbackState. Since this field is an atomic integer, all reads and writes to it are serialized, so there will never be an ambiguity about the state of that variable, regardless of how many threads try to access it. There is still the remaining issue of how we choose to make decisions based on that state, but that is a general problem when accessing state that may be read or modified by multiple threads.

It would probably be prudent to review how we check for cancellation throughout the ReplicationCoordinator, to ensure that this issue does not appear in multiple locations.

related to

SERVER-22153 Cancelling heartbeats doesn't work after network operation scheduled

Closed

SERVER-23082 race in ThreadPoolTaskExecutor between scheduleWorkAt() and cancel()

Closed

Assignee:: [DO NOT USE] Backlog - Replication Team
Reporter:: Will Schultz
Participants:: [DO NOT USE] Backlog - Replication Team, Andy Schwerin, Will Schultz
Votes:: 0 Vote for this issue
Watchers:: 4 Start watching this issue

Created:: Jun 16 2017 10:17:05 PM UTC
Updated:: Oct 27 2023 01:54:24 PM UTC
Resolved:: Jun 29 2017 07:49:53 PM UTC

Details

Description

Attachments

Issue Links

Forms

Activity

People

Dates