[SERVER-22153] Cancelling heartbeats doesn't work after network operation scheduled Created: 12/Jan/16  Updated: 16/Jun/17  Resolved: 23/Mar/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Scott Hernandez (Inactive) Assignee: William Schultz (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-22523 investigate liveness timeout resetting Closed
Related
related to SERVER-21795 Do not reschedule more than one liven... Closed
is related to SERVER-29703 Race condition when checking for call... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4, v3.2
Sprint: Repl 2017-02-13, Repl 2017-03-06, Repl 2017-03-27
Participants:
Linked BF Score: 0

 Description   

When a heartbeat is cancelled after it has scheduled its network operation, the cancel doesn't work correctly from the client's perspective, because the handle that gets cancelled is not the handle for the network operation. This is due to how we schedule work: the scheduled work in turn schedules the network operation, whose callback then runs (and schedules the next heartbeat).

One idea to fix this would be to keep a CancellableHandle which gets replaced with the newly scheduled network event so that a client can cancel the logical operation at any time.
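
A minimal sketch of that idea, for illustration only. StageHandle, replace() and cancelReachesNetworkStage() are hypothetical names invented here and are not part of the server code; the real TaskExecutor::CallbackHandle is replaced by a plain cancel function so that the example is self-contained.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <utility>

// Hypothetical stand-in for an executor callback handle: something that can be cancelled.
struct StageHandle {
    std::function<void()> cancelFn;
};

// Hypothetical handle for the whole logical operation. It always refers to the
// stage that is currently pending (scheduled work first, network operation later).
class CancellableHandle {
public:
    // Called whenever the operation moves to a new stage. If the logical operation
    // was already cancelled, the new stage is cancelled at once and false is returned.
    bool replace(StageHandle next) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            if (!_cancelled) {
                _current = std::move(next);
                return true;
            }
        }
        if (next.cancelFn) next.cancelFn();
        return false;
    }

    // Cancels whichever stage is pending right now. The stage's cancel function
    // is invoked outside the lock so that it may safely call back into this object.
    void cancel() {
        std::function<void()> fn;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            if (_cancelled) return;
            _cancelled = true;
            fn = _current.cancelFn;
        }
        if (fn) fn();
    }

private:
    std::mutex _mutex;
    bool _cancelled = false;
    StageHandle _current;
};

// Walks through the two stages from the description and reports whether a cancel
// issued after the network stage was scheduled actually reaches that stage.
bool cancelReachesNetworkStage() {
    CancellableHandle logical;
    bool workCancelled = false;
    bool networkCancelled = false;

    logical.replace(StageHandle{[&] { workCancelled = true; }});     // scheduled work
    logical.replace(StageHandle{[&] { networkCancelled = true; }});  // work ran, network op scheduled
    logical.cancel();                                                // client cancels the logical op

    assert(!workCancelled);  // the stale work handle is no longer the cancel target
    return networkCancelled;
}
```

With this shape the client holds only the CancellableHandle, so it does not need to know which stage the heartbeat has reached when it cancels.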



 Comments   
Comment by William Schultz (Inactive) [ 20/Mar/17 ]

In both _scheduleMemberHeartbeatToTarget and _doMemberHeartbeat, the handles returned by scheduleWorkAt and scheduleRemoteCommand are tracked by _trackHeartbeatHandle.

The behavior of ReplicationExecutor::scheduleRemoteCommand is specified by the TaskExecutor interface as:

    /**
     * Schedules "cb" to be run by the executor with the result of executing the remote command
     * described by "request".
     *
     * Returns a handle for waiting on or canceling the callback, or
     * ErrorCodes::ShutdownInProgress.
     *
     * May be called by client threads or callbacks running in the executor.
     */
    virtual StatusWith<CallbackHandle> scheduleRemoteCommand(const RemoteCommandRequest& request, const RemoteCommandCallbackFn& cb) = 0;

and ReplicationExecutor::scheduleWorkAt as:

    /**
     * Schedules "work" to be run by the executor no sooner than "when".
     *
     * Returns a handle for waiting on or canceling the callback, or
     * ErrorCodes::ShutdownInProgress.
     *
     * May be called by client threads or callbacks running in the executor.
     */
    virtual StatusWith<CallbackHandle> scheduleWorkAt(Date_t when, const CallbackFn& work) = 0;

We cancel heartbeats by calling cancel on the handles returned by these functions in _cancelHeartbeats_inlock. If these cancel operations do not behave the way the TaskExecutor interface specifies, that should be filed as an issue with the ReplicationExecutor implementation. At the ReplicationCoordinator layer, however, the tracking of all heartbeat handles seems to be done correctly.
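
As a rough illustration of the track-then-cancel pattern described above (not the actual ReplicationCoordinator code): FakeHandle and HeartbeatHandleTracker are hypothetical names, and leaving cancelled handles tracked until their callbacks untrack them is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for TaskExecutor::CallbackHandle. Copies share one
// cancellation flag, so cancelling any copy is visible through all of them.
struct FakeHandle {
    std::shared_ptr<bool> cancelled = std::make_shared<bool>(false);
    void cancel() const { *cancelled = true; }
    bool isCancelled() const { return *cancelled; }
    bool operator==(const FakeHandle& other) const { return cancelled == other.cancelled; }
};

// Hypothetical tracker: every handle returned by a scheduling call is tracked,
// and cancelling heartbeats means cancelling every tracked handle.
class HeartbeatHandleTracker {
public:
    void track(const FakeHandle& handle) { _handles.push_back(handle); }

    void untrack(const FakeHandle& handle) {
        auto it = std::find(_handles.begin(), _handles.end(), handle);
        if (it != _handles.end()) _handles.erase(it);
    }

    // Cancels all tracked handles. In this sketch they stay tracked afterwards,
    // on the assumption that each callback untracks its own handle when it runs.
    void cancelAll() {
        for (const auto& handle : _handles) handle.cancel();
    }

    std::size_t size() const { return _handles.size(); }

private:
    std::vector<FakeHandle> _handles;
};
```

Under this pattern a cancel only reaches handles that are tracked at the moment cancelAll() runs, which is why both the scheduleWorkAt handle and the scheduleRemoteCommand handle have to be tracked.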

Comment by Judah Schvimer [ 20/Jan/17 ]

After investigating this, it is unclear what this ticket is referring to. The network operation's handle is tracked here and then cancelled when a user calls cancel. Following through the callbacks, it seems all of the handles are tracked and untracked correctly and all are done under the _topoMutex. I think _cancelHeartbeats_inlock() is called without the _topoMutex being held here which could be a problem, but that's not what this ticket is referring to.

I'm putting this back on the backlog for future investigation.
