[SERVER-54739] Race in ServerDiscoveryMonitor::requestImmediateCheck could lead to multiple outstanding exhaust requests Created: 23/Feb/21  Updated: 06/Dec/22  Resolved: 25/Feb/21

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 4.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthew Saltz (Inactive) Assignee: Backlog - Service Architecture
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-54738 Calls to ServerDiscoveryMonitor::requ... Backlog
Assigned Teams:
Service Arch
Operating System: ALL
Participants:
Case:

 Description   

In some cases, a call to ServerDiscoveryMonitor::requestImmediateCheck attempts to cancel the outstanding hello request and schedule a new one.

Cancelling the previous request relies on a CallbackHandle object, but that handle is only set once the request has actually been scheduled, and the scheduling itself happens after a delay.

This means the following sequence of events could occur:

  1. A call to ServerDiscoveryMonitor::requestImmediateCheck occurs, which calls _scheduleNextHello.
  2. The task it schedules to run the exhaust command stalls before actually scheduling the command and setting the callback handle.
  3. A new call to requestImmediateCheck arrives. It tries to cancel the outstanding request, but the callback handle has not been set yet, so nothing is cancelled.
  4. The first request continues and sets the callback handle, which may or may not overwrite the existing one. Either way, we end up with two concurrent exhaust commands running for the same host, one of which is no longer tracked or cancellable.
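
The sequence above can be sketched as a small, deterministic simulation. This is not the server code: ToyExecutor, ToyMonitor, startExhaust and outstandingExhaustsAfterRace are hypothetical names invented for illustration, and the real ServerDiscoveryMonitor and task executor are only loosely modelled. The assumption is simply that the handle is stored by a deferred task rather than at the time requestImmediateCheck is called.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <optional>
#include <set>
#include <utility>

// Hypothetical stand-in for a task executor. Posted tasks run only when
// runOne() is called, so the problematic interleaving is deterministic.
class ToyExecutor {
public:
    using Handle = int;

    void post(std::function<void()> task) {
        _tasks.push_back(std::move(task));
    }

    // Runs the oldest pending task, if any.
    bool runOne() {
        if (_tasks.empty()) return false;
        auto task = std::move(_tasks.front());
        _tasks.pop_front();
        task();
        return true;
    }

    // Models starting an exhaust hello; the returned handle can cancel it.
    Handle startExhaust() {
        Handle h = _nextHandle++;
        _active.insert(h);
        return h;
    }

    void cancel(Handle h) { _active.erase(h); }

    std::size_t activeExhaustCount() const { return _active.size(); }

private:
    std::deque<std::function<void()>> _tasks;
    std::set<Handle> _active;
    Handle _nextHandle = 1;
};

// Hypothetical, heavily simplified monitor with the pattern from the ticket.
class ToyMonitor {
public:
    explicit ToyMonitor(ToyExecutor& exec) : _exec(exec) {}

    void requestImmediateCheck() {
        // Only a request whose handle is already stored can be cancelled.
        if (_handle) {
            _exec.cancel(*_handle);
            _handle.reset();
        }
        scheduleNextHello();
    }

private:
    void scheduleNextHello() {
        // The handle is stored later, when the deferred task finally runs.
        _exec.post([this] { _handle = _exec.startExhaust(); });
    }

    ToyExecutor& _exec;
    std::optional<ToyExecutor::Handle> _handle;
};

// Replays steps 1-4 and returns how many exhaust requests are still active.
std::size_t outstandingExhaustsAfterRace() {
    ToyExecutor exec;
    ToyMonitor monitor(exec);

    monitor.requestImmediateCheck();  // Step 1: a task is posted.
    // Step 2: the task has not run yet, so no handle is stored.
    monitor.requestImmediateCheck();  // Step 3: nothing to cancel.
    assert(exec.activeExhaustCount() == 0);

    while (exec.runOne()) {
    }  // Step 4: both deferred tasks run and each starts an exhaust request.
    return exec.activeExhaustCount();
}
```

In this model, outstandingExhaustsAfterRace() returns 2: the second deferred task overwrites the stored handle, so the first exhaust request stays active but can no longer be cancelled through the monitor.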
