Deadlock in topology worker shutdown when SDAM monitor is streaming


    • Type: Bug
    • Resolution: Fixed
    • Priority: Unknown
    • Fix Version/s: 3.6.0
    • Affects Version/s: None
    • Component/s: None
    • Rust Drivers
    • Not Needed

      1. What would you like to communicate to the user about this feature?
      2. Would you like the user to see examples of the syntax and/or executable code and its output?
      3. Which versions of the driver/connector does this apply to?


      Summary

      Client::shutdown() can deadlock when an SDAM monitor is running in streaming mode. The topology worker's shutdown sequence waits for the monitor to exit, while the monitor is blocked waiting for the topology worker to acknowledge an update.

      Reproduction

      The bug requires the monitor to complete an awaitable hello and call topology_updater.update() in the narrow window between the topology worker exiting its processing loop and calling close_monitor(). This is difficult to trigger in pure Rust, but it has been reproduced when the driver is used via FFI with a shared tokio runtime (e.g., a Java driver wrapping the Rust driver via Panama), and in pure Rust by injecting a std::thread::sleep to widen the window.

      Root Cause

      When Client::shutdown() is called:

      1. The TopologyWorker receives Broadcast(Shutdown), broadcasts the shutdown to its connection pools, and breaks out of its event-processing loop.

      2. It enters the cleanup sequence: it drops the publisher, then calls close_monitor().await on each server's MonitorManager.

      3. close_monitor() drops the WorkerHandle, sends CancellationReason::ServerClosed, and awaits cancellation_sender.closed(), i.e., waits for the monitor task to fully exit.

      4. Meanwhile, the monitor (running in streaming mode) may have just received a response to its awaitable hello. It processes the reply and calls self.topology_updater.update(server_description).await, which sends an UpdateMessage on the unbounded channel and awaits acknowledgment via a oneshot.

      5. The topology worker's update_receiver is still alive (it is a field on the worker struct), so the send succeeds and the message is buffered. But the worker has already exited its loop and will never recv from the channel again, so the oneshot acknowledgment never arrives.
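      The channel semantics in play here can be seen in isolation. This is a toy sketch using tokio's unbounded channel, not driver code: a send succeeds and buffers as long as the receiver is alive, even if the receiver is never polled again.

      ```rust
      use tokio::sync::mpsc;

      fn main() {
          let (tx, rx) = mpsc::unbounded_channel::<&str>();
          // Receiver alive but never polled: the send still succeeds and the
          // message sits in the buffer indefinitely.
          assert!(tx.send("update").is_ok());
          // Only once the receiver is dropped do sends start failing.
          drop(rx);
          assert!(tx.send("update").is_err());
          println!("ok");
      }
      ```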

      Circular wait:

      • The topology worker is in close_monitor().await, waiting for the monitor's cancellation receiver to be dropped (i.e., the monitor task to exit).
      • The monitor task is in topology_updater.update().await, waiting for the topology worker to acknowledge its message.

      Neither can make progress.
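      The circular wait can be modeled in miniature with plain tokio primitives. This is an illustrative sketch, not the driver's actual code (names like monitor and update_tx are made up for the example); a timeout stands in for the real hang.

      ```rust
      // Minimal model of the circular wait: a "worker" holds the update
      // receiver but has stopped polling it, then waits for the "monitor"
      // task to exit; the monitor awaits a oneshot ack that is never sent.
      use tokio::sync::{mpsc, oneshot};
      use tokio::time::{timeout, Duration};

      #[tokio::main(flavor = "current_thread")]
      async fn main() {
          let (update_tx, update_rx) = mpsc::unbounded_channel::<oneshot::Sender<()>>();

          // Monitor: sends its update, then blocks awaiting acknowledgment.
          let monitor = tokio::spawn(async move {
              let (ack_tx, ack_rx) = oneshot::channel();
              update_tx.send(ack_tx).unwrap(); // receiver alive: send succeeds and buffers
              ack_rx.await // never acknowledged -- the worker stopped polling
          });

          // Worker: still owns the receiver (a field on its struct) but has
          // exited its processing loop; it now waits for the monitor to finish.
          let _update_receiver = update_rx;
          let result = timeout(Duration::from_millis(100), monitor).await;
          assert!(result.is_err(), "neither task makes progress");
          println!("deadlock reproduced in miniature");
      }
      ```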

      Fix

      Drop self.update_receiver after breaking out of the processing loop, before calling close_monitor(). This causes the UnboundedSender::send() in the monitor's update() call to return Err(receiver gone), so send_message() returns false immediately. The monitor proceeds, checks is_alive() (which returns false because close_monitor dropped the handle), and exits.

      Dropping (rather than just closing) the receiver also drops any already-buffered messages, which drops their oneshot senders, unblocking any monitors that already sent an update and are waiting for acknowledgment.
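      Both effects of dropping the receiver can be checked with a small tokio sketch (illustrative names, not the driver's code): a monitor already blocked on its oneshot is released, and any later send fails fast.

      ```rust
      use tokio::sync::{mpsc, oneshot};

      #[tokio::main(flavor = "current_thread")]
      async fn main() {
          let (update_tx, update_rx) = mpsc::unbounded_channel::<oneshot::Sender<()>>();

          // A monitor that already sent its update and awaits acknowledgment.
          let (ack_tx, ack_rx) = oneshot::channel();
          update_tx.send(ack_tx).unwrap(); // buffered; the worker never recvs it
          let blocked_monitor = tokio::spawn(async move { ack_rx.await });

          // The fix: drop the receiver before close_monitor().
          drop(update_rx);

          // Any subsequent send fails immediately instead of buffering...
          let (tx2, _rx2) = oneshot::channel::<()>();
          assert!(update_tx.send(tx2).is_err());

          // ...and the buffered message was dropped with the receiver, which
          // dropped its oneshot sender, so the blocked monitor wakes with an
          // error instead of waiting forever.
          assert!(blocked_monitor.await.unwrap().is_err());
          println!("monitor unblocked; shutdown can proceed");
      }
      ```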

      Behavioral Impact

      Minimal. During the brief shutdown window, a monitor's final ServerDescriptionChanged SDAM event may be dropped (the topology description won't be updated from that last heartbeat). All shutdown lifecycle events (ServerClosed, TopologyDescriptionChanged, TopologyClosed) are unaffected — they are emitted directly in the shutdown sequence, not through the update channel.

            Assignee:
            Abraham Egnor
            Reporter:
            Jeffrey Yemin