C# Driver / CSHARP-3302

SDAM deadlock when invalidating former primary

      Consider a driver connected to a replica set with the following cluster topology:

      Node1: primary; Node2: secondary; Node3: secondary

      An election takes place and Node2 is the new primary, but the driver does not know this yet. If the driver receives heartbeats from Node1 and Node2 at the same time, SDAM can deadlock as follows.

      Processing of heartbeat from Node2 (New Primary)

      The driver receives a heartbeat from Node2 (the new primary) in ServerMonitor.HeartbeatAsync and proceeds to process it:

      • Line 389 of ServerMonitor.cs calls SetDescription(newDescription) while holding the ServerMonitor._lock for Node2.
      • SetDescription raises the OnDescriptionChanged event.
      • OnDescriptionChanged executes MultiServerCluster.ServerDescriptionChangedHandler, which calls MultiServerCluster.ProcessServerDescriptionChanged.
      • MultiServerCluster._updateClusterDescriptionLock is acquired and held on line 298 of MultiServerCluster.cs (in MultiServerCluster.ProcessServerDescriptionChanged).
      • Line 333 calls ProcessReplicaSetChange.
      • The MultiServerCluster._serversLock is acquired in ProcessReplicaSetChange (line 408).
      • The Server object representing Node1 is invalidated (line 411).
      • Server.Invalidate calls ServerMonitor.RequestHeartbeat for Node1.
      • ServerMonitor.RequestHeartbeat blocks trying to acquire ServerMonitor._lock for Node1.

      Thus while processing the heartbeat from Node2 (the new primary), this thread holds:

      • ServerMonitor._lock for Node2
      • MultiServerCluster._updateClusterDescriptionLock for the entire replset
      • MultiServerCluster._serversLock for the entire replset

      And the thread is attempting to acquire the ServerMonitor._lock for Node1 (the former primary).

      Processing of heartbeat from Node1 (Old Primary)

      The driver receives a heartbeat from Node1 (the old primary) in ServerMonitor.HeartbeatAsync and proceeds to process it:

      • Line 389 of ServerMonitor.cs calls SetDescription(newDescription) while holding the ServerMonitor._lock for Node1.
      • SetDescription raises the OnDescriptionChanged event.
      • OnDescriptionChanged executes MultiServerCluster.ServerDescriptionChangedHandler, which calls MultiServerCluster.ProcessServerDescriptionChanged.
      • Line 298 of MultiServerCluster.cs (in MultiServerCluster.ProcessServerDescriptionChanged) blocks attempting to acquire MultiServerCluster._updateClusterDescriptionLock.

      Thus while processing the heartbeat from Node1 (the old primary), this other thread holds:

      • ServerMonitor._lock for Node1

      And the thread is attempting to acquire the MultiServerCluster._updateClusterDescriptionLock for the entire replset.

      Summary

      1. The thread processing the Node2 heartbeat holds the MultiServerCluster._updateClusterDescriptionLock for the cluster but needs ServerMonitor._lock for Node1.
      2. The thread processing the Node1 heartbeat holds the ServerMonitor._lock for Node1 but needs the MultiServerCluster._updateClusterDescriptionLock for the cluster.

      Neither thread can make forward progress and we find ourselves in a classic deadlock.
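
      The lock-order inversion in the summary can be reproduced in isolation. The following is a minimal sketch, not driver code: the two lock objects are hypothetical stand-ins for ServerMonitor._lock (Node1) and MultiServerCluster._updateClusterDescriptionLock, and the class and method names are invented for illustration. Each thread takes its first lock, waits until the other thread holds its own first lock, and then tries to take the second one. Monitor.TryEnter with a timeout is used in place of a plain lock statement so that the program reports the stall instead of hanging.

```csharp
using System;
using System.Threading;

public static class LockInversionSketch
{
    // Hypothetical stand-ins for the two locks named in the summary; not driver code.
    private static readonly object Node1MonitorLock = new object();
    private static readonly object UpdateClusterDescriptionLock = new object();

    // Holds 'held', then tries to take 'wanted' with a timeout.
    // Returns true only if 'wanted' was acquired.
    private static bool HoldThenTry(object held, object wanted, Barrier barrier, TimeSpan timeout)
    {
        lock (held)
        {
            // Phase 1: wait until both threads hold their first lock.
            barrier.SignalAndWait();

            bool acquired = false;
            try
            {
                Monitor.TryEnter(wanted, timeout, ref acquired);
            }
            finally
            {
                if (acquired) Monitor.Exit(wanted);
            }

            // Phase 2: keep 'held' until the other thread's attempt has finished,
            // so neither attempt can succeed by accident.
            barrier.SignalAndWait();
            return acquired;
        }
    }

    public static (bool node2ThreadProgressed, bool node1ThreadProgressed) Run(TimeSpan timeout)
    {
        using var barrier = new Barrier(2);
        bool node2Result = false, node1Result = false;

        // Plays the thread processing the Node2 heartbeat:
        // holds the cluster lock, wants the Node1 monitor lock.
        var node2Thread = new Thread(() =>
            node2Result = HoldThenTry(UpdateClusterDescriptionLock, Node1MonitorLock, barrier, timeout));

        // Plays the thread processing the Node1 heartbeat:
        // holds the Node1 monitor lock, wants the cluster lock.
        var node1Thread = new Thread(() =>
            node1Result = HoldThenTry(Node1MonitorLock, UpdateClusterDescriptionLock, barrier, timeout));

        node2Thread.Start();
        node1Thread.Start();
        node2Thread.Join();
        node1Thread.Join();
        return (node2Result, node1Result);
    }

    public static void Main()
    {
        var (node2Ok, node1Ok) = Run(TimeSpan.FromMilliseconds(200));
        Console.WriteLine($"Node2 heartbeat thread acquired Node1 monitor lock: {node2Ok}");
        Console.WriteLine($"Node1 heartbeat thread acquired cluster lock: {node1Ok}");
    }
}
```

      Both lines print False: each thread times out waiting for the lock the other one holds. With blocking lock statements, as in the call chains described above, the same interleaving would never return.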

            Assignee:
            boris.dogadov@mongodb.com Boris Dogadov
            Reporter:
            james.kovacs@mongodb.com James Kovacs
