[CSHARP-3302] SDAM deadlock when invalidating former primary Created: 08/Jan/21  Updated: 28/Oct/23  Resolved: 20/Jan/21

Status: Closed
Project: C# Driver
Component/s: Connectivity
Affects Version/s: 2.11.0
Fix Version/s: 2.11.6

Type: Bug
Priority: Critical - P2
Reporter: James Kovacs
Assignee: Boris Dogadov
Resolution: Fixed
Votes: 0
Labels: apm-issue, planned-maintenance-detectable-bug
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by CSHARP-3307 Deadlock in ServerMonitor Closed

 Description   

Consider a driver connected to a replica set with the following cluster topology:

Node1: primary; Node2: secondary; Node3: secondary

An election takes place and Node2 is the new primary, but the driver does not know this yet. If the driver receives heartbeats from Node1 and Node2 at the same time, SDAM can deadlock as follows.

Processing of heartbeat from Node2 (New Primary)

The driver receives a heartbeat from Node2 (the new Primary) in ServerMonitor.HeartbeatAsync and proceeds to process it...

  • On line 389 of ServerMonitor.cs, we call SetDescription(newDescription) while holding the ServerMonitor._lock for Node2.
  • SetDescription raises the OnDescriptionChanged event.
  • OnDescriptionChanged executes MultiServerCluster.ServerDescriptionChangedHandler, which calls MultiServerCluster.ProcessServerDescriptionChanged.
  • MultiServerCluster._updateClusterDescriptionLock is acquired and held on line 298 of MultiServerCluster.cs (in MultiServerCluster.ProcessServerDescriptionChanged).
  • Line 333 calls ProcessReplicaSetChange.
  • The MultiServerCluster._serversLock is acquired in ProcessReplicaSetChange (line 408).
  • The Server object representing Node1 is invalidated (line 411).
  • Server.Invalidate calls ServerMonitor.RequestHeartbeat for Node1.
  • ServerMonitor.RequestHeartbeat blocks trying to acquire ServerMonitor._lock for Node1.

Thus while processing the heartbeat from Node2 (the new primary), this thread holds:

  • ServerMonitor._lock for Node2
  • MultiServerCluster._updateClusterDescriptionLock for the entire replset
  • MultiServerCluster._serversLock for the entire replset

And the thread is attempting to acquire the ServerMonitor._lock for Node1 (the former primary).

Processing of heartbeat from Node1 (Old Primary)

The driver receives a heartbeat from Node1 (the old Primary) in ServerMonitor.HeartbeatAsync and proceeds to process it...

  • On line 389 of ServerMonitor.cs, we call SetDescription(newDescription) while holding the ServerMonitor._lock for Node1.
  • SetDescription raises the OnDescriptionChanged event.
  • OnDescriptionChanged executes MultiServerCluster.ServerDescriptionChangedHandler, which calls MultiServerCluster.ProcessServerDescriptionChanged.
  • Line 298 of MultiServerCluster.cs (in MultiServerCluster.ProcessServerDescriptionChanged) blocks attempting to acquire MultiServerCluster._updateClusterDescriptionLock.

Thus while processing the heartbeat from Node1 (the old primary), this other thread holds:

  • ServerMonitor._lock for Node1

And the thread is attempting to acquire the MultiServerCluster._updateClusterDescriptionLock for the entire replset.
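
The two lock sequences above can be reproduced in miniature. The following is a minimal sketch in Python (not the driver's C# code), assuming plain mutexes stand in for ServerMonitor._lock, _updateClusterDescriptionLock, and _serversLock. The function and variable names are hypothetical. Acquire timeouts are used only so the demonstration terminates; in the real deadlock both waits would be unbounded.

```python
import threading

def reproduce_lock_inversion(timeout=0.2):
    """Replay the two heartbeat paths with stand-in locks (hypothetical names)."""
    monitor_lock = {"Node1": threading.Lock(), "Node2": threading.Lock()}
    update_cluster_description_lock = threading.Lock()
    servers_lock = threading.Lock()
    barrier = threading.Barrier(2)  # forces the unlucky interleaving every run
    blocked = {}

    def process_node2_heartbeat():
        # Holds its own monitor lock, then both cluster-level locks.
        with monitor_lock["Node2"], update_cluster_description_lock, servers_lock:
            barrier.wait()  # both threads now hold their initial locks
            # Server.Invalidate -> RequestHeartbeat needs Node1's monitor lock.
            got = monitor_lock["Node1"].acquire(timeout=timeout)
            blocked["Node2 thread"] = not got
            barrier.wait()  # keep holding until the other attempt has finished
            if got:
                monitor_lock["Node1"].release()

    def process_node1_heartbeat():
        # Holds only its own monitor lock, then needs the cluster-level lock.
        with monitor_lock["Node1"]:
            barrier.wait()
            got = update_cluster_description_lock.acquire(timeout=timeout)
            blocked["Node1 thread"] = not got
            barrier.wait()
            if got:
                update_cluster_description_lock.release()

    threads = [
        threading.Thread(target=process_node2_heartbeat, daemon=True),
        threading.Thread(target=process_node1_heartbeat, daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return blocked

if __name__ == "__main__":
    print(reproduce_lock_inversion())
```

Each thread reports that it could not obtain the lock the other one holds, which is the circular wait described here. The barrier exists only to make the interleaving deterministic; in production the same interleaving depends on heartbeat timing.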

Summary

1. The thread processing the Node2 heartbeat holds the MultiServerCluster._updateClusterDescriptionLock for the cluster but needs ServerMonitor._lock for Node1.
2. The thread processing the Node1 heartbeat holds the ServerMonitor._lock for Node1 but needs the MultiServerCluster._updateClusterDescriptionLock for the cluster.

Neither thread can make forward progress and we find ourselves in a classic deadlock.
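
The commit message recorded on this issue states that the lock was eliminated at ServerMonitor.RequestHeartbeat. The sketch below, in Python rather than the driver's C#, illustrates why removing that one acquisition breaks the cycle. It is not the driver's actual implementation: the Monitor class, its members, and the use of a thread-safe event flag to signal the request are all assumptions made for illustration.

```python
import threading

class Monitor:
    """Hypothetical stand-in for ServerMonitor; not the driver's implementation."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()  # still guards description updates
        self.heartbeat_requested = threading.Event()

    def request_heartbeat(self):
        # No lock is taken: setting an Event is thread-safe on its own, so a
        # caller that holds cluster-level locks can never block here.
        self.heartbeat_requested.set()

def run_with_lock_free_request(timeout=2.0):
    node1, node2 = Monitor("Node1"), Monitor("Node2")
    update_cluster_description_lock = threading.Lock()
    servers_lock = threading.Lock()
    barrier = threading.Barrier(2)  # same interleaving as the deadlock scenario
    finished = []

    def process_node2_heartbeat():
        with node2.lock, update_cluster_description_lock, servers_lock:
            barrier.wait()
            node1.request_heartbeat()  # invalidating the former primary
        finished.append("Node2 thread")

    def process_node1_heartbeat():
        with node1.lock:
            barrier.wait()
            # Waits only until the Node2 thread releases the cluster lock.
            with update_cluster_description_lock:
                pass
        finished.append("Node1 thread")

    threads = [
        threading.Thread(target=process_node2_heartbeat, daemon=True),
        threading.Thread(target=process_node1_heartbeat, daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout)
    return sorted(finished), node1.heartbeat_requested.is_set()

if __name__ == "__main__":
    print(run_with_lock_free_request())
```

With the request path no longer touching Node1's monitor lock, the Node2 thread always runs to completion and releases the cluster-level locks, so the Node1 thread's wait is bounded. The lock ordering for the remaining acquisitions is unchanged in this sketch.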



 Comments   
Comment by Aristarkh Zagorodnikov [ 20/Jan/21 ]

Excellent news, thanks for fixing this!

Comment by Githook User [ 20/Jan/21 ]

Author: Boris (BorisDog) <boris.dogadov@mongodb.com>

Message: CSHARP-3302: Backporting compilation fix
Branch: v2.11.x
https://github.com/mongodb/mongo-csharp-driver/commit/03015ffc30a08fe522ba0f32cdd5efc497e05e19

Comment by Githook User [ 20/Jan/21 ]

Author: Boris (BorisDog) <boris.dogadov@mongodb.com>

Message: CSHARP-3302: Lock eliminated at ServerMonitor.RequestHeartbeat. Eliminated concurrent Cluster.RapidHeartbeatTimerCallback invocations.
Branch: v2.11.x
https://github.com/mongodb/mongo-csharp-driver/commit/2d9fe31155f6cd3e2ec5576612c80ab3c7325a08

Comment by Githook User [ 19/Jan/21 ]

Author: Boris (BorisDog) <boris.dogadov@mongodb.com>

Message: CSHARP-3302: Lock eliminated at ServerMonitor.RequestHeartbeat. Eliminated concurrent Cluster.RapidHeartbeatTimerCallback invocations.
Branch: master
https://github.com/mongodb/mongo-csharp-driver/commit/a013eb01df35ecce863445987ead1b76221b3845
