[JAVA-4471] Election in replica set can cause deadlock in monitor threads, leading to unavailability Created: 03/Feb/22  Updated: 28/Oct/23  Resolved: 07/Feb/22

Status: Closed
Project: Java Driver
Component/s: Cluster Management
Affects Version/s: 4.4.0
Fix Version/s: 4.4.2

Type: Bug Priority: Critical - P2
Reporter: Robert Zilke Assignee: Valentin Kavalenka
Resolution: Fixed Votes: 0
Labels: external-user
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2022-02-03-17-15-21-102.png     PNG File image-2022-02-03-17-15-43-755.png     File threaddump-anonymized.tdump    
Issue Links:
Problem/Incident
is caused by JAVA-3928 Connection pool paused state Closed
Case:
Backwards Compatibility: Fully Compatible
Documentation Changes: Not Needed

 Description   

We have discovered blocked threads a couple of times in some of our Java processes shortly after an election in our replica set of 3 MongoDB instances, which makes our services unresponsive:

The thread dump looks similar in all processes with blocked threads.

First blocked thread:

 

cluster-ClusterId{value='61f29e20622d627cc0e99c19', description='null'}-mongodb-instance-3:27017 id=24 state=BLOCKED
    - waiting to lock <0x0b2400cb> (a com.mongodb.internal.connection.MultiServerCluster)
     owned by cluster-ClusterId{value='61f29e20622d627cc0e99c19', description='null'}-mongodb-instance-1:27017 id=20
    at com.mongodb.internal.connection.AbstractMultiServerCluster.onChange(AbstractMultiServerCluster.java:175)
    at com.mongodb.internal.connection.AbstractMultiServerCluster.access$100(AbstractMultiServerCluster.java:50)
    at com.mongodb.internal.connection.AbstractMultiServerCluster$DefaultServerDescriptionChangedListener.serverDescriptionChanged(AbstractMultiServerCluster.java:139)
    at com.mongodb.internal.connection.DefaultSdamServerDescriptionManager.updateDescription(DefaultSdamServerDescriptionManager.java:127)
    at com.mongodb.internal.connection.DefaultSdamServerDescriptionManager.update(DefaultSdamServerDescriptionManager.java:81)
    at com.mongodb.internal.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:165)
    at java.base@11.0.12/java.lang.Thread.run(Thread.java:829)
    Locked synchronizers: count = 1
      - java.util.concurrent.locks.ReentrantLock$NonfairSync@32e56896

Second blocked thread:

 

 

cluster-ClusterId{value='61f29e20622d627cc0e99c19', description='null'}-mongodb-instance-2:27017 id=22 state=BLOCKED
    - waiting to lock <0x0b2400cb> (a com.mongodb.internal.connection.MultiServerCluster)
     owned by cluster-ClusterId{value='61f29e20622d627cc0e99c19', description='null'}-mongodb-instance-1:27017 id=20
    at com.mongodb.internal.connection.AbstractMultiServerCluster.onChange(AbstractMultiServerCluster.java:175)
    at com.mongodb.internal.connection.AbstractMultiServerCluster.access$100(AbstractMultiServerCluster.java:50)
    at com.mongodb.internal.connection.AbstractMultiServerCluster$DefaultServerDescriptionChangedListener.serverDescriptionChanged(AbstractMultiServerCluster.java:139)
    at com.mongodb.internal.connection.DefaultSdamServerDescriptionManager.updateDescription(DefaultSdamServerDescriptionManager.java:127)
    at com.mongodb.internal.connection.DefaultSdamServerDescriptionManager.update(DefaultSdamServerDescriptionManager.java:81)
    at com.mongodb.internal.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:165)
    at java.base@11.0.12/java.lang.Thread.run(Thread.java:829)
    Locked synchronizers: count = 1
      - java.util.concurrent.locks.ReentrantLock$NonfairSync@166cbc02

 

We renamed the hosts in the thread dump but kept the instance numbers.

The only way to fix this problem is to restart our processes.

We have been seeing this problem since we upgraded from MongoDB Driver 4.2.3 (Spring Boot 2.5.6) to MongoDB Driver 4.4.0 (Spring Boot 2.6.2), and later to MongoDB Driver 4.4.1 (Spring Boot 2.6.3).

Our MongoDB Server instances run on version 4.2.2, which should be compatible according to https://docs.mongodb.com/drivers/java/sync/current/compatibility/

How to Reproduce

Unfortunately, we are not able to reproduce this behavior. We just see it at irregular intervals and have to restart our services.

Additional Background

  • We use the synchronous Java Driver
  • Java 11
  • Linux 

We can provide more information if needed.

Thanks for your help!

 



 Comments   
Comment by Githook User [ 07/Feb/22 ]

Author:

{'name': 'Valentin Kovalenko', 'email': 'valentin.kovalenko@mongodb.com', 'username': 'stIncMale'}

Message: Introduce and use `Cluster.withLock` to achieve cluster-wide locking (#872)

This is a backport of https://github.com/mongodb/mongo-java-driver/pull/869.

JAVA-4471
Branch: 4.4.x
https://github.com/mongodb/mongo-java-driver/commit/120c6dd0cf46370576303eadfde559dffdd20d84

Comment by Githook User [ 07/Feb/22 ]

Author:

{'name': 'Valentin Kovalenko', 'email': 'valentin.kovalenko@mongodb.com', 'username': 'stIncMale'}

Message: Refactor internal APIs to avoid requiring duplicate data in method/constructor params (#871)

JAVA-4471
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/ffe64ecee1f1699c10e0e92d381da0a2b6cd563d

Comment by Githook User [ 07/Feb/22 ]

Author:

{'name': 'Valentin Kovalenko', 'email': 'valentin.kovalenko@mongodb.com', 'username': 'stIncMale'}

Message: Introduce and use `Cluster.withLock` to achieve cluster-wide locking (#869)

JAVA-4471
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/47c99b5acc49c797c0fb8ee2178a4a6b86dca0c5

Comment by Valentin Kavalenka [ 04/Feb/22 ]

Thank you, robert.zilke@outlook.com!

Now it is clear that the threads id=20 and id=22 are blocking each other. They are both server monitor threads (DefaultServerMonitor.monitorThread) for servers instance-1 and instance-2 respectively.

The deadlock details

The monitor thread for instance-2 (id=22) tries to update its server description: it acquired DefaultSdamServerDescriptionManager.lock for instance-2 and proceeded into AbstractMultiServerCluster$DefaultServerDescriptionChangedListener.serverDescriptionChanged, where it needs to acquire another lock for the cluster to which both servers belong. That cluster lock is held by the other thread.

Concurrently, the monitor thread for instance-1 (id=20) also tries to update its server description: it acquired DefaultSdamServerDescriptionManager.lock for instance-1 (not for instance-2, which is why it succeeded) and proceeded into AbstractMultiServerCluster$DefaultServerDescriptionChangedListener.serverDescriptionChanged where, unlike the monitor thread for instance-2, it successfully acquired the cluster lock. It then proceeded to AbstractMultiServerCluster.invalidateOldPrimaries, discovered that instance-2 is the old primary, and went on to update its description. This, however, requires acquiring DefaultSdamServerDescriptionManager.lock for instance-2, which is held by the monitor thread for instance-2.
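
To make the cycle easier to see, here is a minimal, self-contained sketch of the same inverted lock ordering. All class, lock, and thread names below are stand-ins invented for this illustration; this is not the driver's code, and the program deadlocks on purpose.

import java.util.concurrent.locks.ReentrantLock;

// Minimal sketch of the inverted lock ordering described above (illustrative names only).
// With the sleeps below, the two threads almost always end up blocking each other.
public class MonitorDeadlockSketch {

    // Stand-ins for the per-server DefaultSdamServerDescriptionManager.lock instances
    // and for the cluster-wide lock (the MultiServerCluster monitor in the report).
    static final ReentrantLock serverLock1 = new ReentrantLock();
    static final ReentrantLock serverLock2 = new ReentrantLock();
    static final ReentrantLock clusterLock = new ReentrantLock();

    public static void main(String[] args) {
        // Monitor thread for instance-2 (id=22 in the report):
        // takes its own server lock, then waits for the cluster lock.
        Thread monitor2 = new Thread(() -> {
            serverLock2.lock();
            try {
                pause(); // widen the race window for the demo
                clusterLock.lock(); // blocks: held by monitor1
                clusterLock.unlock();
            } finally {
                serverLock2.unlock();
            }
        }, "monitor-instance-2");

        // Monitor thread for instance-1 (id=20 in the report):
        // takes its own server lock and the cluster lock, then tries to invalidate
        // the old primary (instance-2), which needs instance-2's server lock.
        Thread monitor1 = new Thread(() -> {
            serverLock1.lock();
            try {
                clusterLock.lock();
                try {
                    pause();
                    serverLock2.lock(); // blocks: held by monitor2 -> deadlock
                    serverLock2.unlock();
                } finally {
                    clusterLock.unlock();
                }
            } finally {
                serverLock1.unlock();
            }
        }, "monitor-instance-1");

        monitor2.start();
        monitor1.start();
    }

    private static void pause() {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}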

What changed since 4.3 that caused the problem

JAVA-3928 was done in 4.4. In 4.3, monitor threads updated the cluster (which results in the server becoming either selectable or not) and invalidated the pool without acquiring any lock to guard the two operations. In 4.4, the behavior of invalidating a pool was changed to also pause it (make it unusable until unpaused). With this change, the actions of updating the cluster and invalidating the pool had to be done atomically, to avoid exposing a server with a paused pool as selectable. DefaultSdamServerDescriptionManager.lock was introduced to guard these actions, but adding it made the code executed by server monitor threads prone to deadlocks.
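
For context on the direction of the fix: a single cluster-wide lock, acquired before any per-server state is touched, gives every monitor thread the same lock order while preserving the required atomicity. The commits referenced earlier in this ticket introduce Cluster.withLock for exactly this kind of cluster-wide locking; the sketch below only illustrates the idea, and its names and structure are assumptions rather than the driver's actual implementation.

import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Rough sketch of cluster-wide locking in the spirit of Cluster.withLock.
// Illustrative only; names and structure are not the driver's actual code.
final class ClusterWideLockSketch {

    private final ReentrantLock clusterLock = new ReentrantLock();

    // Run an action while holding the single cluster-wide lock.
    <T> T withLock(Supplier<T> action) {
        clusterLock.lock();
        try {
            return action.get();
        } finally {
            clusterLock.unlock();
        }
    }

    // A monitor thread wraps the whole "update description + pause/invalidate pool"
    // sequence in the cluster-wide lock: the two steps stay atomic, and because every
    // monitor thread takes the same single lock first, no lock-ordering cycle can form.
    void onServerCheckCompleted(ServerStub server, Object newDescription) {
        withLock(() -> {
            server.updateDescription(newDescription);
            server.pauseAndInvalidatePool();
            return null;
        });
    }

    // Hypothetical stand-in for a per-server component.
    interface ServerStub {
        void updateDescription(Object description);
        void pauseAndInvalidatePool();
    }
}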

We are working on fixing the bug.

Comment by Robert Zilke [ 04/Feb/22 ]

I attached the full thread dump, but I had to anonymize our customer name: threaddump-anonymized.tdump

Comment by Valentin Kavalenka [ 04/Feb/22 ]

Hi robert.zilke@outlook.com,
Thank you for the report.
The two threads id=22 and id=24 whose stack traces you provided cannot make progress because the thread id=20 is somewhere inside the synchronized block on the cluster object whose monitor they are both waiting to lock. However, since there is no stack trace for that thread, it is unclear what prevents it from progressing. While it is possible that it is trying to acquire a resource held by one of the two threads you specified (id=22 and id=24), it is also possible that it is blocked by a completely different thread.

Could you please attach the full thread dump? If that is not possible because it requires sharing more than you can, could you attach the stack trace for the thread id=20, and the stack trace of the thread that holds whatever thread id=20 is trying to acquire?
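
As an aside, if attaching a full dump is not an option, the stack traces of mutually blocked threads can also be captured programmatically from inside the affected process with the standard JDK ThreadMXBean API. A minimal sketch (illustrative only):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Print the stack traces of threads caught in a deadlock cycle
// (covers both synchronized monitors and ReentrantLock-style synchronizers).
public class DeadlockDumper {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long[] deadlocked = threads.findDeadlockedThreads();
        if (deadlocked == null) {
            System.out.println("No deadlocked threads detected");
            return;
        }
        for (ThreadInfo info : threads.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
            System.out.println(info.getThreadName() + " " + info.getThreadState()
                    + " waiting for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
            for (StackTraceElement frame : info.getStackTrace()) {
                System.out.println("    at " + frame);
            }
        }
    }
}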
