[JAVA-4471] Election in replica set can cause deadlock in monitor threads, leading to unavailability Created: 03/Feb/22 Updated: 28/Oct/23 Resolved: 07/Feb/22
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Cluster Management |
| Affects Version/s: | 4.4.0 |
| Fix Version/s: | 4.4.2 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Robert Zilke | Assignee: | Valentin Kavalenka |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | external-user |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | threaddump-anonymized.tdump |
| Issue Links: |
| Case: | (copied to CRM) |
| Backwards Compatibility: | Fully Compatible |
| Documentation Changes: | Not Needed |
| Description |
|
A couple of times, shortly after an election in our replica set of three MongoDB instances, we have discovered blocked threads in some of our Java processes, which makes the affected services unresponsive:
|
cluster-ClusterId{value='61f29e20622d627cc0e99c19', description='null'}-mongodb-instance-3:27017 id=24 state=BLOCKED
- waiting to lock <0x0b2400cb> (a com.mongodb.internal.connection.MultiServerCluster)
  owned by cluster-ClusterId{value='61f29e20622d627cc0e99c19', description='null'}-mongodb-instance-1:27017 id=20
  at com.mongodb.internal.connection.AbstractMultiServerCluster.onChange(AbstractMultiServerCluster.java:175)
  at com.mongodb.internal.connection.AbstractMultiServerCluster.access$100(AbstractMultiServerCluster.java:50)
  at com.mongodb.internal.connection.AbstractMultiServerCluster$DefaultServerDescriptionChangedListener.serverDescriptionChanged(AbstractMultiServerCluster.java:139)
  at com.mongodb.internal.connection.DefaultSdamServerDescriptionManager.updateDescription(DefaultSdamServerDescriptionManager.java:127)
  at com.mongodb.internal.connection.DefaultSdamServerDescriptionManager.update(DefaultSdamServerDescriptionManager.java:81)
  at com.mongodb.internal.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:165)
  at java.base@11.0.12/java.lang.Thread.run(Thread.java:829)
Locked synchronizers: count = 1
- java.util.concurrent.locks.ReentrantLock$NonfairSync@32e56896
Second blocked thread:
cluster-ClusterId{value='61f29e20622d627cc0e99c19', description='null'}-mongodb-instance-2:27017 id=22 state=BLOCKED
- waiting to lock <0x0b2400cb> (a com.mongodb.internal.connection.MultiServerCluster)
  owned by cluster-ClusterId{value='61f29e20622d627cc0e99c19', description='null'}-mongodb-instance-1:27017 id=20
  at com.mongodb.internal.connection.AbstractMultiServerCluster.onChange(AbstractMultiServerCluster.java:175)
  at com.mongodb.internal.connection.AbstractMultiServerCluster.access$100(AbstractMultiServerCluster.java:50)
  at com.mongodb.internal.connection.AbstractMultiServerCluster$DefaultServerDescriptionChangedListener.serverDescriptionChanged(AbstractMultiServerCluster.java:139)
  at com.mongodb.internal.connection.DefaultSdamServerDescriptionManager.updateDescription(DefaultSdamServerDescriptionManager.java:127)
  at com.mongodb.internal.connection.DefaultSdamServerDescriptionManager.update(DefaultSdamServerDescriptionManager.java:81)
  at com.mongodb.internal.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:165)
  at java.base@11.0.12/java.lang.Thread.run(Thread.java:829)
Locked synchronizers: count = 1
- java.util.concurrent.locks.ReentrantLock$NonfairSync@166cbc02
We renamed the hosts in the thread dump but kept the instance numbers.
The only way to fix this problem is to restart our processes.
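Since both locks involved are ownable synchronizers (ReentrantLock, per the "Locked synchronizers" lines above), a deadlock of this shape can at least be confirmed at runtime via ThreadMXBean. A minimal sketch, with a class name of our own choosing:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public final class DeadlockProbe {
        public static void main(String[] args) {
            ThreadMXBean mx = ManagementFactory.getThreadMXBean();
            // findDeadlockedThreads() covers both monitors and ownable
            // synchronizers such as java.util.concurrent.locks.ReentrantLock.
            long[] ids = mx.findDeadlockedThreads();
            if (ids == null) {
                System.out.println("no deadlock detected");
                return;
            }
            for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
                System.out.printf("%s is waiting for %s held by %s%n",
                        info.getThreadName(), info.getLockName(), info.getLockOwnerName());
            }
        }
    }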
We have seen this problem since upgrading from MongoDB Driver 4.2.3 (Spring Boot 2.5.6) to MongoDB Driver 4.4.0 (Spring Boot 2.6.2), and later to MongoDB Driver 4.4.1 (Spring Boot 2.6.3).
Our MongoDB Server instances run version 4.2.2, which should be compatible according to https://docs.mongodb.com/drivers/java/sync/current/compatibility/.
Unfortunately, we are not able to reproduce this behavior; we just see it at irregular intervals and have to restart our services.
We can provide more information if needed.
Thanks for your help!
| Comments |
| Comment by Githook User [ 07/Feb/22 ] |
|
Author: {'name': 'Valentin Kovalenko', 'email': 'valentin.kovalenko@mongodb.com', 'username': 'stIncMale'}
Message: Introduce and use `Cluster.withLock` to achieve cluster-wide locking (#872)

This is a backport of https://github.com/mongodb/mongo-java-driver/pull/869.
|
| Comment by Githook User [ 07/Feb/22 ] |
|
Author: {'name': 'Valentin Kovalenko', 'email': 'valentin.kovalenko@mongodb.com', 'username': 'stIncMale'}
Message: Refactor internal APIs to avoid requiring duplicate data in method/constructor params (#871)
|
| Comment by Githook User [ 07/Feb/22 ] |
|
Author: {'name': 'Valentin Kovalenko', 'email': 'valentin.kovalenko@mongodb.com', 'username': 'stIncMale'}
Message: Introduce and use `Cluster.withLock` to achieve cluster-wide locking (#869)
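The commit message names `Cluster.withLock`; a minimal sketch of the pattern it describes, funneling every SDAM state change through a single cluster-wide lock, assuming a shape that may differ from the driver's actual API:

    import java.util.concurrent.locks.ReentrantLock;
    import java.util.function.Supplier;

    // With a single cluster-wide lock guarding all SDAM state changes,
    // two monitor threads can no longer take a per-server lock and the
    // cluster lock in opposite orders.
    final class ClusterLockingSketch {
        private final ReentrantLock clusterLock = new ReentrantLock();

        <T> T withLock(Supplier<T> action) {
            clusterLock.lock();
            try {
                return action.get();
            } finally {
                clusterLock.unlock();
            }
        }
    }

A server-description update would then run as cluster.withLock(() -> { updateDescription(newDescription); return null; }) (call site hypothetical), making the cluster lock the only ordering that matters.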
|
| Comment by Valentin Kavalenka [ 04/Feb/22 ] |
|
Thank you, robert.zilke@outlook.com! Now it is clear that the threads id=20 and id=22 are blocking each other. They are both server monitor threads (DefaultServerMonitor.monitorThread), for servers instance-1 and instance-2 respectively.

The deadlock details

The monitor thread for instance-2 (id=22) tried to update its server description, so it acquired DefaultSdamServerDescriptionManager.lock for instance-2 and proceeded into AbstractMultiServerCluster$DefaultServerDescriptionChangedListener.serverDescriptionChanged, where it needs to acquire another lock for the cluster to which both servers belong. This cluster lock is being held by the other thread.

Concurrently, the monitor thread for instance-1 (id=20) also tried to update its server description, so it acquired DefaultSdamServerDescriptionManager.lock for instance-1 (not for instance-2, which is why it succeeded) and proceeded into AbstractMultiServerCluster$DefaultServerDescriptionChangedListener.serverDescriptionChanged, where, unlike the monitor thread for instance-2, it successfully acquired the cluster lock. It then proceeded to AbstractMultiServerCluster.invalidateOldPrimaries, discovered that instance-2 is the old primary, and so proceeded to update instance-2's description. This, however, requires acquiring DefaultSdamServerDescriptionManager.lock for instance-2, which is being held by the monitor thread for instance-2.

What changed since 4.3 that caused the problem

We are working on fixing the bug.
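To make the cycle concrete, here is a minimal, self-contained sketch of the two acquisitions described above. All names are illustrative rather than the driver's actual fields, and the instance-1 monitor's own server lock is omitted because it plays no part in the cycle; running this parks both threads forever:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.locks.ReentrantLock;

    public final class SdamDeadlockSketch {
        // Stand-ins for the cluster lock and for
        // DefaultSdamServerDescriptionManager.lock of instance-2.
        static final ReentrantLock clusterLock = new ReentrantLock();
        static final ReentrantLock server2Lock = new ReentrantLock();

        public static void main(String[] args) {
            CountDownLatch firstLocksHeld = new CountDownLatch(2);

            // Monitor thread for instance-2 (id=22): holds its own server
            // lock, then needs the cluster lock.
            Thread monitor2 = new Thread(() -> {
                server2Lock.lock();
                try {
                    firstLocksHeld.countDown();
                    await(firstLocksHeld);
                    clusterLock.lock(); // blocks: held by monitor1
                    clusterLock.unlock();
                } finally {
                    server2Lock.unlock();
                }
            }, "monitor-instance-2");

            // Monitor thread for instance-1 (id=20): holds the cluster lock,
            // then, while invalidating the old primary, needs instance-2's
            // server lock.
            Thread monitor1 = new Thread(() -> {
                clusterLock.lock();
                try {
                    firstLocksHeld.countDown();
                    await(firstLocksHeld);
                    server2Lock.lock(); // blocks: held by monitor2 -> deadlock
                    server2Lock.unlock();
                } finally {
                    clusterLock.unlock();
                }
            }, "monitor-instance-1");

            monitor2.start();
            monitor1.start();
        }

        private static void await(CountDownLatch latch) {
            try {
                latch.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }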
| Comment by Robert Zilke [ 04/Feb/22 ] |
|
I attached the full thread dump, but I had to anonymize our customer name: threaddump-anonymized.tdump
| Comment by Valentin Kavalenka [ 04/Feb/22 ] |
|
Hi robert.zilke@outlook.com, could you please attach the full thread dump? If that is not possible because it would require sharing more than you can, could you attach the stack trace of the thread id=20, and the stack trace of the thread that holds whatever this thread is trying to acquire?