[JAVA-2238] Connection and thread leakage in DefaultServerMonitor Created: 06/Jul/16  Updated: 01/Dec/16  Resolved: 15/Jul/16

Status: Closed
Project: Java Driver
Component/s: Connection Management
Affects Version/s: 3.0.0
Fix Version/s: 3.3.0

Type: Bug Priority: Critical - P2
Reporter: Alexander Bulaev Assignee: Jeffrey Yemin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by JAVA-2290 Server monitor thread dies Closed

 Description   

When experiencing network issues between mongos and the application server, we found massive connection and thread leakage. In jstack it looks like this:

"cluster-ClusterId{value='577cd577c333491dceb8c8af', description='null'}-mongos01h.load.music.yandex.net:27017" #5957 daemon prio=5 os_prio=0 tid=0x00007f75e039c000 nid=0x18185 waiting on condition [0x00007f6b00d4c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000386c3b1c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForSignalOrTimeout(DefaultServerMonitor.java:238)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForNext(DefaultServerMonitor.java:219)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:168)
        - locked <0x000000038c39b358> (a com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable)
        at java.lang.Thread.run(Thread.java:745)
 
"cluster-ClusterId{value='577cd577c333491dceb8c8af', description='null'}-mongos01h.load.music.yandex.net:27017" #5956 daemon prio=5 os_prio=0 tid=0x00007f6be819a800 nid=0x18184 waiting on condition [0x00007f6b00a49000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000386c3b1c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForSignalOrTimeout(DefaultServerMonitor.java:238)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForNext(DefaultServerMonitor.java:219)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:168)
        - locked <0x0000000396151050> (a com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable)
        at java.lang.Thread.run(Thread.java:745)
 
"cluster-ClusterId{value='577cd577c333491dceb8c8af', description='null'}-mongos01h.load.music.yandex.net:27017" #5948 daemon prio=5 os_prio=0 tid=0x00007f7620083800 nid=0x1817c waiting on condition [0x00007f6b00b4a000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000386c3b1c8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForSignalOrTimeout(DefaultServerMonitor.java:238)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForNext(DefaultServerMonitor.java:219)
        at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:168)
        - locked <0x0000000396151230> (a com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable)
        at java.lang.Thread.run(Thread.java:745)

We saw up to 6-7 thousand of these useless threads per application server. At that point our mongos was killed by a "too many open files" error.



 Comments   
Comment by Roy Rim [ 01/Dec/16 ]

Hi Matt,

This is Roy Rim, I've been working with Josh.

I don't believe that the 5,280 instances of these threads are due to the bug. "cluster-ClusterId{value='581757785d1b1e2a13be626d', description='null'}-10.10.10.10:27017" is generated from the MongoClient clusterId and the specific mongod/mongos address and port.

If you attempt to get unique values of just the MongoClient clusterId you will end up with 1056 instances:

Roys-MacBook-Pro-2:Downloads royrim$ grep "cluster-ClusterId" QA_10-230-209-190.jstack | grep -o '"cluster-ClusterId[^,]\+' | sort | uniq -c | wc -l
    1056

1056 multiplied by the number of nodes in the replica set equals 5,280. You can see that list by running the below:

Roys-MacBook-Pro-2:Downloads royrim$ grep "cluster-ClusterId" QA_10-230-209-190.jstack | grep -o '"[^"]\+"' | sort | uniq -c
   1 "cluster-ClusterId{value='581757325d1b1e2a13be61c3', description='null'}-10.225.171.132:27017"
   1 "cluster-ClusterId{value='581757325d1b1e2a13be61c3', description='null'}-10.225.172.138:27017"
   1 "cluster-ClusterId{value='581757325d1b1e2a13be61c3', description='null'}-10.230.212.167:27017"
   1 "cluster-ClusterId{value='581757325d1b1e2a13be61c3', description='null'}-10.230.212.49:27017"
   1 "cluster-ClusterId{value='581757325d1b1e2a13be61c3', description='null'}-10.230.212.7:27017"
   1 "cluster-ClusterId{value='5817573a5d1b1e2a13be61c4', description='null'}-10.225.171.132:27017"
   1 "cluster-ClusterId{value='5817573a5d1b1e2a13be61c4', description='null'}-10.225.172.138:27017"
   1 "cluster-ClusterId{value='5817573a5d1b1e2a13be61c4', description='null'}-10.230.212.167:27017"
   1 "cluster-ClusterId{value='5817573a5d1b1e2a13be61c4', description='null'}-10.230.212.49:27017"
   1 "cluster-ClusterId{value='5817573a5d1b1e2a13be61c4', description='null'}-10.230.212.7:27017"

If these 1056 instances were generated by the bug identified by JAVA-2238 then you would not see 1056 unique values of the clusterId, instead you would see 1056 instances of the same clusterId or just a few variations. Each unique clusterId represents a new MongoClient getting instantiated. Something else is causing MongoClient instances to be created over and over again.

I have asked Josh to look into any recent changes in the code base. Ideally the code should ensure that MongoClient instances are closed when appropriate. For instance, when deploying a bundle, if the JVM itself is not shut down and new MongoClient instances are brought up, the old instances should be closed.
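The close-before-replace pattern described above can be sketched generically. The `ClientHolder` class below is hypothetical (not part of the driver); `java.io.Closeable` stands in for `com.mongodb.MongoClient`, which likewise exposes a `close()` method that releases its monitor threads and pooled connections.

```java
import java.io.Closeable;
import java.io.IOException;

// Hypothetical holder: before installing a replacement client
// (e.g. on a bundle redeploy where the JVM stays up), close the
// previous one so its daemon monitor threads and sockets are released.
public final class ClientHolder<T extends Closeable> {
    private T current;

    // Swap in a new client, closing the old one first.
    public synchronized T replace(T replacement) throws IOException {
        if (current != null) {
            current.close();   // stops monitor threads, frees connections
        }
        current = replacement;
        return current;
    }

    public synchronized T get() {
        return current;
    }
}
```

Without a step like this, every redeploy leaks one full set of cluster/monitor threads per MongoClient, which matches the many distinct clusterIds seen in the jstack output.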

Regards,
Roy Rim

Comment by Matt Robson [ 01/Dec/16 ]

Thanks. If that's the case, then we are still looking at a similar problem, even with 3.3.0.

We are looking at 5,300 of these threads across 5 mongos instances. The stack below does appear to line up with the corrected commit.

"cluster-ClusterId{value='581757785d1b1e2a13be626d', description='null'}-10.10.10.10:27017" daemon prio=10 tid=0x00007fdf3daec800 nid=0x39b4 waiting on condition [0x00007fdead334000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000006f65061b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
	at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForSignalOrTimeout(DefaultServerMonitor.java:238)
	at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForNext(DefaultServerMonitor.java:219)
	at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:156)
	- locked <0x00000006f6506188> (a com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable)
	at java.lang.Thread.run(Thread.java:745)
 
   Locked ownable synchronizers:
	- None

Edit:
Regardless, we are going to test with 3.4 as well.

Comment by Jeffrey Yemin [ 01/Dec/16 ]

Hi Matt,

I looked more closely and discovered that the r3.3.0 tag was not on the correct commit. That's been fixed, and you can now see that the fix for this issue was indeed included in the 3.3.0 release, as suggested by the fix version.

Thanks very much for bringing this to our attention.

Comment by Jeffrey Yemin [ 29/Nov/16 ]

The 3.4.0 release will be announced today (it's already available from Maven Central). Will you be able to upgrade to 3.4.0?

Comment by Matt Robson [ 29/Nov/16 ]

Morning!

The fix for this issue was committed to the 3.3.x branch on July 15th, but missed the 3.3.0 tag cut on June 30th, so this is not fixed in 3.3.0.

https://github.com/mongodb/mongo-java-driver/blob/r3.3.0/driver-core/src/main/com/mongodb/connection/DefaultServerMonitor.java

This is only fixed in 3.4.

Edit:
Is it possible to cut a 3.3.1 release to include this fix as it's already committed into the 3.3.x branch?

Comment by Githook User [ 15/Jul/16 ]

Author: Jeff Yemin (jyemin) <jeff.yemin@10gen.com>

Message: JAVA-2238: Remove DefaultServerMonitor.invalidate and replace usage of it with DefaultServerMonitor.check. This removes any need to create a new monitor thread and simplifies any necessary synchronization that goes along with that.
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/2c14b2ba9bee8d72fc080ba672ba1fa07a60b3c0
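The fix described in the commit message can be illustrated with a self-contained sketch (class and method names here are mine, not the driver's): a single long-lived monitor thread waits on a condition, and `check()` signals that thread to run its heartbeat early instead of spawning a replacement thread, which is what leaked under repeated invalidation.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of the signal-instead-of-respawn pattern.
public class MonitorSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition condition = lock.newCondition();
    private final AtomicInteger heartbeats = new AtomicInteger();
    private volatile boolean closed;

    private final Thread monitorThread = new Thread(() -> {
        while (!closed) {
            heartbeats.incrementAndGet();   // stand-in for the ismaster check
            waitForSignalOrTimeout();
        }
    });

    public void start() {
        monitorThread.setDaemon(true);
        monitorThread.start();
    }

    // Equivalent of check(): wake the one existing monitor thread so it
    // re-checks immediately. No new thread is ever created.
    public void check() {
        lock.lock();
        try {
            condition.signal();
        } finally {
            lock.unlock();
        }
    }

    public void close() {
        closed = true;
        check();
    }

    public int heartbeatCount() {
        return heartbeats.get();
    }

    private void waitForSignalOrTimeout() {
        lock.lock();
        try {
            condition.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            lock.unlock();
        }
    }
}
```

This is only a sketch: the real `DefaultServerMonitor` carries extra state around the wait so a signal arriving while the heartbeat is in flight is not lost, but the thread-lifecycle idea is the same.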

Comment by Alexander Bulaev [ 07/Jul/16 ]

Also a hint from our load testers: set a low ulimit for mongos so it dies from too many open connections and restarts in a cycle (we use our own scripts to restart broken mongos instances). This is roughly what happened in our production environment.

Comment by Alexander Bulaev [ 07/Jul/16 ]

Ross, we have performed our own test in the load environment and the fix definitely works. We also plan to deploy a forked and patched driver in production until this fix is merged upstream.
For reference, our load environment details:
1 x application server
2 x mongos (on separate machines)
5 x mongod (sharded)
1 x config server
The more mongos instances you have, the worse the problem is (this is clear from my PR).
To reproduce the failure, one has to abruptly turn mongos off and back on many times under heavy load (i.e. imitating a typical multi-threaded web server workload). Introducing some network loss also seems to help but isn't strictly necessary.

Comment by Ross Lawley [ 07/Jul/16 ]

Thanks for confirming, alexbool.

Could you also provide environment details as well? I'm hoping to reproduce the issue locally to ensure there are no other fixes needed.

All the best,

Ross

Comment by Alexander Bulaev [ 07/Jul/16 ]

Hi Ross, we reproduced this bug in the 3.2.x branch, including 3.2.2, and in numerous earlier versions.

Comment by Ross Lawley [ 07/Jul/16 ]

Hi alexbool,

Thanks for the ticket and the PR - can you confirm the version you are using and seeing this issue on?

Ross

Comment by Alexander Bulaev [ 06/Jul/16 ]

I've opened a PR: https://github.com/mongodb/mongo-java-driver/pull/359

Generated at Thu Feb 08 08:56:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.