[JAVA-1881] Java application CPU 100% when mongod replica set member is stopped Created: 07/Jul/15  Updated: 01/Apr/16  Resolved: 10/Jul/15

Status: Closed
Project: Java Driver
Component/s: Async, Connection Management
Affects Version/s: 3.0.0
Fix Version/s: 3.1.0

Type: Bug Priority: Major - P3
Reporter: Cory Thomas Assignee: Ross Lawley
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

java version "1.7.0_79"
OpenJDK Runtime Environment (amzn-2.5.5.1.59.amzn1-x86_64 u79-b14)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

Amazon EC2 instances running Amazon Linux.

mongod 3.0.4
Async Java driver 3.0.2


Attachments: Text File app.log     File mongo-driver-test.tar.gz    

 Description   

The non-default configuration options used by the application:

readPreference: nearest
cluster:
  hosts: mongodb-oracle-a1-ci-set1, mongodb-oracle-b1-ci-set1, mongodb-oracle-c1-ci-set1
codeRegistry: jackson (https://github.com/ylemoigne/mongo-jackson-codec)

We have a replica set with 3 members, identified by host names mongodb-oracle-a1-ci-set1, mongodb-oracle-b1-ci-set1, and mongodb-oracle-c1-ci-set1. There is a Java application running on a fourth host. If I stop any one of the mongod daemons, primary or secondary, while the application is running, CPU for the application process goes to 100% and the application is unusable until restarted. I can replicate the issue consistently.

Attached is the log output when replicating the issue.

  1. The application is started with mongodb-oracle-a1-ci-set1 daemon stopped. Everything looks fine. The other two daemons are operating normally.
  2. At about 20:24:23, I start mongod on mongodb-oracle-a1-ci-set1. Everything looks fine. The new member joins the replica set and the log indicates that the application connects to it.
  3. At about 20:25:05, I stop mongod on mongodb-oracle-a1-ci-set1. The application process CPU goes to 100%. From the logs, it looks like the application still thinks its connected to the missing member.
    4. At about 20:26:04, I start the mongod on mongodb-oracle-a1-ci-set1. The application process CPU remains at 100%.

Note that no mongo queries are executed. The application is sitting idle when all this occurs. Once in a bad state, the application is unresponsive and must be restarted.



 Comments   
Comment by Jeffrey Yemin [ 16/Jul/15 ]

Fix is now available in the 3.0.3 release.

Comment by Cory Thomas [ 10/Jul/15 ]

Awesome! Thanks a bunch!

Comment by Ross Lawley [ 10/Jul/15 ]

Fixed in master and on the 3.0.x branch. The 3.0.3 release is currently scheduled for next week and a snapshot with this fix will be available shortly.

Alternatively, tests with netty showed that it wasn't impacted by this bug and is a possible workaround until then.

Comment by Githook User [ 10/Jul/15 ]

Author:

{u'username': u'rozza', u'name': u'Ross Lawley', u'email': u'ross.lawley@gmail.com'}

Message: Check for premature end of stream in Async

JAVA-1881
Branch: 3.0.x
https://github.com/mongodb/mongo-java-driver/commit/75d8cda40a52ee42c348a407489b9160bb85b2b9

Comment by Githook User [ 10/Jul/15 ]

Author:

{u'username': u'rozza', u'name': u'Ross Lawley', u'email': u'ross.lawley@gmail.com'}

Message: Check for premature end of stream in Async

JAVA-1881
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/e51a5d16a86f0eedb34a7cb48ab0f741f4b1b807

Comment by Ross Lawley [ 10/Jul/15 ]

Thanks cthomas, thats the issue unlike the sync version there isn't a guarded check for this.

Comment by Cory Thomas [ 09/Jul/15 ]

I think I have found a clue. I connected a debugger up to my simple app running on an EC2 instance. The problem seems to be located in the AsynchronousSocketChannelStream completion handler. According to AsynchronousSocketChannel documentation, the result value is -1 if no bytes could be read because the channel has reached end of stream. In the debugger, I am seeing the result value as -1 when the driver is trying to execute ismaster command. However, since the driver does not handle the result value, it just tries to read again. The read operation completes immediately because the channel is closed, so the CPU goes way up.

Comment by Cory Thomas [ 09/Jul/15 ]

I can not replicate the issue with a 3 node replica set running locally (3 mongo daemons running on localhost) on my OSX machine. I can reliably replicate the issue on 4 EC2 linux instances. 3 machines running mongo daemons and one machine running the provided application.

How I run it:

  1. Build the shaded jar: mvn clean package
  2. Upload the jar: scp target/mongo-driver-test-1.0-SNAPSHOT.jar host:~
  3. Run the jar: ssh ec2-user@host java -jar /home/ec2-user/mongo-driver-test-1.0-SNAPSHOT.jar
Comment by Ross Lawley [ 09/Jul/15 ]

Unfortunately, I've been unable to reproduce with your example code.

I've run it in IntelliJ and as a standalone jar, both times when taking down members of the replicaset I've seen the expected connection refused error messages. I haven't seen CPU usage rise and I've also been able to stop and start the same member of the replicaset multiple times without any visible impact on CPU.

Something is missing in my test case - how are you running the local app?

Comment by Cory Thomas [ 08/Jul/15 ]

It appears I was mistaken when I said stopping the master or a secondary causes the issue. Stopping the master does not appear to cause a problem; only stopping a secondary. In fact, I can kill the master, start it again so it comes up as a secondary (driver handles that fine), and then immediately stop it again (driver goes crazy).

I attached a very simple java app that I am now using to reproduce the problem. All it does is create a MongoClient object and then sleep.

Comment by Ross Lawley [ 08/Jul/15 ]

Thanks, let me try and reproduce.

Comment by Cory Thomas [ 08/Jul/15 ]

ross@10gen.com

  1. It is a web server waiting for requests. It does not do any querying of mongo except when handling a request and there are no requests coming in.
  2. I put the driver configuration options in the ticket description. All other options are left at their defaults. I can get the mongod logs if you think it would be helpful.
  3. I stopped only one instance in the replica set with init: service mongod stop. The init script does kill -TERM (and a kill -KILL if that doesn't work).
Comment by Ross Lawley [ 08/Jul/15 ]

Hi cthomas,

I have a couple of questions:

1. Can you describe in more detail what the application is doing - when you say idling I'm not sure what that means? Is it in some event loop checking mongodb, on a web server awaiting requests.
2. How you have configured the driver? if possible could you include your connection settings and add the logs from mongod.
3. How are you stopping the mongod replicaset? db.shutdownServer(), via kill <pid> or mongod --shutdown etc..

Ross

Generated at Thu Feb 08 08:55:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.