[JAVA-1206] Incorrect ping information causes continuous updating cluster information Created: 09/May/14  Updated: 23/Jun/15  Resolved: 23/Jun/15

Status: Closed
Project: Java Driver
Component/s: None
Affects Version/s: 2.12.0
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Jakub Kubrynski Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Linux

 Description   

After updating the Java driver to 2.12.0, GC activity rises by several dozen percent. In the logs I can find:
May 09, 2014 3:33:36 PM com.mongodb.ServerStateNotifier run
FINE: Checking status of s122712.dc2:27017
May 09, 2014 3:33:36 PM com.mongodb.BaseCluster updateDescription
FINE: Updating cluster description to {type=ReplicaSet, servers=[
{address=s122712.dc2:27017, type=ReplicaSetPrimary, averagePingTime=2883898000000 ms, state=Connected}
]

But ping gives:
ping s122712.dc2
PING s122712.dc2 (xx.xx.xx.xx) 56(84) bytes of data.
64 bytes from s122712.dc2 (xx.xx.xx.xx): icmp_req=1 ttl=64 time=0.322 ms
64 bytes from s122712.dc2 (xx.xx.xx.xx): icmp_req=2 ttl=64 time=0.359 ms

In fact, thousands of DBPort and MongoOptions objects are created every second, which kills the servers in less than an hour.
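For anyone trying to reproduce this, the FINE messages above appear to be standard java.util.logging output; a minimal, version-agnostic way to surface them is to raise the root logger and its handlers to FINE. This is only a sketch - in practice you would scope it to the driver's own logger names, which differ between driver versions.

    import java.util.logging.Handler;
    import java.util.logging.Level;
    import java.util.logging.Logger;

    // Sketch: raise java.util.logging to FINE globally so driver messages like
    // the ones above become visible. Scoping to the driver's logger names is
    // preferable, but those names vary between versions, so this stays generic.
    public class EnableFineLogging {
        public static void main(String[] args) {
            Logger root = Logger.getLogger("");
            root.setLevel(Level.FINE);
            for (Handler handler : root.getHandlers()) {
                handler.setLevel(Level.FINE);
            }
        }
    }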



 Comments   
Comment by Jeffrey Yemin [ 23/Jun/15 ]

Closing this, as I strongly suspect that it's been fixed in later patches in the 2.12 driver series. In particular, by JAVA-1202 and JAVA-1217.

Comment by MOISES J S TROVO [ 31/Jul/14 ]

We're using version 2.12.0, and I can pin this problem to that version because yesterday we rolled our client back to 2.11.0 and the problem is gone for now.

Yes, I saw it in the code and it's very weird, because in 30 minutes it received this event from the cluster 3.5 million times (an average of 1944 events/sec for a 2-node cluster!).

I'm away from my desk right now, but I will try it and see if it works better.

Comment by Jeffrey Yemin [ 31/Jul/14 ]

Moises,

Which driver version are you using? MultiServerCluster.handleReplicaSetMemberChanged adds a new server to the cluster only if that server is not already in it. See https://github.com/mongodb/mongo-java-driver/blob/2.12.x/src/main/com/mongodb/MultiServerCluster.java#L212-219.
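For illustration, the guard described above boils down to an add-only-if-absent check, roughly as follows (a simplified sketch; the map, the ServerMonitor type, and createMonitor are hypothetical names, not the driver's actual code):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // Simplified sketch of the add-only-if-absent behaviour described above.
    // ServerMonitor and createMonitor are hypothetical stand-ins, not driver code.
    class ClusterMembershipSketch {
        private final ConcurrentMap<String, ServerMonitor> monitors =
                new ConcurrentHashMap<String, ServerMonitor>();

        void onReplicaSetMemberChanged(String address) {
            // A monitor is created only for an address not already in the cluster,
            // so repeated member-changed events for known servers allocate nothing new.
            if (!monitors.containsKey(address)) {
                monitors.put(address, createMonitor(address));
            }
        }

        private ServerMonitor createMonitor(String address) {
            return new ServerMonitor(address);
        }

        static class ServerMonitor {
            private final String address;

            ServerMonitor(String address) {
                this.address = address;
            }
        }
    }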

It would be useful if those experiencing this problem try the 2.12.3 patch release, as there have been a number of important fixes to the cluster monitoring implementation since 2.12.0. Anyone willing to try?

Regards,
Jeff

Comment by MOISES J S TROVO [ 29/Jul/14 ]

I'm having the same problem on my cluster. I ran Mission Control and there's a lot of GC pressure coming from the DBPort / ServerStateNotifier classes within the MongoDB driver: about 3.5 million instances and 108 GB of total memory usage in a 30-minute dump.

I'm looking at the code, and it seems to be a problem with the method com.mongodb.MultiServerCluster.handleReplicaSetMemberChanged(ServerDescription), where it adds new servers to the cluster every time there's a change in the cluster.

Comment by Jakub Kubrynski [ 10/May/14 ]

At one point I saw MongoSocketException.Network objects. Unfortunately I don't have saved heap dumps containing those objects, but as I remember the exception comes from the ensureOpen method. I have autoReconnect = false.

Comment by Jeffrey Yemin [ 09/May/14 ]

Can you explain what you mean when you say that "there was Network exception in heap dump"? Can you add the exception stack trace (if you have it) to a comment on this issue?

Comment by Jakub Kubrynski [ 09/May/14 ]

Currently I pass only one server.

Comment by Jeffrey Yemin [ 09/May/14 ]

There should be a log message generated from here: https://github.com/mongodb/mongo-java-driver/blob/2.12.0/src/main/com/mongodb/ServerStateNotifier.java#L102. Perhaps there is a typo in the name of one of the servers in the seed list that the application is passing to the Mongo constructor?

Comment by Jakub Kubrynski [ 09/May/14 ]

Yep - I forgot that there was a Network exception in the heap dump. Maybe you could also extend the logging in that part, because there is nothing in the logs about it. It was thrown from DBPort.ensureOpen, as I remember.

Comment by Jeffrey Yemin [ 09/May/14 ]

I don't think it will solve it either, but it might give better diagnostics. ServerStateNotifier caches DBPort instances, and only constructs new ones when an IOException is thrown. Given that, we still have not been able to determine the root cause of the high allocation rate.
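For context, the caching behaviour described above amounts to reusing one connection across heartbeat runs and replacing it only after an IOException. A simplified sketch follows; Connection, openConnection and runIsMaster are hypothetical stand-ins for DBPort and the ismaster call, not the driver's API.

    import java.io.IOException;

    // Simplified sketch of the connection caching described above.
    class HeartbeatConnectionSketch {
        private Connection connection;  // reused across heartbeat runs

        void heartbeat(String address) {
            try {
                if (connection == null) {
                    connection = openConnection(address);  // allocated only when missing
                }
                runIsMaster(connection);
            } catch (IOException e) {
                // Discard the cached connection on failure; the next heartbeat opens
                // a new one. In steady state no new connections are allocated.
                connection = null;
            }
        }

        private Connection openConnection(String address) throws IOException {
            return new Connection();
        }

        private void runIsMaster(Connection connection) throws IOException {
            // stands in for sending the ismaster command and reading the reply
        }

        static class Connection {
        }
    }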

Comment by Jakub Kubrynski [ 09/May/14 ]

I'll check that on Monday, but I'm not sure it solves the giant allocation rate of DBPort objects.

Comment by Jeffrey Yemin [ 09/May/14 ]

Would it be possible for you to test against the latest 2.12.2-SNAPSHOT? It dedicates a thread per server for monitoring, and each thread name includes the server name. That way we can see if it's one particular server that's causing the issue.

The latest SNAPSHOT is at https://oss.sonatype.org/content/repositories/snapshots/org/mongodb/mongo-java-driver/2.12.2-SNAPSHOT/mongo-java-driver-2.12.2-20140507.143852-5.jar.

I opened JAVA-1207 about the incorrect log message, but the fix is in code review and not yet in the SNAPSHOT.
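As an illustration of the per-server monitoring threads mentioned above, naming each thread after the server it monitors can be done with a ThreadFactory along these lines (a sketch with hypothetical names, not the driver's actual code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ThreadFactory;

    // Sketch: one monitoring executor per server, with the server address baked
    // into the thread name so thread dumps and profiler views show which member
    // each monitor belongs to.
    class NamedMonitorThreads {
        static ScheduledExecutorService monitorExecutorFor(final String serverAddress) {
            return Executors.newSingleThreadScheduledExecutor(new ThreadFactory() {
                @Override
                public Thread newThread(Runnable task) {
                    Thread thread = new Thread(task, "cluster-monitor-" + serverAddress);
                    thread.setDaemon(true);
                    return thread;
                }
            });
        }
    }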

Comment by Jakub Kubrynski [ 09/May/14 ]

My fault - I converted the ping time only once. I used a profiler and there was a pretty high allocation rate on those objects. There was only one Mongo object instance. The problem is that GC pauses are getting longer and longer, and CPU usage is at 100% without any user traffic. As I said, all the objects are created in a ThreadPoolExecutor, and the DBPort objects are created by ServerStateNotifier in cluster-2-thread-1.

Comment by Jeffrey Yemin [ 09/May/14 ]

Thanks for spotting the bug in the logging statement. Reversing the bug gives us an actual ping time of a bit under 3 ms.

        // Uses static imports of java.util.concurrent.TimeUnit.{MILLISECONDS, NANOSECONDS}
        long loggedValue = 2883898000000L;  // the value the driver logged as "ms"
        System.out.println(loggedValue);
        // Undoing the logging bug (dividing by 1,000,000) recovers the real nanosecond value
        long averagePingTimeNanos = MILLISECONDS.convert(loggedValue, NANOSECONDS);
        System.out.println(averagePingTimeNanos + " nanoseconds");
        System.out.println(MILLISECONDS.convert(averagePingTimeNanos, NANOSECONDS) + " ms");

prints:

2883898000000
2883898 nanoseconds
2 ms

which is much closer to the expected ping time. How did you calculate the 48 minute value?

Looking a bit further, this bug is only in the log message, and the calculations everywhere else that use the raw value look to be correct. So I think the incorrect ping time reported in the logs is not the root cause of the problem.

What we still need to figure out is why there appear to be thousands of DBPort and MongoOptions objects created every second. ServerStateNotifier only runs once every 5 seconds for each server in the replica set, as long as all members are reachable, and your logs seem to confirm that behavior. Can you tell me a bit about how you are determining the rate of object creation for DBPort and MongoOptions? Also, when you say that it's killing the server, what do you mean exactly? Are threads throwing OutOfMemoryError?
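For reference, the once-every-5-seconds cadence described above corresponds to one fixed-rate task per replica set member, roughly as follows (an illustrative sketch, not the driver's code):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch of the monitoring cadence described above: one fixed-rate
    // task per replica set member, every 5 seconds. Even five members yield only
    // about one heartbeat per second in total, nowhere near thousands of objects
    // per second.
    class MonitoringCadenceSketch {
        private final ScheduledExecutorService scheduler =
                Executors.newScheduledThreadPool(1);

        void startMonitoring(Runnable heartbeatTask) {
            scheduler.scheduleAtFixedRate(heartbeatTask, 0, 5, TimeUnit.SECONDS);
        }
    }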

Regards,
Jeff

Comment by Jakub Kubrynski [ 09/May/14 ]

Those logs contain only the lines I've pasted, repeated every 5 seconds. I've checked and I can execute db.isMaster() on that host with the same user that is used by my application. We have two hosts, two passives and one arbiter, all available from the application server, with pings < 0.4 ms to the hosts and arbiter and < 10 ms to the passives.
isMaster is true for that server.

Also, you're incorrectly converting nanos to milliseconds in https://github.com/mongodb/mongo-java-driver/blob/2.12.0/src/main/com/mongodb/ServerDescription.java#L486, because

NANOSECONDS.convert(averagePingTimeNanos, MILLISECONDS)

will convert millis to nanos. You have to use

NANOSECONDS.toMillis(averagePingTimeNanos)

instead.

Edit: I've converted the time you reported and it's about 48 minutes. Those logs were collected about 2-3 minutes after starting the server, so it's definitely incorrect.
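To make the difference between the two conversions concrete, here is a small self-contained demo using the values from this thread (2883898 ns is the recovered measurement, just under 3 ms):

    import static java.util.concurrent.TimeUnit.MILLISECONDS;
    import static java.util.concurrent.TimeUnit.NANOSECONDS;

    // Demonstrates the logging bug: NANOSECONDS.convert(x, MILLISECONDS) treats x
    // as milliseconds and multiplies by 1,000,000, while NANOSECONDS.toMillis(x)
    // divides by 1,000,000.
    public class PingTimeConversionDemo {
        public static void main(String[] args) {
            long averagePingTimeNanos = 2883898L;  // the real measurement, just under 3 ms

            // Buggy conversion used in the log statement: prints 2883898000000
            System.out.println(NANOSECONDS.convert(averagePingTimeNanos, MILLISECONDS) + " (buggy value logged as ms)");

            // Correct conversion: prints 2
            System.out.println(NANOSECONDS.toMillis(averagePingTimeNanos) + " ms (correct)");

            // Reading the buggy logged value as nanoseconds gives roughly 48 minutes,
            // matching the calculation in the edit above: prints 48
            System.out.println(NANOSECONDS.toMinutes(2883898000000L) + " minutes");
        }
    }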

Comment by Jeffrey Yemin [ 09/May/14 ]

The ping time that the driver reports is not the result of an actual ping, so it's confusing that the driver calls it that. It's the average of the server's response time to the 'ismaster' command, which is what ServerStateNotifier is calling. For reference: https://github.com/mongodb/mongo-java-driver/blob/2.12.0/src/main/com/mongodb/ServerStateNotifier.java#L131-140
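For illustration, the measurement amounts to timing the ismaster round trip and folding it into a running average, along these lines (a simplified sketch; the smoothing factor is illustrative, not the driver's exact formula):

    // Simplified sketch of the "ping time" described above: the round-trip time
    // of the ismaster command folded into a running average.
    class AveragePingTimeSketch {
        private long averagePingTimeNanos = 0;

        void recordIsMasterRoundTrip(Runnable isMasterCall) {
            long start = System.nanoTime();
            isMasterCall.run();  // stands in for sending ismaster and reading the reply
            long elapsedNanos = System.nanoTime() - start;

            averagePingTimeNanos = averagePingTimeNanos == 0
                    ? elapsedNanos
                    : (4 * averagePingTimeNanos + elapsedNanos) / 5;  // illustrative smoothing
        }

        long getAveragePingTimeNanos() {
            return averagePingTimeNanos;
        }
    }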

To assist us in getting to the root cause of this issue, can you provide an attachment containing all the log messages generated by the driver for a 30 second period? If that's not something you can do in a public forum, I can make this ticket private.

Regards,
Jeff

Comment by Jakub Kubrynski [ 09/May/14 ]

There are no such servers in the replica set, and as you can see, the Mongo driver reports an unrealistic ping time to a server that is available and whose ping is below 0.4 ms.

Comment by Jeffrey Yemin [ 09/May/14 ]

Are there any servers in the replica set that are not reachable by the application? If so, can you try increasing the value of com.mongodb.MongoClientOptions.Builder#heartbeatConnectRetryFrequency to something like 1000 ms?
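For anyone trying that suggestion, the option can be set when the client is constructed, roughly as follows (this assumes the 2.12.x builder method takes the value as an int of milliseconds; check the Javadoc for your exact driver version):

    import com.mongodb.MongoClient;
    import com.mongodb.MongoClientOptions;
    import com.mongodb.ServerAddress;

    // Sketch of the suggested tweak; assumes the 2.12.x builder takes the
    // heartbeat connect retry frequency in milliseconds.
    public class SlowerHeartbeatRetry {
        public static void main(String[] args) throws Exception {
            MongoClientOptions options = MongoClientOptions.builder()
                    .heartbeatConnectRetryFrequency(1000)  // retry unreachable members every 1000 ms
                    .build();
            MongoClient client = new MongoClient(new ServerAddress("s122712.dc2", 27017), options);
        }
    }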

Comment by Jakub Kubrynski [ 09/May/14 ]

Not only there - those objects are also created in ServerStateNotifier, and I've seen that DBPort and MongoOptions are being allocated in a ThreadPoolExecutor.
The second thing is that we're using Spring Data Mongo and we've updated only the Mongo Java Driver. After rolling back to 2.11.3, everything works great.

Comment by Jeffrey Yemin [ 09/May/14 ]

MongoOptions instances are only created when you create a new Mongo instance. Can you see where in your application you are creating all those MongoOptions instances?
