[JAVA-2034] Just after creating the mongo client, the mongo driver may lose a single operation Created: 10/Nov/15  Updated: 08/Dec/15  Resolved: 13/Nov/15

Status: Closed
Project: Java Driver
Component/s: Async
Affects Version/s: 3.1.0
Fix Version/s: 3.2.0

Type: Bug Priority: Critical - P2
Reporter: Peter Hendriks Assignee: Jeffrey Yemin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Java Source File ConcurrentRequestsDuringStartupExample2.java     Java Source File JAVA2034.java    

 Description   

When multiple operations are executed on an async mongo client just after it has been created, a single operation may be lost, and no callback is fired or timeout occurs. This may lead to execution timeout errors in production code and may affect all async mongo users that use the client directly after initialisation.

The attached example attempts to demonstrate the behavior. When the commented sleep() is uncommented, the example should always succeed. Otherwise, the example should fail most of the time with a "1 remaining operations" message. Note that sometimes, this example succeeds, please run it multiple times when the problem is not immediately present.



 Comments   
Comment by Githook User [ 13/Nov/15 ]

Author:

{u'username': u'jyemin', u'name': u'Jeff Yemin', u'email': u'jeff.yemin@10gen.com'}

Message: JAVA-2034: Stop interrupting the wait queue handler in BaseCluster, as it's possible that the interrupt flag will be checked somewhere else besides the wait queue handler itself.
Additionally, ensure that exceptions thrown in DefaultConnectionPool.getAsync are propagated to the callback.
Branch: 3.1.x
https://github.com/mongodb/mongo-java-driver/commit/8f3bebff4daa49da43e64d2242afa3a5707a887b

Comment by Githook User [ 13/Nov/15 ]

Author:

{u'username': u'jyemin', u'name': u'Jeff Yemin', u'email': u'jeff.yemin@10gen.com'}

Message: JAVA-2034: Stop interrupting the wait queue handler in BaseCluster, as it's possible that the interrupt flag will be checked somewhere else besides the wait queue handler itself.
Additionally, ensure that exceptions thrown in DefaultConnectionPool.getAsync are propagated to the callback.
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/87113ea591aac67df11b728a33139b7b29547b7d

Comment by Jeffrey Yemin [ 12/Nov/15 ]

Hi Peter,

We can reproduce this now and are looking for the root cause.

Thanks again,
Jeff

Comment by Peter Hendriks [ 12/Nov/15 ]

I've re-run the tests with your suggestions:

1. Each time the test fails, _id value 0 is the only one missing from the collection.
2. Nothing specific happens by raising the timeout to 60 seconds. I actually expected a timeout error to be returned from the driver, but this also does not occur: the test simply waits until the timeout expires and no logging occurs other than some cluster status debug messages.

Comment by Jeffrey Yemin [ 12/Nov/15 ]

Hi Peter,

A couple of follow-up questions:

  1. The documents have _id values from 0 to 499. Which of those 500 documents is missing from the collection?
  2. Does the same thing happen as you increase the timeout from 4 seconds to, say, 1 minute?

Thanks,
Jeff

Comment by Peter Hendriks [ 12/Nov/15 ]

Hi Jeff, thanks for the follow-up. If I run the JAVA2034 example, I get the same problem at the first iteration, and the following value is printed:

0: 1 remaining operations

In the database, 499 documents are inserted out of the expected 500. On my machine, I need to add at least a sleep of 110 millis to make the example succeed. Note that we observed this problem on Linux and OSX, Java 1.8.0_60. I hope that helps!

Comment by Jeffrey Yemin [ 10/Nov/15 ]

Hi Peter,

I modified your test program a bit (see attached JAVA2034.java), making the following changes:

  • Added a loop at the top so that it doesn't have to be run over and over to get it to fail, and dropped the collection at the end of the loop (and slept for a second)
  • Check the return value of CountDownLatch.await, and only exit if it returns false (indicating a timeout).
  • Added some debug output
  • Changed the inserted document to set the _id to the loop counter.

After 1000 iterations, I observed no failures. Please try running this modified test and report back the results. If you do see a failure, please check the collection to see if there are actually any documents missing of the expected 200, and if so, which one(s).

If you do get failures, I'd also like to know if they continue to occur as you bump up the timeout value for the await to higher and higher values.

Regards,
Jeff

Comment by Peter Hendriks [ 10/Nov/15 ]

Thanks mgardziejewski for the example code demonstrating the problem.

Generated at Thu Feb 08 08:56:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.