[JAVA-2445] NullPointerException in AsyncQueryBatchCursor Created: 10/Feb/17 Updated: 27/Jul/17 Resolved: 01/Mar/17 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Async |
| Affects Version/s: | 3.4.2 |
| Fix Version/s: | 3.5.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Stefan Wachter | Assignee: | Ross Lawley |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
CloudFoundry, ElasticRuntime |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
The stacktrace shown below happens on client side after switching to MongoDB 3.4 (server side). After the exception, MongoDB access is stalled in our application. Shortly before the exception occured the following log entry is output:
|
| Comments |
| Comment by Ross Lawley [ 01/Mar/17 ] | |||||||||||||||||||||||||||||||||||
|
Fix now in master. | |||||||||||||||||||||||||||||||||||
| Comment by Alireza Mohamadi [X] [ 20/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
Hi. I have the same problem. This is stacktrace:
and I also use AsyncCursor.close() inside my code. | |||||||||||||||||||||||||||||||||||
| Comment by Stefan Wachter [ 15/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
Thanks for looking into that! We are looking forward to the 3.5. driver - hopefully coming soon. | |||||||||||||||||||||||||||||||||||
| Comment by Ross Lawley [ 14/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
Many thanks swachter, It looks like there is a race condition as you mentioned where the cursor is closed via the unsubscribe method in the reactive streams driver while there is a GetMore call is inflight. We're looking at a fix to handle this scenario correctly. Ross | |||||||||||||||||||||||||||||||||||
| Comment by Ross Lawley [ 14/Feb/17 ] | |||||||||||||||||||||||||||||||||||
| Comment by Stefan Wachter [ 14/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
We do not call AsyncQueryBatchCursor.close() ourselves. All database access uses the Publisher interface provided by the reactive streams driver. (One more hint: the "limit" is 0, i.e. limitReached() will never be true and killCursor() gets only called when AsyncQueryBatchCursor.close() is called.) Guess: In our application we scan through the stream of results returned by a FindPublisher. If we have enough information we cancel the subscription. Maybe the reactive streams driver closes the cursor when it receives a cancel signal. Later the AsyncQueryBatchCursor receives more results and hits the NPE. | |||||||||||||||||||||||||||||||||||
| Comment by Ross Lawley [ 13/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
Many thanks swachter, the fact that both connectionSource and the cursor are null is insightful. Do you call `AsyncQueryBatchCursor#close` in your application? Another area I shall explore is the different code paths against MongoDB 3.4 compared to MongoDB 3.0 when using find (MongoDB 3.4 has a specific find command). Ross | |||||||||||||||||||||||||||||||||||
| Comment by Stefan Wachter [ 13/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
Unfortunately, I have no (simple) repeatable test case. I set a conditional breakpoint at AsyncQueryBatchCursor line 254 and found that the connectionSource member is null (and the "cursor" member too). | |||||||||||||||||||||||||||||||||||
| Comment by Ross Lawley [ 13/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
Thanks for the feedback swachter,
Interesting, do you have a repeatable test case or scenario? It will really help with the tracking down and debugging the cause of the issue. Ross | |||||||||||||||||||||||||||||||||||
| Comment by Stefan Wachter [ 13/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
Note: The NPE happens also in a simple setup where only a single MongoDB instance is running. | |||||||||||||||||||||||||||||||||||
| Comment by Stefan Wachter [ 13/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
The server logs do not show anything particular with regard to server address issues. However, it shows that now and then new connections are opened. I switched on trace logging for "org.mongodb.driver.connetion". I could tracked down the last usage of a connection that was not returned to the connection pool and found that the reported NPE causes that a connection is not returned to the pool. In particular the method handleGetMoreQueryResults() is not called where the connection would be released. | |||||||||||||||||||||||||||||||||||
| Comment by Stefan Wachter [ 10/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
On the client side we used the mongodb-driver-reactivestreams 1.2.0 on top of mongodb-driver-async 3.3.0 together with MongoDB server 3.0. After the Server was updated we experience the failures and updated the client libraries to 1.3.0 and 3.4.2, respectively. Regarding the "stallment" our application log shows the following failure:
I did not expect that this can happen when using the mongodb-driver-async. Is there a way to monitor the connection pool? | |||||||||||||||||||||||||||||||||||
| Comment by Ross Lawley [ 10/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
Thanks swachter, Also can you confirm what version of MongoDB and Java driver you were running previously and didn't encounter this error. Ross | |||||||||||||||||||||||||||||||||||
| Comment by Stefan Wachter [ 10/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
Additional information: we use the mongodb-driver-reactivestreams 1.3.0 on top of mongodb-driver-async 3.4.2. In the debug output of the MongoDB Java Driver I find nothing unusual. There are continuously asynchronous commands started (updates, finds, getMore, inserts) and these command executions complete. At the same time every 8 seconds all three nodes of the MongoDB cluster are checked and the cluster description is updated. The cluster description does not change during the time period where the exception occurs. The cluster description look like this:
The application gets "stalled" in the following way: We use the org.reactivestreams.Publisher interface provided by the reactive streams MongoDB driver to execute our commands. We track the currently "running" commands by incrementing a counter whenever we receive a subscription from a Mongo publisher and we decrement that counter whenever "onError" or "onComplete" is called on the subscriber or when we "cancel" the subscription. The number of concurrent command executions increases steadily to 35 after the mentioned exception occurred (see attached screenshot). (NOTE: We found that the mentioned exception is also raised without the system getting stalled afterwards. Maybe it was just be incident this time.) PS: We asked our server admins for the server logs. This may take a while. | |||||||||||||||||||||||||||||||||||
| Comment by Ross Lawley [ 10/Feb/17 ] | |||||||||||||||||||||||||||||||||||
|
Hi swachter, Thanks for the ticket, I can see the error handling callback was triggered due to an error in a nested callback. From the stacktrace I can see that getting the address from the server failed and caused a NPE: connectionSource.getServerDescription().getAddress(). This is unusual, are there any logs from the driver or the server indicating any errors?
Could you elaborate? I would like to understand more to determine the cause of the error handling callback being triggered. Many thanks, Ross |