[JAVA-771] Repeated CursorNotFound exceptions on long-running process following oplog Created: 25/Feb/13 Updated: 12/Mar/13 Resolved: 12/Mar/13 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | None |
| Affects Version/s: | 2.10.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Cardon | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
RHEL 6.x86_64 |
||
| Description |
|
The custom oplog follower we're using creates a tailable cursor as follows: mongo.getDB("foo").getCollection("bar").find(criteria).limit(0).skip(0) After a couple million writes to the database, the cursor throws an exception like this: com.mongodb.MongoException$CursorNotFound: cursor 0 not found on server localhost/127.0.0.1:10120 The process waits a second and then attempts to create a new cursor, using the same method. This succeeds, but the same exception is thrown 5-10 seconds later. Once the initial exception is thrown, subsequent exceptions occur very frequently (the initial exception may take a long time to occur after the process starts). This exception occurs so frequently, that our process falls far behind in processing oplog entries. |
| Comments |
| Comment by Jeffrey Yemin [ 12/Mar/13 ] | |||||||||||||||||||||||||
|
David, Thanks for getting back to us. I'm going to close this and take a TODO on updating documentation. | |||||||||||||||||||||||||
| Comment by David Cardon [ 12/Mar/13 ] | |||||||||||||||||||||||||
|
Yes, thank you. Adjusting the query (eliminating the $and) and using the QUERYOPTION_OPLOGREPLAY option resolves the issue. For future developers, the documentation should probably indicate that this option is required when implementing custom oplog followers (since it's not immediately obvious that it should be). Thanks again for your help, --Dave | |||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 09/Mar/13 ] | |||||||||||||||||||||||||
|
Any updates? | |||||||||||||||||||||||||
| Comment by David Cardon [ 26/Feb/13 ] | |||||||||||||||||||||||||
|
Sorry, I should have clarified: I had made all of the changes to the code you mentioned, with the exception of removing the sleep. I originally had the sleep in there since I thought it might help with recovery from any issues with the cursor (connectivity, etc). I have now removed it. I will adjust the query and remove the $and operator. I think it may be a vestige from another query process we used in the past. I have not yet encountered the problem with these new code changes, though it is hard to say if it is resolved yet. After hammering it for a couple of days, I can tell you whether this fixes the problem or not. (i.e., time will increase my confidence in the process, since usually this would happen after between 1-2 days of running the oplog follower). | |||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 26/Feb/13 ] | |||||||||||||||||||||||||
|
You don't need to use $and for this type of query. Query criteria are implicitly and-ed together. Did these changes make a difference, or are you still experiencing the problem? Why do you feel that you need to sleep, especially if the oplog tailer is falling behind? Regards, | |||||||||||||||||||||||||
| Comment by David Cardon [ 26/Feb/13 ] | |||||||||||||||||||||||||
|
OK. I've already done everything but remove the sleep from the cursor loop. Incidentally, I was constructing my criteria using an $and operator, like this: criteria = {$and: [ This throws an exception when the QUERYOPTION_OPLOGREPLAY option is selected. I modified the criteria to this: | |||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 26/Feb/13 ] | |||||||||||||||||||||||||
|
Can you try to use code that looks more like this:
A few notes:
| |||||||||||||||||||||||||
| Comment by David Cardon [ 26/Feb/13 ] | |||||||||||||||||||||||||
|
That's right, I'm querying the oplog, so the collection lookup actually looks like this: mongo.getDB("local").getCollection("oplog.$main") From my observation a large percentage (> 90%) of the oplog traffic satisfies the criteria, so it is not very selective. Essentially, our oplog follower is only looking for traffic on particular collections, so the criteria boils down to a regex which identifies the collections of interest for the process. | |||||||||||||||||||||||||
| Comment by Jeffrey Yemin [ 26/Feb/13 ] | |||||||||||||||||||||||||
|
Thanks for the report. A few questions: What are the criteria you're searching on? How selective is it? Also, I assume you're actually querying the oplog, and not foo.bar, right? |