[JAVA-771] Repeated CursorNotFound exceptions on long-running process following oplog Created: 25/Feb/13  Updated: 12/Mar/13  Resolved: 12/Mar/13

Status: Closed
Project: Java Driver
Component/s: None
Affects Version/s: 2.10.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Cardon Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

RHEL 6.x86_64



 Description   

The custom oplog follower we're using creates a tailable cursor as follows:

mongo.getDB("foo").getCollection("bar").find(criteria).limit(0).skip(0)
.hint(new BasicDBObject("$natural", 1))
.addOption(Bytes.QUERYOPTION_TAILABLE)
.addOption(Bytes.QUERYOPTION_AWAITDATA)
.addOption(Bytes.QUERYOPTION_NOTIMEOUT);

After a couple million writes to the database, the cursor throws an exception like this:

com.mongodb.MongoException$CursorNotFound: cursor 0 not found on server localhost/127.0.0.1:10120
at com.mongodb.DBApiLayer$Result.init(DBApiLayer.java:394)
at com.mongodb.DBApiLayer$Result._advance(DBApiLayer.java:454)
at com.mongodb.DBApiLayer$Result.hasNext(DBApiLayer.java:421)
at com.mongodb.DBCursor._hasNext(DBCursor.java:464)
at com.mongodb.DBCursor.hasNext(DBCursor.java:484)

The process waits a second and then attempts to create a new cursor using the same method. This succeeds, but the same exception is thrown 5-10 seconds later. Once the initial exception is thrown, subsequent exceptions occur very frequently (the initial exception may take a long time to appear after the process starts). The exceptions occur so frequently that our process falls far behind in processing oplog entries.
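
The loop that drives this looks roughly like the following sketch (createTailableCursor() and processEntry() are placeholders for our actual code; the former is the find(...) chain shown above):

private void followOplog() throws InterruptedException {
    while (true) {
        DBCursor cursor = createTailableCursor(); // the find(...) chain shown above
        try {
            while (cursor.hasNext()) {
                processEntry(cursor.next());
            }
        } catch (MongoException.CursorNotFound e) {
            // wait a second, then fall through and create a new cursor the same way
            Thread.sleep(1000);
        } finally {
            cursor.close();
        }
    }
}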



 Comments   
Comment by Jeffrey Yemin [ 12/Mar/13 ]

David,

Thanks for getting back to us. I'm going to close this and take a TODO on updating documentation.

Comment by David Cardon [ 12/Mar/13 ]

Yes, thank you. Adjusting the query (eliminating the $and) and using the QUERYOPTION_OPLOGREPLAY option resolves the issue. For future developers, the documentation should probably indicate that this option is required when implementing custom oplog followers (since it's not immediately obvious that it should be).
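
For reference, the adjusted cursor creation now looks roughly like this (a sketch; lastSeenTimestamp stands in for our resume point):

// plain top-level criteria (implicitly and-ed, no $and) plus the OPLOGREPLAY option
DBObject criteria = new BasicDBObject("ts", new BasicDBObject("$gt", lastSeenTimestamp))
        .append("ns", new BasicDBObject("$not", Pattern.compile("(dontfollow|nofollow)")))
        .append("fromMigrate", new BasicDBObject("$ne", true));

DBCursor cursor = mongo.getDB("local").getCollection("oplog.$main").find(criteria)
        .addOption(Bytes.QUERYOPTION_TAILABLE)
        .addOption(Bytes.QUERYOPTION_AWAITDATA)
        .addOption(Bytes.QUERYOPTION_NOTIMEOUT)
        .addOption(Bytes.QUERYOPTION_OPLOGREPLAY);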

Thanks again for your help,

--Dave

Comment by Jeffrey Yemin [ 09/Mar/13 ]

Any updates?

Comment by David Cardon [ 26/Feb/13 ]

Sorry, I should have clarified: I had made all of the changes to the code you mentioned, with the exception of removing the sleep. I originally had the sleep in there since I thought it might help with recovery from any issues with the cursor (connectivity, etc). I have now removed it.

I will adjust the query and remove the $and operator. I think it may be a vestige from another query process we used in the past.

I have not yet encountered the problem with these new code changes, though it is too early to say whether it is resolved. After hammering it for a couple of days I can tell you whether this fixes the problem or not (time will increase my confidence in the process, since the problem usually appeared only after 1-2 days of running the oplog follower).

Comment by Jeffrey Yemin [ 26/Feb/13 ]

You don't need to use $and for this type of query. Query criteria are implicitly and-ed together.
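For example, {$and: [{ts: {$gt: ...}}, {fromMigrate: {$ne: true}}]} matches the same documents as {ts: {$gt: ...}, fromMigrate: {$ne: true}}.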

Did these changes make a difference, or are you still experiencing the problem? Why do you feel that you need to sleep, especially if the oplog tailer is falling behind?

Regards,
Jeff

Comment by David Cardon [ 26/Feb/13 ]

OK. I've already done everything but remove the sleep from the cursor loop.

Incidentally, I was constructing my criteria using an $and operator, like this:

criteria = {$and: [
{ts: {$gt: Timestamp(...,...)}},
{ns: {$not: /(dontfollow|nofollow)/}},
{fromMigrate: {$ne:true}}]
}

This throws an exception when the QUERYOPTION_OPLOGREPLAY option is set. I modified the criteria to this:

criteria = {$and: [
{ns: {$not: /(dontfollow|nofollow)/}},
{fromMigrate: {$ne:true}}],
ts: {$gt: Timestamp(...,...)}
}

Comment by Jeffrey Yemin [ 26/Feb/13 ]

Can you try to use code that looks more like this:

    private static void tailOplog(final DBCollection oplog) throws InterruptedException {
        // start at now
        BSONTimestamp timestamp = new BSONTimestamp((int) (new Date().getTime() / 1000), 0);
        int numRead = 0;
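        // loop forever: whenever the cursor dies, recreate it starting from the last ts seen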
        for (; ; ) {
            DBCursor cursor = oplog.find(new BasicDBObject("ts", new BasicDBObject("$gt", timestamp)))
                    .addOption(Bytes.QUERYOPTION_TAILABLE)
                    .addOption(Bytes.QUERYOPTION_AWAITDATA)
                    .addOption(Bytes.QUERYOPTION_NOTIMEOUT)
                    .addOption(Bytes.QUERYOPTION_OPLOGREPLAY);
            try {
                while (cursor.hasNext()) {
                    DBObject next = cursor.next();
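                    // remember the last timestamp so a new cursor resumes from this point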
                    timestamp = (BSONTimestamp) next.get("ts");
                    if (++numRead % 10000 == 0) {
                        System.out.println("R: " + numRead);
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                cursor.close();
            }
        }
    }

A few notes:

  • I use a query criterion on "ts" so that the cursor doesn't read back in time
  • I use Bytes.QUERYOPTION_OPLOGREPLAY to make the query on "ts" efficient even though there is no index on "ts"
  • I don't think the sleep is necessary, so I don't include it

Comment by David Cardon [ 26/Feb/13 ]

That's right, I'm querying the oplog, so the collection lookup actually looks like this: mongo.getDB("local").getCollection("oplog.$main")
(In my first submission, I had replaced local method variables with 'foo.bar', but 'local.oplog.$main' is actually what gets passed down).

From my observation, a large percentage (> 90%) of the oplog traffic satisfies the criteria, so it is not very selective. Essentially, our oplog follower is only looking for traffic on particular collections, so the criteria boil down to a regex that identifies the collections of interest for the process.

Comment by Jeffrey Yemin [ 26/Feb/13 ]

Thanks for the report. A few questions:

What are the criteria you're searching on? How selective are they? Also, I assume you're actually querying the oplog, and not foo.bar, right?
