[SERVER-34053] Cursor not found error when running long query on secondary with noCursorTimeout Created: 21/Mar/18  Updated: 25/Jun/21  Resolved: 17/May/18

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.6.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Emily Stolfo Assignee: Misha Tyulenev
Resolution: Duplicate Votes: 3
Labels: SWNA
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-36332 CursorNotFound error in GetMore on a ... Closed
Duplicate
duplicates SERVER-34810 Session cache refresh can erroneously... Closed
Related
is related to RUBY-1312 noCursorTimeout regression from 2.4.3... Closed
is related to RUBY-1318 No Cursor Found error using noCursorT... Closed
Operating System: ALL
Sprint: Sharding 2018-05-21
Participants:

 Description   

I’ve had two Ruby tickets reporting that a user gets a cursor id not found error with the latest version of the server and the Ruby driver. I’ve been able to reproduce it (see below for code and logs).

Here are the details:

  • The query is done with read preference secondaryPreferred
  • The query has noCursorTimeout set to true
  • The query uses an implicit session
  • The sequence of operations is: find (success), getmore (success), getmore (failure/no cursor id found)
  • As the user iterates over the query results, there is sometimes a few minutes between getMore requests. This value has been 7 minutes and 10 minutes.

I confirmed that:

  • The driver is indeed setting noCursorTimeout on the query
  • The same lsid is being used for the original find and following getMores
  • Neither the cursor nor the session are being prematurely ended before the getMore failure. (no endSessions commands or killCursors commands found in the logs)

I can provide the user's primary and secondary logs, as well as application logs.

Here are the Ruby tickets: RUBY-1312, RUBY-1318

I was able to reproduce this with the following Ruby code:
https://gist.github.com/estolfo/f6a46b68fdf3db9be102ba2236e7abd5

This is the driver log:
https://gist.github.com/estolfo/1bc93c544e25b71678d9389a306f5292



 Comments   
Comment by Sahi kakkar [ 25/Jun/21 ]

I am  using 2.11.1 version and  getting same issue MongoDB.Driver.MongoCursorNotFoundException: Cursor 2433389218078643918 not found on server {Primary} using connection 78488. I am running a aggregation query to get data in consumer channel and processing async.  based on my analysis i found below could be the issue.

but not sure.

  1. session time out (I tried creating a session and refreshing after five mins by doing cursor.listdatabasenames)
  2. cursor timeout (there is no option to set no cursor timeout  (idle time out) for aggregation query. i think need this because processing take substantial amout of time.

I would really appreciate help on this.

Comment by Avner Cohen [ 27/Oct/19 ]

In this doc:

https://docs.mongodb.com/manual/core/read-preference-mechanics/#read-preference-for-replica-sets

I see: 

The read preference is re-evaluated for each operation.

I am not sure now, how can a server side cursor work with this limitation.

 

Would appreciate help on that.

Comment by Avner Cohen [ 23/Oct/19 ]

Still seeing it, even with latest 4.2.

After some investigation, and crossing data with other cases on the www (excluding all the cases of real 10 minutes timeouts)I suspect the issue is related to a query with secondaryPreferred.

Since cursor is server side, I am guessing that in some scenario, the query is hitting a different secondary. 

Is that in any way possible ?

Comment by Dan Dascalescu [ 08/Apr/19 ]

I'm still seeing this issue on mLab, with MongoDB server version: 3.6.8. I can't post the query here publicly, but I'm happy to grant MongoDB/mLab permission to look at our query and data if that helps. Or perhaps we need to upgrade the server version?

E QUERY    [js] Error: getMore command failed: {E QUERY    [js] Error: getMore command failed: { "ok" : 0, "errmsg" : "cursor id 97741904259 not found", "code" : 43, "codeName" : "CursorNotFound", "operationTime" : Timestamp(1554677331, 5), "$clusterTime" : { "clusterTime" : Timestamp(1554677331, 5), "signature" :

{ "hash" : BinData(0,"erU/NyFSNqzDCBeNyxhwSRjWXcQ="), "keyId" : NumberLong("6619795293317627907") }

}} :_getErrorWithCode@src/mongo/shell/utils.js:25:13DBCommandCursor.prototype._runGetMoreCommand@src/mongo/shell/query.js:815:1DBCommandCursor.prototype._hasNextUsingCommands@src/mongo/shell/query.js:846:9DBCommandCursor.prototype.hasNext@src/mongo/shell/query.js:854:1DBQuery.prototype.hasNext@src/mongo/shell/query.js:294:13DBQuery.prototype.forEach@src/mongo/shell/query.js:500:12@(shell):1:1

Comment by Misha Tyulenev [ 10/Sep/18 ]

jussi@vainu.io there is another issue that has similar symptoms tracked in SERVER-36986. This backport is not completed yet for 4.0 so this may be an issue you experience.

Comment by Jussi Kauppila [ 10/Sep/18 ]

Hi, 

We are using 4.0.1 in Atlas and are still experiencing this all the time. Longops with no_cursor_timeout set to true raise CursorNotFound error when ran with Secondary Read Preference. With Primary everything works as expected.

Patch that fixes https://jira.mongodb.org/browse/SERVER-34810 should be included in the 4.0.1. Maybe there is still something going on. I can look in to this deeper at some point but currently the fix for us is to run long queries from Primary.

We are using the latest pymongo 3.7.1 to connect. I think this should be reopened.

 

Comment by Misha Tyulenev [ 02/May/18 ]

Thanks david.storch! Its hard to say if its a dup without looking at the reproducible test case. The major difference is that SERVER-34810 is an intermittent issue while this one seems to be reproducible. Ill look more into it and get back, it might be related to LogicalSessionCache refresh logic.

Comment by David Storch [ 02/May/18 ]

misha.tyulenev, I found SERVER-34810 by diagnosing a build failure. Is there any chance that this ticket has the same root cause?

Comment by Misha Tyulenev [ 01/May/18 ]

emily.stolfo could you please clarify why the find command does not have clusterTime in it? is this the first command? https://gist.github.com/estolfo/1bc93c544e25b71678d9389a306f5292#file-no_cursor_found_driver_logs-rb-L98 
The command that succeeds https://gist.github.com/estolfo/1bc93c544e25b71678d9389a306f5292#file-no_cursor_found_driver_logs-rb-L112 has the cluster time. Not that it should batter but it shows that the tests were not done identically.
Thanks!

Comment by Misha Tyulenev [ 13/Apr/18 ]

Hi pavel.duchovny could you please include server logs? Thanks!

Comment by Charlie Swanson [ 23/Mar/18 ]

Moving this over to the Sharding backlog because we suspect the issue is caused by a session timeout. It's unclear to the query team if it's a driver issue or a server issue.

Comment by Emily Stolfo [ 22/Mar/18 ]

Hi david.storch

I patched the driver to not associate a session with the find and getmore operations and did not get the error with read preference secondaryPreferred. I've added the test log in the gist.

Comment by David Storch [ 21/Mar/18 ]

Thanks, Emily. I currently suspect that this is related to the session getting killed. Is it possible for you to run the repro in such a way that the cursor does not belong to a logical session? If the problem no longer repros, that would help bolster my case.

Comment by Emily Stolfo [ 21/Mar/18 ]

I am testing against version 3.6.3. I've just tested with read preference primary and I did not get the cursor not found error. I've added the driver logs to the bottom of the gist.

Comment by David Storch [ 21/Mar/18 ]

emily.stolfo, against which versions of the server can you reproduce the problem? Does your Ruby script continue to reproduce the problem when the cursor is open on the primary instead of a secondary?

Comment by Alberto Ornaghi [ 21/Mar/18 ]

Hi,

i'm the reporter of RUBY-1312, i can also add that the same issue manifested on a standalone machine (diffrent schema, same mongodb and mongo-driver versions), so it's not strictly related to the secondarPreferred option.

Generated at Thu Feb 08 04:35:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.