[SERVER-36821] Cursor id not found even though it did not time out Created: 23/Aug/18  Updated: 23/Aug/18  Resolved: 23/Aug/18

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.6.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Youri Tolstoy Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-36808 Server closes cursors that are still ... Closed
Operating System: ALL
Steps To Reproduce:

I'm putting together a repo to reproduce in a systematic way but so far I'm only able to have the error happening in the main repo. I'll update as soon I've a reproducible way.

If that can help:

  • The app is connected to two databases
  • Each connection has a pool of 7
  • Setting higher log levels did not show any error on the server side
  • For the test 20 cursors are instantiated each with the same query
  • When one fails all the others start to fail too
  • It happens very randomly sometimes you've to wait couple of seconds for the error to happen sometimes it's almost immediate
  • Server is start with --noJournal, running as a container
Participants:

 Description   

Getting cursor id not found errors from the server even though multiple getMore where emitted soon before. The cursor is not up since more than 10mi and anyhow if previous getMore worked within less than 10min this shouldn't happen.

Cursor is started and the error occurs at 2018-08-23T11:05:28.528Z

Many getMore are occuring the last 5 ones being respectively at:

  • [2018-08-23T11:05:25.467Z] schedule getMore call for query {"bson":{},"ns":"fdp-collector.$cmd","query":{"find":"items","filter":
    Unknown macro: {"contents.iid"}

    ,"projection":{},"batchSize":100,"returnKey":false,"showRecordId":false,"lsid":{"id":"TIRV4MZNTSON2fMvPNiSpw=="}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":154,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":false,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"} cursorId: 19000582380

  • [2018-08-23T11:05:26.378Z] schedule getMore call for query {"bson":{},"ns":"fdp-collector.$cmd","query":{"find":"items","filter":
    Unknown macro: {"contents.iid"}

    ,"projection":{},"batchSize":100,"returnKey":false,"showRecordId":false,"lsid":{"id":"TIRV4MZNTSON2fMvPNiSpw=="}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":154,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":false,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}cursorId: 19000582380

  • [2018-08-23T11:05:26.434Z] schedule getMore call for query {"bson":{},"ns":"fdp-collector.$cmd","query":{"find":"items","filter":
    Unknown macro: {"contents.iid"}

    ,"projection":{},"batchSize":100,"returnKey":false,"showRecordId":false,"lsid":{"id":"TIRV4MZNTSON2fMvPNiSpw=="}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":154,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":false,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}cursorId: 19000582380

  • [2018-08-23T11:05:27.096Z] schedule getMore call for query {"bson":{},"ns":"fdp-collector.$cmd","query":{"find":"items","filter":
    Unknown macro: {"contents.iid"}

    ,"projection":{},"batchSize":100,"returnKey":false,"showRecordId":false,"lsid":{"id":"TIRV4MZNTSON2fMvPNiSpw=="}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":154,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":false,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}cursorId: 19000582380',
    '[2018-08-23T11:05:27.333Z] schedule getMore call for query {"bson":{},"ns":"fdp-collector.$cmd","query":{"find":"items","filter":

    Unknown macro: {"contents.iid"}

    ,"projection":{},"batchSize":100,"returnKey":false,"showRecordId":false,"lsid":{"id":"TIRV4MZNTSON2fMvPNiSpw=="}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":154,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":false,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}cursorId: 19000582380

  • [2018-08-23T11:05:27.914Z] schedule getMore call for query {"bson":{},"ns":"fdp-collector.$cmd","query":{"find":"items","filter":
    Unknown macro: {"contents.iid"}

    ,"projection":{},"batchSize":100,"returnKey":false,"showRecordId":false,"lsid":{"id":"TIRV4MZNTSON2fMvPNiSpw=="}},"options":{"numberToSkip":0,"numberToReturn":1,"checkKeys":false,"returnFieldSelector":null,"serializeFunctions":false,"ignoreUndefined":false},"numberToSkip":0,"numberToReturn":1,"returnFieldSelector":null,"requestId":154,"serializeFunctions":false,"ignoreUndefined":false,"maxBsonSize":16777216,"checkKeys":false,"batchSize":1,"tailable":false,"slaveOk":false,"oplogReplay":false,"noCursorTimeout":false,"awaitData":false,"exhaust":false,"partial":false,"documentsReturnedIn":"firstBatch"}cursorId: 19000582380

As you can see the error happens less than a second after the last getMore



 Comments   
Comment by Kelsey Schubert [ 23/Aug/18 ]

Thanks for the report and additional information, we'll continue the investigation in SERVER-36808.

Kind regards,
Kelsey

Comment by Youri Tolstoy [ 23/Aug/18 ]

So this can be closed in favor of SERVER-36808

I've put together a little repo that enables to reproduce in a consistent way at: https://github.com/YouriT/cursor-not-found

Comment by Youri Tolstoy [ 23/Aug/18 ]

I can't edit the issue but found out in the logs that just before the last getMore fails I've the following:

{"log":"2018-08-23T12:05:27.377+0000 I COMMAND  [thread145] command config.$cmd command: createIndexes { createIndexes: \"system.sessions\", indexes: [ { key: { lastUse: 1 }, name: \"lsidTTLIndex\", expireAfterSeconds: 1800 } ], $db: \"config\" } numYields:0 reslen:134 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_msg 0ms\n","stream":"stdout","time":"2018-08-23T12:05:27.377137596Z"}
{"log":"2018-08-23T12:05:27.377+0000 I WRITE    [thread145] update config.system.sessions command: { q: { _id: { id: UUID(\"3d14c8be-a204-4008-bc7b-72b00dd45334\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) } }, u: { $currentDate: { lastUse: true } }, multi: false, upsert: true } planSummary: IDHACK keysExamined:0 docsExamined:0 nMatched:0 nModified:0 upsert:1 keysInserted:2 numYields:0 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 1, W: 1 } }, Collection: { acquireCount: { w: 1 } } } 0ms\n","stream":"stdout","time":"2018-08-23T12:05:27.377250672Z"}
{"log":"2018-08-23T12:05:27.377+0000 I COMMAND  [thread145] command config.$cmd command: update { update: \"system.sessions\", ordered: false, allowImplicitCollectionCreation: false, $db: \"config\" } numYields:0 reslen:155 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 1, W: 1 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_msg 0ms\n","stream":"stdout","time":"2018-08-23T12:05:27.377260862Z"}
 
... plenty of killed sessions AND the one my cursor is using
 
{"log":"2018-08-23T12:05:27.377+0000 I WRITE    [thread145] remove config.system.sessions command: { q: { _id: { id: UUID(\"bcd6b15b-ee92-4d17-939e-49b09c9bb036\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) } }, limit: 0 } planSummary: IDHACK keysExamined:0 docsExamined:0 ndeleted:0 numYields:0 locks:{ Global: { acquireCount: { r: 12, w: 12 } },
 
...
 
{"log":"2018-08-23T12:05:27.377+0000 I COMMAND  [thread145] command config.$cmd command: delete { delete: \"system.sessions\", ordered: false, $db: \"config\" } numYields:0 reslen:29 locks:{ Global: { acquireCount: { r: 24, w: 24 } }, Database: { acquireCount: { w: 23, W: 1 } }, Collection: { acquireCount: { w: 23 } } } protocol:op_msg 0ms\n","stream":"stdout","time":"2018-08-23T12:05:27.377654767Z"}
{"log":"2018-08-23T12:05:27.377+0000 I COMMAND  [thread145] command config.system.sessions command: find { find: \"system.sessions\", filter: { _id: { $in: [ { id: UUID(\"bcd6b15b-ee92-4d17-939e-49b09c9bb036\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, { id: UUID(\"c94da6e9-7487-4c2e-bcbb-c983062218ae\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) } ] } }, projection: { _id: 1 }, batchSize: 2, singleBatch: true, limit: 2, $db: \"config\" } planSummary: IXSCAN { _id: 1 } keysExamined:2 docsExamined:0 cursorExhausted:1 numYields:0 nreturned:1 reslen:180 locks:{ Global: { acquireCount: { r: 54, w: 24 } }, Database: { acquireCount: { r: 15, w: 23, W: 1 } }, Collection: { acquireCount: { r: 15, w: 23 } } } protocol:op_msg 0ms\n","stream":"stdout","time":"2018-08-23T12:05:27.377893554Z"}
{"log":"2018-08-23T12:05:27.414+0000 I COMMAND  [conn141] command fdp-collector.items command: getMore { getMore: 18007050333, collection: \"items\", batchSize: 100, lsid: { id: UUID(\"bcd6b15b-ee92-4d17-939e-49b09c9bb036\") }, $readPreference: { mode: \"secondaryPreferred\" }, $db: \"fdp-collector\" } cursorid:18007050333 numYields:0 reslen:120 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_query 0ms\n","stream":"stdout","time":"2018-08-23T12:05:27.415045353Z"}

If I understand well the session is being removed which might make the next getMore fail?

This remembers me SERVER-34810 but it was fixed in 3.6.6 so either I get it wrong either something wrong is still going on.

Edit: issuing refreshLogicalSessionCacheNow command, makes the error to raise so it seems very linked to 34810.

Generated at Thu Feb 08 04:44:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.