[SERVER-24899] "Cursor not found" even with no-timeout Created: 05/Jul/16  Updated: 31/May/19  Resolved: 05/Jul/16

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.2.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ben Lubar Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

(sorry, the database is large and contains information I can't share, but I can help debug locally)

Participants:

 Description   

https://gist.github.com/BenLubar/07335d7e6b4ab939c02517d59b3f8b80

I'm getting Cursor not found, cursor id: 28781399268 on line 444 even though I set no-timeout on line 36.

The query is only returning about 96k of the 141k documents it should return, so it's not closing the cursor due to it reaching the end.



 Comments   
Comment by Matt Broadstone [ 31/May/19 ]

Hey dandv,

Is the behavior of the cursor timeout documented in more detail somewhere?

Unfortunately the best docs I can find quickly are for the noCursorTimeout property documented in the find command.

I'll try to bulk answer the rest of these questions with the following: A "cursor" only truly exists on the server side. On the client side, this involve sending a command which returns a non-zero "cursor id", which the client may use to issue subsequent getMore commands. Each time a server receives a getMore from a client for a particular id it will reset the timer on that cursor. What I suspect happened in your case here is that you are taking more than 10min to do something between calls to next (e.g. getMore), so the server is closing your cursor.

Comment by Dan Dascalescu [ 30/May/19 ]

Thanks for the comprehensive answer, Matt! Is the behavior of the cursor timeout documented in more detail somewhere?

there's a possibility that you get stuck doing one of these operations long enough that the next time your while loop is iterated the cursor has been closed remotely

Does the above mean that one particular iteration of the while loop would have to take over 10 minutes? Or that more than 10 minutes need to pass since the cursor would be initialized? In order words,

  • does calling next() reset the cursor timeout?
  • or is the cursor timeout reset only when the driver fetches new data from the server, i.e. after all documents in the batch have been requested by the client?
Comment by Matt Broadstone [ 29/May/19 ]

Hey dandv,

It sounds to me like you are indeed running into the default 10min cursor timeout on the server.

This is all very dependent mostly on two factors: how big the documents are in your batches, and how much time you are spending processing between each call to next. As you increase your batch size, so will you increase the time to receive those batches as you will be bottlenecked by I/O here up to 16MB. Secondly, in your code above I can see sections like Process users as well as a bulk operation. I can't speak for whatever happens in the processing step (you will have to benchmark this), but you are also creating a bulk operation and awaiting the result of that too. Since you are doing all of this serially, there's a possibility that you get stuck doing one of these operations long enough that the next time your while loop is iterated the cursor has been closed remotely. 

There are a number of ways you could modify your code to get around this:

  • Don't use async/await
    As explained above, using async/await will execute these async operations serially by waiting for the result of the promise before moving forward. If you used promises or callbacks directly, you could do this work more asynchronously and would likely not run into this issue
  • Disable cursor timeout when calling find (by specifying timeout: false)
    This will disable the cursor time out on the server for this cursor. It seems like this is a pretty known workload, so its safe to disable the timeout (too many cursors with no timeout could run the risk of overloading the server though)
  • Modify the cursor timeout on the server
    We allow you to configure the default cursor timeout on the server, learn more about that here
  • Use streams
    Streams are somewhat well suited to this consumer/producer task. Essentially you would pipe the cursor into some processing stream, which would eventually create and execute bulk operations. This would ensure that you are receiving as many documents as the local server can handle at any time, and buffering up bulk operations to be executed in order. The objective benefit gained with this approach is the same as using promises/callbacks, as you would not be serially iterating through incoming documents

Hope that helps!
 

 

Comment by Dan Dascalescu [ 28/May/19 ]

I'm seeing the same problem (MongoError: cursor id 40280262609 not found) with the v3.2.5 NodeJS driver against a standalone MongoDB 4.0.9 instance, when a cursor.next() loop runs for 16 minutes (there's also a 6-minute timeout in the middle of the run. Maybe the 10-minute is a giveaway?

Here's the code:

const cursor = await users.find(
  { },
  {
    projection: { ... },
  }
).batchSize(100 * 100);
 
let user = await cursor.next();
while (user) {
  const usersInDb = { };const { nModified } = 
  // Get up to PAGE_SIZE users (<PAGE_SIZE at the end of the cursor)
  for (let i = 0; i < PAGE_SIZE && user; i++, user = await cursor.next())
    usersInDb[user._id] = user;
 
  // Process users...
  // ...
  const bulk = users.initializeUnorderedBulkOp();
  // ...
  bulk.find({ _id: id }).updateOne(userUpdateForDb);
 
  // Update users
  try {
    await bulk.execute();
  } catch (error) {
    ...
  }
}

The collection has ~65,000 documents, and I've reduced the batch size to 10,000 per this SO question in order to keep the cursor alive.

Also posted on GitHub at https://github.com/mongodb-js/mongodb-core/issues/199.

Comment by Ben Lubar [ 06/Jul/16 ]

Pull request submitted: https://github.com/go-mgo/mgo/pull/295

Comment by Ben Lubar [ 06/Jul/16 ]

The Go version is sending

        "query" : {
                "find" : "objects",
                "filter" : {
                        "_key" : "users:joindate"
                },
                "sort" : {
                        "score" : 1,
                        "value" : 1
                },
                "projection" : {
                        "score" : 1,
                        "_id" : 0,
                        "value" : 1
                },
                "skip" : 0
        },

whereas the mongo shell is sending

        "query" : {
                "find" : "objects",
                "filter" : {
                        "_key" : "users:joindate"
                },
                "sort" : {
                        "score" : 1,
                        "value" : 1
                },
                "projection" : {
                        "_id" : 0,
                        "value" : 1,
                        "score" : 1
                },
                "noCursorTimeout" : true
        },

It looks like the field is set here: https://github.com/mongodb/mongo/blob/r3.2.7/src/mongo/shell/query.js#L225-L227

I think mongod is looking at the bson property instead of the wire format flag.

Comment by Ben Lubar [ 05/Jul/16 ]

I updated to v3.2.7 and it fails in the same place with Cursor not found, cursor id: 28436833266 right after user 95923, just like before.

> db.serverBuildInfo()
{
        "version" : "3.2.7",
        "gitVersion" : "4249c1d2b5999ebbf1fdf3bc0e0e3b3ff5c0aaf2",
        "modules" : [ ],
        "allocator" : "tcmalloc",
        "javascriptEngine" : "mozjs",
        "sysInfo" : "deprecated",
        "versionArray" : [
                3,
                2,
                7,
                0
        ],
        "openssl" : {
                "running" : "OpenSSL 1.0.1e 11 Feb 2013",
                "compiled" : "OpenSSL 1.0.1e 11 Feb 2013"
        },
        "buildEnvironment" : {
                "distmod" : "debian71",
                "distarch" : "x86_64",
                "cc" : "/opt/mongodbtoolchain/bin/gcc: gcc (GCC) 4.8.2",
                "ccflags" : "-fno-omit-frame-pointer -fPIC -fno-strict-aliasing -ggdb -pthread -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -Werror -O2 -Wno-unused-local-typedefs -Wno-unused-function -Wno-deprecated-declarations -Wno-unused-but-set-variable -Wno-missing-braces -fno-builtin-memcmp",
                "cxx" : "/opt/mongodbtoolchain/bin/g++: g++ (GCC) 4.8.2",
                "cxxflags" : "-Wnon-virtual-dtor -Woverloaded-virtual -Wno-maybe-uninitialized -std=c++11",
                "linkflags" : "-fPIC -pthread -Wl,-z,now -rdynamic -fuse-ld=gold -Wl,-z,noexecstack -Wl,--warn-execstack",
                "target_arch" : "x86_64",
                "target_os" : "linux"
        },
        "bits" : 64,
        "debug" : false,
        "maxBsonObjectSize" : 16777216,
        "storageEngines" : [
                "devnull",
                "ephemeralForTest",
                "mmapv1",
                "wiredTiger"
        ],
        "ok" : 1
}

The storage engine in use is mmapv1, if that makes a difference.

Comment by Ramon Fernandez Marina [ 05/Jul/16 ]

I wonder if you're running into SERVER-24058, which was fixed in 3.2.7. Would it be possible for you to test against 3.2.7?

Comment by Ben Lubar [ 05/Jul/16 ]

I modified my code to sleep for 10 minutes before handling the result of the first find query. It looks like the noTimeout flag isn't being set.

> db.serverStatus().metrics.cursor
{
        "timedOut" : NumberLong(0),
        "open" : {
                "noTimeout" : NumberLong(0),
                "pinned" : NumberLong(0),
                "total" : NumberLong(1)
        }
}

Test program:

package main
 
import (
        "log"
        "net"
        "os/exec"
 
        "gopkg.in/mgo.v2"
        "gopkg.in/mgo.v2/bson"
)
 
func main() {
        ip, err := exec.Command("docker", "inspect", "-f", "{{.NetworkSettings.Networks.wtdwtf.IPAddress}}", "wtdwtf-mongo").Output()
        if err != nil {
                log.Fatal(err)
        }
 
        info, err := mgo.ParseURL(string(ip[:len(ip)-1]))
        if err != nil {
                log.Fatal(err)
        }
 
        info.DialServer = func(addr *mgo.ServerAddr) (net.Conn, error) {
                c, err := net.DialTCP("tcp", nil, addr.TCPAddr())
                if err != nil {
                        return nil, err
                }
 
                return &conn{c}, nil
        }
 
        mongo, err := mgo.DialWithInfo(info)
        if err != nil {
                log.Fatal(err)
        }
        defer mongo.Close()
 
        mongo.SetCursorTimeout(0)
 
        log.Println("before Find")
 
        it := mongo.DB("0").C("objects").
                Find(bson.M{"_key": "users:joindate"}).
                Sort("score", "value").
                Select(bson.M{"_id": 0, "value": 1, "score": 1}).
                Iter()
 
        log.Println("before Next")
 
        var data bson.M
        it.Next(&data)
 
        log.Println("before Close")
        if err := it.Close(); err != nil {
                log.Fatal(err)
        }
 
        log.Println("Done")
}
 
type conn struct {
        *net.TCPConn
}
 
func (c *conn) Write(b []byte) (n int, err error) {
        n, err = c.TCPConn.Write(b)
        log.Printf("WRITE: %x", b[:n])
        return
}
 
func (c *conn) Read(b []byte) (n int, err error) {
        n, err = c.TCPConn.Read(b)
        log.Printf("READ: %x", b[:n])
        return
}

Output:

2016/07/05 17:18:21 WRITE: 3a0000000100000000000000d40700000000000061646d696e2e24636d640000000000ffffffff13000000106765746e6f6e6365000100000000
2016/07/05 17:18:21 READ: 510000006f00000001000000010000000800000000000000000000000000000001000000
2016/07/05 17:18:21 WRITE: 3a0000000300000000000000d40700000400000061646d696e2e24636d640000000000ffffffff130000001069736d6173746572000100000000
2016/07/05 17:18:21 READ: 2d000000
2016/07/05 17:18:21 READ: 026e6f6e636500110000003630333834393066303137643532313600016f6b00000000000000f03f00
2016/07/05 17:18:21 READ: c20000007000000003000000010000000800000000000000000000000000000001000000
2016/07/05 17:18:21 READ: 9e000000
2016/07/05 17:18:21 READ: 0869736d61737465720001106d617842736f6e4f626a65637453697a650000000001106d61784d65737361676553697a65427974657300006cdc02106d61785772697465426174636853697a6500e8030000096c6f63616c54696d6500ad0124bd55010000106d61785769726556657273696f6e0004000000106d696e5769726556657273696f6e0000000000016f6b00000000000000f03f00
2016/07/05 17:18:21 WRITE: 360000000500000000000000d40700000400000061646d696e2e24636d640000000000ffffffff0f0000001070696e67000100000000
2016/07/05 17:18:21 READ: 350000007100000005000000010000000800000000000000000000000000000001000000
2016/07/05 17:18:21 READ: 11000000
2016/07/05 17:18:21 READ: 016f6b00000000000000f03f00
2016/07/05 17:18:21 before Find
2016/07/05 17:18:21 WRITE: df0000000700000000000000d407000010000000302e24636d640000000000ffffffff980000000266696e6400080000006f626a65637473000366696c746572001e000000025f6b6579000f00000075736572733a6a6f696e64617465000003736f7274001b0000001073636f726500010000001076616c75650001000000000370726f6a656374696f6e0024000000105f696400000000001076616c756500010000001073636f726500010000000010736b6970000000000000240000001076616c756500010000001073636f72650001000000105f6964000000000000
2016/07/05 17:18:21 before Next
2016/07/05 17:18:22 READ: 710f00007200000007000000010000000800000000000000000000000000000001000000
2016/07/05 17:18:22 READ: 4d0f0000
2016/07/05 17:18:22 READ: 127761697465644d5300000000000000000003637572736f7200220f0000046669727374426174636800f30e0000033000210000000276616c7565000200000031000173636f7265000060848d3535754200033100210000000276616c7565000200000032000173636f72650000600ea14535754200033200210000000276616c7565000200000033000173636f726500001010a14535754200033300210000000276616c7565000200000034000173636f72650000c011a14535754200033400210000000276616c7565000200000035000173636f726500008013a14535754200033500210000000276616c7565000200000036000173636f72650000f014a14535754200033600210000000276616c7565000200000037000173636f726500000016a14535754200033700210000000276616c7565000200000038000173636f726500006017a14535754200033800210000000276616c7565000200000039000173636f72650000c019a14535754200033900220000000276616c756500030000003130000173636f72650000301ba1453575420003313000220000000276616c756500030000003131000173636f72650000001ca1453575420003313100220000000276616c756500030000003132000173636f72650000201da1453575420003313200220000000276616c756500030000003133000173636f72650000301ea1453575420003313300220000000276616c756500030000003134000173636f72650000601fa1453575420003313400220000000276616c756500030000003135000173636f726500000022a1453575420003313500220000000276616c756500030000003136000173636f726500000023a1453575420003313600220000000276616c756500030000003137000173636f726500004025a1453575420003313700220000000276616c756500030000003138000173636f726500005026a1453575420003313800220000000276616c756500030000003139000173636f726500004027a1453575420003313900220000000276616c756500030000003230000173636f726500003028a1453575420003323000220000000276616c756500030000003231000173636f72650000802aa1453575420003323100220000000276616c756500030000003232000173636f72650000a02ba1453575420003323200220000000276616c756500030000003233000173636f72650000b02ca1453575420003323300220000000276616c756500030000003234000173636f72650000d02da1453575420003323400220000000276616c756500030000003235000173636f72650000c02ea1453575420003323500220000000276616c756500030000003236000173636f726500000030a1453575420003323600220000000276616c756500030000003237000173636f726500000031a1453575420003323700220000000276616c756500030000003238000173636f726500000032a1453575420003323800220000000276616c756500030000003239000173636f72650000c033a1453575420003323900220000000276616c756500030000003330000173636f72650000c034a1453575420003333000220000000276616c756500030000003331000173636f72650000b035a1453575420003333100220000000276616c756500030000003332000173636f72650000b036a1453575420003333200220000000276616c756500030000003333000173636f72650000b037a1453575420003333300220000000276616c756500030000003334000173636f72650000a038a1453575420003333400220000000276616c756500030000003335000173636f72650000a039a1453575420003333500220000000276616c756500030000003336000173636f72650000a03aa1453575420003333600220000000276616c756500030000003337000173636f72650000903ba1453575420003333700220000000276616c756500030000003338000173636f72650000803ca1453575420003333800220000000276616c756500030000003339000173636f72650000b03da1453575420003333900220000000276616c756500030000003430000173636f72650000b03fa1453575420003343000220000000276616c756500030000003431000173636f72650000a040a1453575420003343100220000000276616c756500030000003432000173636f726500009041a1453575420003343200220000000276616c756500030000003433000173636f726500008042a1453575420003343300220000000276616c756500030000003434000173636f726500008043a1453575420003343400220000000276616c756500030000003435000173636f726500008044a1453575420003343500220000000276616c756500030000003436000173636f72650000a045a1453575420003343600220000000276616c756500030000003437000173636f72650000a046a1453575420003343700220000000276616c756500030000003438000173636f72650000a047a1453575420003343800220000000276616c756500030000003439000173636f726500009048a1453575420003343900220000000276616c756500030000003530000173636f726500009049a1453575420003353000220000000276616c756500030000003531000173636f72650000804aa1453575420003353100220000000276616c756500030000003532000173636f72650000b04ba1453575420003353200220000000276616c756500030000003533000173636f72650000a04ca1453575420003353300220000000276616c756500030000003534000173636f72650000b04da1453575420003353400220000000276616c756500030000003535000173636f72650000b04ea1453575420003353500220000000276616c756500030000003536000173636f72650000b04fa1453575420003353600220000000276616c756500030000003537000173636f72650000b050a1453575420003353700220000000276616c756500030000003538000173636f72650000b051a1453575420003353800220000000276616c756500030000003539000173636f72650000b052a1453575420003353900220000000276616c756500030000003630000173636f726500009054a1453575420003363000220000000276616c756500030000003631000173636f726500009055a1453575420003363100220000000276616c756500030000003632000173636f726500008056a1453575420003363200220000000276616c756500030000003633000173636f726500008057a1453575420003363300220000000276616c756500030000003634000173636f726500008058a1453575420003363400220000000276616c756500030000003635000173636f72650000505aa1453575420003363500220000000276616c756500030000003636000173636f72650000405ba1453575420003363600220000000276616c756500030000003637000173636f72650000405ca1453575420003363700220000000276616c756500030000003638000173636f72650000305da1453575420003363800220000000276616c756500030000003639000173636f72650000205ea1453575420003363900220000000276616c756500030000003730000173636f72650000105fa1453575420003373000220000000276616c756500030000003731000173636f726500001060a1453575420003373100220000000276616c756500030000003732000173636f726500001061a1453575420003373200220000000276616c756500030000003733000173636f726500002062a1453575420003373300220000000276616c756500030000003734000173636f726500002063a1453575420003373400220000000276616c756500030000003735000173636f726500001064a1453575420003373500220000000276616c756500030000003736000173636f726500000065a1453575420003373600220000000276616c756500030000003737000173636f726500000066a1453575420003373700220000000276616c756500030000003738000173636f72650000f066a1453575420003373800220000000276616c756500030000003739000173636f726500005068a1453575420003373900220000000276616c756500030000003830000173636f726500009069a1453575420003383000220000000276616c756500030000003831000173636f72650000906aa1453575420003383100220000000276616c756500030000003832000173636f72650000806ba1453575420003383200220000000276616c756500030000003833000173636f72650000806ca1453575420003383300220000000276616c756500030000003834000173636f72650000906da1453575420003383400220000000276616c756500030000003835000173636f72650000806ea1453575420003383500220000000276616c756500030000003836000173636f72650000706fa1453575420003383600220000000276616c756500030000003837000173636f726500005070a1453575420003383700220000000276616c756500030000003838000173636f726500004071a1453575420003383800220000000276616c756500030000003839000173636f726500003072a1453575420003383900220000000276616c756500030000003930000173636f726500005073a1453575420003393000220000000276616c756500030000003931000173636f726500004074a1453575420003393100220000000276616c756500030000003932000173636f726500003075a1453575420003393200220000000276616c756500030000003933000173636f726500001076a1453575420003393300220000000276616c756500030000003934000173636f726500000077a1453575420003393400220000000276616c756500030000003935000173636f72650000e077a1453575420003393500220000000276616c756500030000003936000173636f72650000d078a1453575420003393600220000000276616c756500030000003937000173636f72650000d079a1453575420003393700220000000276616c756500030000003938000173636f72650000c07aa1453575420003393800220000000276616c756500030000003939000173636f72650000b07ba1453575420003393900230000000276616c75650004000000313030000173636f72650000907da145357542000331303000230000000276616c75650004000000313031000173636f72650000807ea14535754200001269640023fa388006000000026e73000a000000302e6f626a656374730000016f6b00000000000000f03f00
2016/07/05 17:18:22 before Close
2016/07/05 17:18:22 WRITE: 200000000000000000000000d7070000000000000100000023fa388006000000
2016/07/05 17:18:22 Done

That WRITE between before Find and before Next:

00000000: df00 0000 0700 0000 0000 0000 d407 0000  ................
00000010: 1000 0000 302e 2463 6d64 0000 0000 00ff  ....0.$cmd......
00000020: ffff ff98 0000 0002 6669 6e64 0008 0000  ........find....
00000030: 006f 626a 6563 7473 0003 6669 6c74 6572  .objects..filter
00000040: 001e 0000 0002 5f6b 6579 000f 0000 0075  ......_key.....u
00000050: 7365 7273 3a6a 6f69 6e64 6174 6500 0003  sers:joindate...
00000060: 736f 7274 001b 0000 0010 7363 6f72 6500  sort......score.
00000070: 0100 0000 1076 616c 7565 0001 0000 0000  .....value......
00000080: 0370 726f 6a65 6374 696f 6e00 2400 0000  .projection.$...
00000090: 105f 6964 0000 0000 0010 7661 6c75 6500  ._id......value.
000000a0: 0100 0000 1073 636f 7265 0001 0000 0000  .....score......
000000b0: 1073 6b69 7000 0000 0000 0024 0000 0010  .skip......$....
000000c0: 7661 6c75 6500 0100 0000 1073 636f 7265  value......score
000000d0: 0001 0000 0010 5f69 6400 0000 0000 00    ......_id......

Which decodes to (decoded by hand using https://docs.mongodb.com/manual/reference/mongodb-wire-protocol/):

  • message length: 223 bytes
  • request ID: 7
  • response to: 0
  • request type: OP_QUERY
  • flags: 0x00000010 NoCursorTimeout
  • full collection name: 0.$cmd
  • number to skip: 0
  • number to return: -1
  • {
            "find": "objects",
            "filter": {
                    "_key": "users:joindate"
            },
            "sort": {
                    "score": 1,
                    "value": 1
            },
            "projection": {
                    "_id": 0,
                    "value": 1,
                    "score": 1
            },
            "skip": 0
    }
    

  • {
            "value": 1,
            "score": 1,
            "_id": 0
    }
    

The cursor is not counted in the noTimeout field of db.serverStatus().metrics.cursor.open, but it is counted in the total field.

Comment by Ben Lubar [ 05/Jul/16 ]

Reposted here: http://stackoverflow.com/questions/38211363/mongodb-cursor-not-found-even-with-no-timeout

Comment by Ramon Fernandez Marina [ 05/Jul/16 ]

BenLubar, this question would be best posted in the mongodb-user group or Stack Overflow with the mongodb tag, where it will reach a larger audience. If you think you've found a but in the SERVER we'll need a way to reproduce it locally, preferably in the shell to make sure that this is not a driver-related issue either. See also our Technical Support page if you need additional support resources.

Regards,
Ramón.

Generated at Thu Feb 08 04:07:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.