[SERVER-13358] long aggregation queries get a cursor timeout error Created: 26/Mar/14  Updated: 10/Dec/14  Resolved: 27/Mar/14

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 2.6.0-rc2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Antoine Girbal Assignee: Mathias Stearn
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-6036 Disable cursor timeout for cursors th... Closed
Operating System: ALL
Participants:

 Description   

When running a long aggregation against a 3-shard cluster, with "allowDiskUse" and "$out" set, the operation eventually fails with the following error:

mongos> db.transactions.aggregate(  [  <some large grouping> ,{ $out: "outputCollection"      } ], { allowDiskUse: true }  );
assert: command failed: {
        "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",
        "code" : 13127,
        "ok" : 0
} : aggregate failed
Error: command failed: {
        "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",  
        "code" : 13127,
        "ok" : 0
} : aggregate failed
    at Error (<anonymous>)
    at doassert (src/mongo/shell/assert.js:11:14)
    at Function.assert.commandWorked (src/mongo/shell/assert.js:244:5)
    at DBCollection.aggregate (src/mongo/shell/collection.js:1149:12)
    at (shell):1:17
2014-03-26T10:20:41.044+0000 Error: command failed: {
        "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",  
        "code" : 13127,
        "ok" : 0
} : aggregate failed at src/mongo/shell/assert.js:13

The operation does not seem to fail 10 minutes after being launched from the shell, but after a much longer time. I will try to time it.

Looking at the mongos logs, the only relevant line is:

2014-03-26T10:20:40.979+0000 [conn2] command mydb.$cmd command: aggregate { aggregate: "transactions", pipeline: [ { $mergeCursors: [ { host: "MongoDBLinux-1:27017", id: 47309755740 }, { host: "MongoDBLinux-2:27017", id: 35067308903 }, { host: "MongoDBLinux-3:27017", id: 27388885734 } ] }, { $group: { _id: "$$ROOT._id", Terminal: { $first: "$$ROOT.Terminal" }, count: { $sum: "$$ROOT.count" }, $doingMerge: true } }, { $out: "outputCollection" } ], allowDiskUse: true, cursor: {} } keyUpdates:0 numYields:0 locks(micros) r:244 reslen:139 17850228ms

The log line above was written about 5 hours after the operation was launched from the shell.
There are some lines mentioning "killing cursor", but they seem unrelated and appear more often.
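
As a rough cross-check of the "about 5 hours" figure, the trailing 17850228ms in the mongos log line can be converted to hours and subtracted from the log timestamp. This is only a sketch: it assumes the trailing value is the total duration of the command, and the helper name formatDuration is made up for illustration.

```javascript
// Duration printed at the end of the mongos log line, in milliseconds.
// Assumption: this is the total running time of the aggregate command.
const elapsedMs = 17850228;

// Timestamp of the log line; "+0000" in the log is written as "Z" here.
const loggedAt = Date.parse("2014-03-26T10:20:40.979Z");

// Hypothetical helper: render whole seconds as hours, minutes, seconds.
function formatDuration(ms) {
  const totalSeconds = Math.floor(ms / 1000);
  const h = Math.floor(totalSeconds / 3600);
  const m = Math.floor((totalSeconds % 3600) / 60);
  const s = totalSeconds % 60;
  return `${h}h ${m}m ${s}s`;
}

const duration = formatDuration(elapsedMs);
// Approximate launch time, derived by subtracting the duration.
const startedAt = new Date(loggedAt - elapsedMs).toISOString();

console.log(duration);  // 4h 57m 30s
console.log(startedAt); // 2014-03-26T05:23:10.751Z
```

Under that assumption the command ran for just under 5 hours, which matches the timing observed from the shell.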

Looking at the mongod logs, there are no lines mentioning "killing cursor" or "aggregate".

This is quite problematic since it makes the aggregate command unusable for long aggregations.
I will try to disable cursor timeout in the query to see if it makes a difference.
My wild guess is that this error happens if one shard finishes its job more than 10 minutes after another shard has finished, or something like that.
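
The guess above can be written down as a small model. This is not the server's actual logic, only a sketch of the hypothesis: it assumes each shard cursor sits idle from the moment its shard finishes until the slowest shard finishes, and that an idle cursor is dropped after 10 minutes. The function name cursorsAtRisk and the finish times are invented for illustration; only the host names come from the log line.

```javascript
// Idle timeout assumed by the hypothesis (the 10 min mentioned in the report).
const IDLE_TIMEOUT_MS = 10 * 60 * 1000;

// Hypothetical model: a shard cursor is at risk if it waits longer than
// the timeout for the slowest shard to finish its part of the pipeline.
function cursorsAtRisk(shards, timeoutMs) {
  const lastFinish = Math.max(...shards.map((s) => s.finishedAtMs));
  return shards
    .filter((s) => lastFinish - s.finishedAtMs > timeoutMs)
    .map((s) => s.host);
}

// Made-up finish times, in milliseconds since the command started.
const exampleShards = [
  { host: "MongoDBLinux-1:27017", finishedAtMs: 0 },
  { host: "MongoDBLinux-2:27017", finishedAtMs: 8 * 60 * 1000 },
  { host: "MongoDBLinux-3:27017", finishedAtMs: 15 * 60 * 1000 },
];

const atRisk = cursorsAtRisk(exampleShards, IDLE_TIMEOUT_MS);
console.log(atRisk); // [ 'MongoDBLinux-1:27017' ]
```

In this example only the first shard's cursor waits more than 10 minutes (15 minutes), so under the hypothesis it would be the one reported as missing by getMore.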

