- Type: Bug
- Resolution: Duplicate
- Priority: Major - P3
- Affects Version/s: 2.6.0-rc2
- Component/s: Aggregation Framework
ALL
When running a long aggregation against a 3-shard cluster with "allowDiskUse" and "$out" set, the operation eventually fails with the following error:
mongos> db.transactions.aggregate( [ <some large grouping> ,{ $out: "outputCollection" } ], { allowDiskUse: true } );
assert: command failed: {
    "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",
    "code" : 13127,
    "ok" : 0
} : aggregate failed
Error: command failed: {
    "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",
    "code" : 13127,
    "ok" : 0
} : aggregate failed
    at Error (<anonymous>)
    at doassert (src/mongo/shell/assert.js:11:14)
    at Function.assert.commandWorked (src/mongo/shell/assert.js:244:5)
    at DBCollection.aggregate (src/mongo/shell/collection.js:1149:12)
    at (shell):1:17
2014-03-26T10:20:41.044+0000 Error: command failed: {
    "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",
    "code" : 13127,
    "ok" : 0
} : aggregate failed at src/mongo/shell/assert.js:13
The operation does not seem to fail 10 minutes after being launched from the shell, but much later; I will try to time it.
Looking at the mongos logs, the only relevant line is:
2014-03-26T10:20:40.979+0000 [conn2] command mydb.$cmd command: aggregate { aggregate: "transactions", pipeline: [ { $mergeCursors: [ { host: "MongoDBLinux-1:27017", id: 47309755740 }, { host: "MongoDBLinux-2:27017", id: 35067308903 }, { host: "MongoDBLinux-3:27017", id: 27388885734 } ] }, { $group: { _id: "$$ROOT._id", Terminal: { $first: "$$ROOT.Terminal" }, count: { $sum: "$$ROOT.count" }, $doingMerge: true } }, { $out: "outputCollection" } ], allowDiskUse: true, cursor: {} } keyUpdates:0 numYields:0 locks(micros) r:244 reslen:139 17850228ms
The log line above was written about 5 hours after the operation was launched from the shell.
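The "about 5 hours" figure can be checked against the duration printed at the end of the mongos log line (17850228ms). A minimal sketch in plain JavaScript follows; the helper name formatDuration is made up for illustration and is not part of the shell.

```javascript
// Convert a duration in milliseconds (as printed at the end of a log line)
// into whole hours, minutes and seconds.
function formatDuration(ms) {
  const totalSeconds = Math.floor(ms / 1000);
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  return hours + "h " + minutes + "m " + seconds + "s";
}

// Duration taken from the mongos log line quoted above.
const elapsed = formatDuration(17850228);
console.log(elapsed); // 4h 57m 30s
```

So the command had been running for just under 5 hours when mongos logged it.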
There are some lines mentioning "killing cursor" but they seem unrelated and happen more often.
Looking at the mongod logs, there are no lines mentioning "killing cursor" or "aggregate".
This is quite problematic, since it makes the aggregation framework unusable for long-running aggregations.
I will try to disable cursor timeout in the query to see if it makes a difference.
My wild guess is that this error happens if one shard finishes its job more than 10 minutes after another shard has finished, or something like that.
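To make the guess concrete, here is a small sketch in plain JavaScript. It only models the hypothesis above and is not the server's actual logic: the function name shardsAtRisk, the finish times, and the assumption that a shard's cursor sits idle from the moment that shard finishes until the slowest shard finishes are all made up for illustration. The 10-minute value is the idle timeout assumed in the guess.

```javascript
// Hypothetical model of the guess above; NOT how mongos or mongod decide anything.
const IDLE_TIMEOUT_MS = 10 * 60 * 1000; // assumed idle-cursor timeout: 10 minutes

// finishTimes maps a shard name to the time (ms since the aggregation started)
// at which that shard finished its part and its cursor started waiting.
// Returns the shards whose cursor would have waited longer than timeoutMs
// by the time the slowest shard finished.
function shardsAtRisk(finishTimes, timeoutMs) {
  const slowest = Math.max(...Object.values(finishTimes));
  return Object.keys(finishTimes).filter(
    (shard) => slowest - finishTimes[shard] > timeoutMs
  );
}

// Made-up finish times, for illustration only.
const example = {
  "MongoDBLinux-1": 60 * 60 * 1000, // finished after 60 minutes
  "MongoDBLinux-2": 65 * 60 * 1000, // finished after 65 minutes
  "MongoDBLinux-3": 80 * 60 * 1000, // finished after 80 minutes
};

console.log(shardsAtRisk(example, IDLE_TIMEOUT_MS));
// [ 'MongoDBLinux-1', 'MongoDBLinux-2' ]
```

Under this model, the cursors on the two faster shards would already be gone when the merge step issues its getMore, which would match the 13127 error; whether that is what really happens is exactly what remains to be confirmed.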
- duplicates: SERVER-6036 Disable cursor timeout for cursors that belong to a session (Closed)