Details
- Bug
- Resolution: Duplicate
- Major - P3
- None
- 2.6.0-rc2
- None
- ALL
Description
When running a long aggregation against a 3-shard cluster with "allowDiskUse" and "$out" set, the operation eventually fails with the following error:
mongos> db.transactions.aggregate( [ <some large grouping> ,{ $out: "outputCollection" } ], { allowDiskUse: true } );
assert: command failed: {
    "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",
    "code" : 13127,
    "ok" : 0
} : aggregate failed
Error: command failed: {
    "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",
    "code" : 13127,
    "ok" : 0
} : aggregate failed
    at Error (<anonymous>)
    at doassert (src/mongo/shell/assert.js:11:14)
    at Function.assert.commandWorked (src/mongo/shell/assert.js:244:5)
    at DBCollection.aggregate (src/mongo/shell/collection.js:1149:12)
    at (shell):1:17
2014-03-26T10:20:41.044+0000 Error: command failed: {
    "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",
    "code" : 13127,
    "ok" : 0
} : aggregate failed at src/mongo/shell/assert.js:13
|
The operation does not seem to fail 10 minutes after being launched from the shell, but after a much longer time; I will try to time it.
Looking at the mongos logs, the only relevant line is:
2014-03-26T10:20:40.979+0000 [conn2] command mydb.$cmd command: aggregate { aggregate: "transactions", pipeline: [ { $mergeCursors: [ { host: "MongoDBLinux-1:27017", id: 47309755740 }, { host: "MongoDBLinux-2:27017", id: 35067308903 }, { host: "MongoDBLinux-3:27017", id: 27388885734 } ] }, { $group: { _id: "$$ROOT._id", Terminal: { $first: "$$ROOT.Terminal" }, count: { $sum: "$$ROOT.count" }, $doingMerge: true } }, { $out: "outputCollection" } ], allowDiskUse: true, cursor: {} } keyUpdates:0 numYields:0 locks(micros) r:244 reslen:139 17850228ms
|
The log line above appears about 5 hours after the operation was launched from the shell.
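For reference, the "about 5 hours" figure can be checked against the duration at the end of the log line (17850228ms). The snippet below is only a small arithmetic sketch in plain JavaScript; the variable and function names are illustrative, and the log timestamp is rewritten with a "Z" suffix on the assumption that "+0000" means UTC.

```javascript
// Duration reported at the end of the mongos log line, in milliseconds.
var durationMs = 17850228;

// Time at which mongos logged the command (assumed UTC, "+0000" written as "Z").
var loggedAt = new Date("2014-03-26T10:20:40.979Z");

// Convert a millisecond duration to fractional hours.
function msToHours(ms) {
  return ms / (60 * 60 * 1000);
}

// Roughly 4.96 hours, matching the "about 5 hours" observation.
var elapsedHours = msToHours(durationMs);

// Approximate time at which mongos started processing the aggregate command.
var startedAt = new Date(loggedAt.getTime() - durationMs);
```

Under these assumptions, startedAt.toISOString() gives the approximate launch time of the command as seen by mongos.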
There are some lines mentioning "killing cursor" but they seem unrelated and happen more often.
Looking at mongod logs, there are no lines mentioning "killing cursor" nor "aggregate".
This is quite problematic since it makes the aggregation unusable for long-running operations.
I will try to disable cursor timeout in the query to see if it makes a difference.
My wild guess is that this error happens if one shard finishes its job more than 10 minutes after another shard has finished, or something like that.
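The guess above can be written down as a toy model. This is only a sketch of the hypothesis, not a description of how mongos or mongod actually manage cursors: it assumes a 10-minute idle timeout, assumes the merging stage does not touch any shard cursor until the slowest shard is ready, and uses made-up finish times. The function name shardsAtRisk and its inputs are hypothetical.

```javascript
// Assumed idle timeout for a shard cursor: 10 minutes, in milliseconds.
var CURSOR_IDLE_TIMEOUT_MS = 10 * 60 * 1000;

// finishTimesMs: hypothetical map of shard name -> time (ms since launch)
// at which that shard's cursor became ready.
// Returns the shards whose cursor would sit idle longer than timeoutMs
// if nothing reads from it until the slowest shard has finished.
function shardsAtRisk(finishTimesMs, timeoutMs) {
  var names = Object.keys(finishTimesMs);
  var slowest = Math.max.apply(null, names.map(function (name) {
    return finishTimesMs[name];
  }));
  return names.filter(function (name) {
    return slowest - finishTimesMs[name] > timeoutMs;
  });
}

// Made-up example: shard1 finishes 11 minutes before shard2.
var atRisk = shardsAtRisk(
  { shard1: 0, shard2: 11 * 60 * 1000, shard3: 10.5 * 60 * 1000 },
  CURSOR_IDLE_TIMEOUT_MS
);
```

In this example only shard1 is flagged, since it waits 11 minutes for shard2 while shard3 waits only 30 seconds. Timing how far apart the shards actually finish would show whether the real failure fits this pattern.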
Issue Links
- duplicates SERVER-6036 Disable cursor timeout for cursors that belong to a session (Closed)