  1. Core Server
  2. SERVER-13358

long aggregation queries get a cursor timeout error

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: 2.6.0-rc2
    • Component/s: Aggregation Framework
    • Labels: None
    • ALL

      When running a long aggregation against a 3-shard cluster with "allowDiskUse" enabled and a "$out" stage, the operation eventually fails with the following error:

      mongos> db.transactions.aggregate(  [  <some large grouping> ,{ $out: "outputCollection"      } ], { allowDiskUse: true }  );
      assert: command failed: {
              "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",
              "code" : 13127,
              "ok" : 0
      } : aggregate failed
      Error: command failed: {
              "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",  
              "code" : 13127,
              "ok" : 0
      } : aggregate failed
          at Error (<anonymous>)
          at doassert (src/mongo/shell/assert.js:11:14)
          at Function.assert.commandWorked (src/mongo/shell/assert.js:244:5)
          at DBCollection.aggregate (src/mongo/shell/collection.js:1149:12)
          at (shell):1:17
      2014-03-26T10:20:41.044+0000 Error: command failed: {
              "errmsg" : "exception: getMore: cursor didn't exist on server, possible restart or timeout?",  
              "code" : 13127,
              "ok" : 0
      } : aggregate failed at src/mongo/shell/assert.js:13
      

      From the shell, the operation does not seem to fail after 10 minutes but after a much longer time; I will try to time it.
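One way to time the failure is a small wrapper that records elapsed time whether the call returns or throws. This is only a sketch: `timeOperation` is a hypothetical helper, not part of the mongo shell, and it is written with `var` and plain functions so that it should also load in older shells.

```javascript
// Hypothetical helper (not a mongo shell built-in): runs fn once and
// reports how long it took, whether it returned or threw.
function timeOperation(fn) {
  var start = Date.now();
  try {
    var result = fn();
    return { ok: true, elapsedMs: Date.now() - start, result: result };
  } catch (err) {
    return { ok: false, elapsedMs: Date.now() - start, error: String(err) };
  }
}

// Possible usage in the mongo shell, with the pipeline from this report:
//
//   var outcome = timeOperation(function () {
//     return db.transactions.aggregate(
//       [ <some large grouping>, { $out: "outputCollection" } ],
//       { allowDiskUse: true });
//   });
//   print(outcome.ok + " after " + outcome.elapsedMs + " ms");
```

Catching the exception keeps the elapsed time available even when the aggregate fails with the cursor error.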

      In the mongos logs, the only relevant line is:

      2014-03-26T10:20:40.979+0000 [conn2] command mydb.$cmd command: aggregate { aggregate: "transactions", pipeline: [ { $mergeCursors: [ { host: "MongoDBLinux-1:27017", id: 47309755740 }, { host: "MongoDBLinux-2:27017", id: 35067308903 }, { host: "MongoDBLinux-3:27017", id: 27388885734 } ] }, { $group: { _id: "$$ROOT._id", Terminal: { $first: "$$ROOT.Terminal" }, count: { $sum: "$$ROOT.count" }, $doingMerge: true } }, { $out: "outputCollection" } ], allowDiskUse: true, cursor: {} } keyUpdates:0 numYields:0 locks(micros) r:244 reslen:139 17850228ms
      

      The log line above was written about 5 hours after the operation was launched from the shell.
      There are some log lines mentioning "killing cursor", but they seem unrelated and occur more frequently.
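The "about 5 hours" figure can be checked against the mongos log line itself. The sketch below, meant for Node.js, assumes that the trailing `17850228ms` is the command's total running time and that the line was written when the command finished; `formatDuration` is a helper name introduced here for illustration.

```javascript
// Values copied from the mongos log line above.
var loggedAt = "2014-03-26T10:20:40.979+0000";
var durationMs = 17850228;

// Break a millisecond count into whole hours, minutes and seconds.
function formatDuration(ms) {
  var totalSeconds = Math.floor(ms / 1000);
  var hours = Math.floor(totalSeconds / 3600);
  var minutes = Math.floor((totalSeconds % 3600) / 60);
  var seconds = totalSeconds % 60;
  return hours + "h " + minutes + "m " + seconds + "s";
}

// The log timestamp is UTC (+0000); rewrite the offset as "Z" so that
// Date.parse accepts it as a standard ISO 8601 string.
var endMs = Date.parse(loggedAt.replace("+0000", "Z"));
var startedAt = new Date(endMs - durationMs).toISOString();

console.log(formatDuration(durationMs)); // 4h 57m 30s
console.log(startedAt);                  // 2014-03-26T05:23:10.751Z
```

Under those assumptions the aggregate was launched at roughly 05:23 UTC and failed just under five hours later.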

      Looking at the mongod logs, there are no lines mentioning either "killing cursor" or "aggregate".

      This is quite problematic, since long aggregations cannot complete.
      I will try disabling the cursor timeout for the query to see if it makes a difference.
      My wild guess is that this error happens if one shard finishes its work more than 10 minutes after another shard has finished, or something along those lines.
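The guess above can be stated as a toy model. It is only an illustration of the hypothesis, not a description of how mongos or mongod actually manage cursors: it assumes each shard cursor sits idle from the moment its shard finishes until the slowest shard is done, and that a cursor idle for more than 10 minutes is reaped. The function name and the finish times are hypothetical.

```javascript
// Toy model of the hypothesis; it does not reproduce mongos internals.
var IDLE_TIMEOUT_MS = 10 * 60 * 1000; // assumed idle limit: the 10 minutes mentioned in this report

// finishTimesMs: hypothetical map of shard host -> time (ms since launch)
// at which that shard's cursor became ready and started sitting idle.
// Returns the hosts whose cursors would already be gone when the merge
// starts reading, assuming the merge starts once the slowest shard is ready.
function expiredCursors(finishTimesMs, idleTimeoutMs) {
  var hosts = Object.keys(finishTimesMs);
  var mergeStartMs = Math.max.apply(null, hosts.map(function (h) {
    return finishTimesMs[h];
  }));
  return hosts.filter(function (h) {
    return mergeStartMs - finishTimesMs[h] > idleTimeoutMs;
  });
}

// Made-up finish times, for illustration only.
var example = {
  "MongoDBLinux-1:27017": 4 * 3600 * 1000,                 // ready after 4h
  "MongoDBLinux-2:27017": 4 * 3600 * 1000 + 5 * 60 * 1000, // ready after 4h05
  "MongoDBLinux-3:27017": 5 * 3600 * 1000                  // ready after 5h
};

console.log(expiredCursors(example, IDLE_TIMEOUT_MS));
// [ 'MongoDBLinux-1:27017', 'MongoDBLinux-2:27017' ]
```

If the model held, the first getMore against either of the two faster shards would fail with the "cursor didn't exist on server" error seen above.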

            Assignee: Mathias Stearn (mathias@mongodb.com)
            Reporter: Antoine Girbal (antoine)
            Votes: 0
            Watchers: 6
