An operation on an awaitData tailable cursor can fail with "operation exceeded time limit" even if maxTimeMS is not set

XMLWordPrintableJSON

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: 3.2.11, 3.2.18
    • Component/s: Querying
    • Query
    • ALL
    • Hide

      I use MongoDB v3.2.11/3.2.18 on a VM with 1 GB of RAM with no swap configured. The goal is to punch the server hard enough for performance to degrade significantly but before it OOMs or gets crippled hard enough so sockets start to time out. This is how I do it:

      • I start my Java code (I thought it would be easier to reproduce with a larger number of threads but that didn't appear to be the case)
      • I do a mongorestore of a reasonable small collection (the dump size is ~50 MB). I let the first execution to complete, then run it again interrupting half-way

      Once the server is in the required state the exception can be encountered easily.

      Show
      I use MongoDB v3.2.11/3.2.18 on a VM with 1 GB of RAM with no swap configured. The goal is to punch the server hard enough for performance to degrade significantly but before it OOMs or gets crippled hard enough so sockets start to time out. This is how I do it: I start my Java code (I thought it would be easier to reproduce with a larger number of threads but that didn't appear to be the case) I do a mongorestore of a reasonable small collection (the dump size is ~50 MB). I let the first execution to complete, then run it again interrupting half-way Once the server is in the required state the exception can be encountered easily.
    • None
    • 3
    • None
    • None
    • None
    • None
    • None
    • None

      In my repro this happens when the server is struggling. The query looks like:

      2017-12-15T17:38:37.527+1100 D COMMAND  [conn8516] run command local.$cmd { find: "oplog.rs", filter: { op: { $in: [ "i", "u" ] }, ts: { $gte: Timestamp 1491415276000|0 } }, sort: { $natural: 1 }, batchSize: 4000, tailable: true, awaitData: true, oplogReplay: true, noCursorTimeout: true }
      

      Sometimes (unfortunately my repro isn't consistent and the outcome is non-deterministic) a getMore fails like:

      2017-12-19T16:55:39.400+1100 I COMMAND  [conn103] command local.oplog.rs command: getMore { getMore: 21203342251, collection: "oplog.rs", batchSize: 4000 } cursorid:21203342251 keyUpdates:0 writeConflicts:0 exception: operation exceeded time limit code:50 numYields:0 reslen:89 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 1 } } } protocol:op_query 22401ms
      

      The chance of getting an exception doesn't seem to be depending on the duration of the getMore - sometimes I see them taking much longer, but no exception is thrown:

      2017-12-14T16:11:04.323+1100 I COMMAND  [conn29] command local.oplog.rs command: getMore { getMore: 18711090531, collection: "oplog.rs", batchSize: 4000 } planSummary: COLLSCAN cursorid:18711090531 keysExamined:0 docsExamined:213 keyUpdates:0 writeConflicts:0 numYields:2 nreturned:213 reslen:28533 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_query 109834ms
      

            Assignee:
            Backlog - Query Team (Inactive)
            Reporter:
            Dmitry Ryabtsev
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

              Created:
              Updated:
              Resolved: