Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-32399

An operation on an awaitData tailable cursor can fail with "operation exceeded time limit" even if maxTimeMS is not set

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.2.11, 3.2.18
    • Fix Version/s: None
    • Component/s: Querying
    • Operating System:
      ALL
    • Steps To Reproduce:
      Hide

      I use MongoDB v3.2.11/3.2.18 on a VM with 1 GB of RAM with no swap configured. The goal is to punch the server hard enough for performance to degrade significantly but before it OOMs or gets crippled hard enough so sockets start to time out. This is how I do it:

      • I start my Java code (I thought it would be easier to reproduce with a larger number of threads but that didn't appear to be the case)
      • I do a mongorestore of a reasonable small collection (the dump size is ~50 MB). I let the first execution to complete, then run it again interrupting half-way

      Once the server is in the required state the exception can be encountered easily.

      Show
      I use MongoDB v3.2.11/3.2.18 on a VM with 1 GB of RAM with no swap configured. The goal is to punch the server hard enough for performance to degrade significantly but before it OOMs or gets crippled hard enough so sockets start to time out. This is how I do it: I start my Java code (I thought it would be easier to reproduce with a larger number of threads but that didn't appear to be the case) I do a mongorestore of a reasonable small collection (the dump size is ~50 MB). I let the first execution to complete, then run it again interrupting half-way Once the server is in the required state the exception can be encountered easily.
    • Case:

      Description

      In my repro this happens when the server is struggling. The query looks like:

      2017-12-15T17:38:37.527+1100 D COMMAND  [conn8516] run command local.$cmd { find: "oplog.rs", filter: { op: { $in: [ "i", "u" ] }, ts: { $gte: Timestamp 1491415276000|0 } }, sort: { $natural: 1 }, batchSize: 4000, tailable: true, awaitData: true, oplogReplay: true, noCursorTimeout: true }
      

      Sometimes (unfortunately my repro isn't consistent and the outcome is non-deterministic) a getMore fails like:

      2017-12-19T16:55:39.400+1100 I COMMAND  [conn103] command local.oplog.rs command: getMore { getMore: 21203342251, collection: "oplog.rs", batchSize: 4000 } cursorid:21203342251 keyUpdates:0 writeConflicts:0 exception: operation exceeded time limit code:50 numYields:0 reslen:89 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 1 } } } protocol:op_query 22401ms
      

      The chance of getting an exception doesn't seem to be depending on the duration of the getMore - sometimes I see them taking much longer, but no exception is thrown:

      2017-12-14T16:11:04.323+1100 I COMMAND  [conn29] command local.oplog.rs command: getMore { getMore: 18711090531, collection: "oplog.rs", batchSize: 4000 } planSummary: COLLSCAN cursorid:18711090531 keysExamined:0 docsExamined:213 keyUpdates:0 writeConflicts:0 numYields:2 nreturned:213 reslen:28533 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_query 109834ms
      

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              backlog-server-query Backlog - Query Team
              Reporter:
              dmitry.ryabtsev Dmitry Ryabtsev
              Participants:
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: