[SERVER-32399] An operation on an awaitData tailable cursor can fail with "operation exceeded time limit" even if maxTimeMS is not set Created: 19/Dec/17  Updated: 06/Dec/22  Resolved: 28/Jun/19

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.2.11, 3.2.18
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dmitry Ryabtsev Assignee: Backlog - Query Team (Inactive)
Resolution: Duplicate Votes: 0
Labels: query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-33942 getMore with maxTimeMS returns "opera... Closed
Related
Assigned Teams:
Query
Operating System: ALL
Steps To Reproduce:

I use MongoDB v3.2.11/3.2.18 on a VM with 1 GB of RAM with no swap configured. The goal is to punch the server hard enough for performance to degrade significantly but before it OOMs or gets crippled hard enough so sockets start to time out. This is how I do it:

  • I start my Java code (I thought it would be easier to reproduce with a larger number of threads but that didn't appear to be the case)
  • I do a mongorestore of a reasonable small collection (the dump size is ~50 MB). I let the first execution to complete, then run it again interrupting half-way

Once the server is in the required state the exception can be encountered easily.

Participants:
Case:

 Description   

In my repro this happens when the server is struggling. The query looks like:

2017-12-15T17:38:37.527+1100 D COMMAND  [conn8516] run command local.$cmd { find: "oplog.rs", filter: { op: { $in: [ "i", "u" ] }, ts: { $gte: Timestamp 1491415276000|0 } }, sort: { $natural: 1 }, batchSize: 4000, tailable: true, awaitData: true, oplogReplay: true, noCursorTimeout: true }

Sometimes (unfortunately my repro isn't consistent and the outcome is non-deterministic) a getMore fails like:

2017-12-19T16:55:39.400+1100 I COMMAND  [conn103] command local.oplog.rs command: getMore { getMore: 21203342251, collection: "oplog.rs", batchSize: 4000 } cursorid:21203342251 keyUpdates:0 writeConflicts:0 exception: operation exceeded time limit code:50 numYields:0 reslen:89 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 1 } } } protocol:op_query 22401ms

The chance of getting an exception doesn't seem to be depending on the duration of the getMore - sometimes I see them taking much longer, but no exception is thrown:

2017-12-14T16:11:04.323+1100 I COMMAND  [conn29] command local.oplog.rs command: getMore { getMore: 18711090531, collection: "oplog.rs", batchSize: 4000 } planSummary: COLLSCAN cursorid:18711090531 keysExamined:0 docsExamined:213 keyUpdates:0 writeConflicts:0 numYields:2 nreturned:213 reslen:28533 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } protocol:op_query 109834ms



 Comments   
Comment by David Storch [ 28/Jun/19 ]

This problem was fixed in development version 3.7.4 under SERVER-33942. The fix is currently available in 4.0.x stable release and in 4.2 release candidates. I'm resolving this ticket as a duplicate.

Generated at Thu Feb 08 04:30:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.