[GODRIVER-2944] Support CSOT spec timeoutMode for non-tailable cursors Created: 14/Aug/23  Updated: 26/Sep/23

Status: Scheduled
Project: Go Driver
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Rohan Sharan Assignee: Steve Silvester
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to GODRIVER-2622 Stabilize CSOT Closed
related to DRIVERS-2722 Change CSOT default cursor timeout mo... Needs Triage

 Description   

The CSOT spec defines a timeoutMode option for non-tailable cursors. When it is set to ITERATION, the timeout is not applied cumulatively across all operations resulting from something like a Find; instead, it is applied individually to the initial operation and to each follow-up getMore command: https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.rst#non-tailable-cursor-behavior 

 

Mongosync had a problem with this in HELP-47315, where it set the timeout to 5 minutes by default, which was too short for the whole Find operation (including the getMore commands) to finish. The TAR team currently has REP-3079 filed to mitigate the issue, but adhering to the CSOT spec would be preferable. The failing entry in the mongod server logs is the following:

{"t":{"$date":"2023-08-10T08:01:37.411+00:00"},"s":"W",  "c":"QUERY",    "id":20478,   "ctx":"conn7899290","msg":"getMore command executor error","attr":{"error":{"code":50,"codeName":"MaxTimeMSExpired","errmsg":"operation exceeded time limit"},"stats":{"stage":"FETCH","filter":{"$and":[{"$expr":{"$gte":["$_id",{"$const":{"$oid":"60a5a031b46b1000131feef3"}}]}},{"$expr":{"$lte":["$_id",{"$const":{"$oid":"63c70228caa93700121126f7"}}]}}]},"nReturned":15402655,"works":15402655,"advanced":15402655,"needTime":0,"needYield":0,"saveState":32929,"restoreState":32928,"isEOF":0,"docsExamined":15402655,"alreadyHasObj":0,"inputStage":{"stage":"IXSCAN","nReturned":15402655,"works":15402655,"advanced":15402655,"needTime":0,"needYield":0,"saveState":32929,"restoreState":32928,"isEOF":0,"keyPattern":{"_id":1},"indexName":"_id_","isMultiKey":false,"multiKeyPaths":{"_id":[]},"isUnique":true,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"_id":["[ObjectId('60a5a031b46b1000131feef3'), ObjectId('63c70228caa93700121126f7')]"]},"keysExamined":15402655,"seeks":1,"dupsTested":0,"dupsDropped":0}},"cmd":{"getMore":7958325070821658687,"collection":"assumptions"}}} 



 Comments   
Comment by Matt Dale [ 13/Sep/23 ]

Created DRIVERS-2722 to recommend amending the CSOT spec.

Comment by Steve Silvester [ 12/Sep/23 ]

We don't yet have a drivers ticket.  A related ticket is GODRIVER-2622.  The owner of the CSOT spec is currently working on a high priority task for the department.

Comment by Tim Fogarty [ 07/Sep/23 ]

Thank you for looking into this, matt.dale@mongodb.com. We have been forced to implement a messy and imperfect workaround (REP-3079) in mongosync because we cannot set timeoutMode=ITERATION. I would love for us to remove the imperfect workaround asap. So I just want to get an idea of what the next steps are here and what you think the ETA might be.

Comment by Matt Dale [ 06/Sep/23 ]

Answering questions:

Should we throw an error when a user tries to use getMore with a context with a larger timeout than the client's timeout?

No. We'd only be avoiding the most detectable cases of inconsistencies between maxTimeMS applied on a find/aggregate and an operation timeout applied on a getMore, but there are all sorts of other timing conditions that could lead to confusion that we couldn't easily detect. Instead, it makes more sense to not default to using timeoutMS to limit cursor lifetimes.

Should we implement the timeoutMode logic implicitly? i.e. use the iteration logic by default to avoid knobs?

Yes.

Should we just note the limitation mentioned in this ticket in documentation? I.e. that the client timeout is cumulative on cursors?

Documenting it would reduce confusion, but users need some way to not implicitly set a cursor lifetime. Currently it doesn't seem like there's a way to avoid setting a cursor lifetime when performing a find/aggregate without implementing timeoutMode or changing the default behavior.

Comment by Preston Vasquez [ 31/Aug/23 ]

Notes from sync:

  • Should we throw an error when a user tries to use getMore with a context with a larger timeout than the client's timeout?
  • Should we implement the timeoutMode logic implicitly? i.e. use the iteration logic by default to avoid knobs?
  • Should we just note the limitation mentioned in this ticket in documentation? I.e. that the client timeout is cumulative on cursors?

Comment by Preston Vasquez [ 30/Aug/23 ]

steve.silvester@mongodb.com shane.harvey@mongodb.com My understanding is that the Go Driver did not implement timeoutMode because we could rely on contexts to time out cursor iterations, as with Python. However, Rohan brings up a good point here in that the operations will cumulatively share a timeout set on the client. Here is a gist written in Go that illustrates this problem: https://gist.github.com/prestonvasquez/24f073cf8e4a0ffbe8f1dbc738a6aa6c

Should we make timeoutMode a drivers-wide requirement instead of optional?

Comment by Preston Vasquez [ 30/Aug/23 ]

rohan.sharan@mongodb.com Ah, I see. I am going to put this ticket back into triage to (1) sync with Python, as they have a similar implementation, and (2) determine whether this is something the Go Driver team has the bandwidth to address.

Comment by Rohan Sharan [ 30/Aug/23 ]

preston.vasquez@mongodb.com The timeout that we're trying to protect against is the following:

maxTimeMS is cumulative time that the server spends processing the original operation as well as any following getMore commands (not including the time in between getMore command processing)

That means that we're not talking about timeouts on the client side, but on the server side. As far as I can tell, your example is sleeping on the client side, so the server probably is able to process all events within 100ms. 

Comment by Preston Vasquez [ 29/Aug/23 ]

rohan.sharan@mongodb.com Here is an example to help illustrate how the client’s timeout shouldn’t affect additional “getMore” requests: https://gist.github.com/prestonvasquez/2b745e8c9de91c94e90ac18d235f1ef5

Does this resemble your use case?

Generated at Thu Feb 08 08:39:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.