- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
Context
The current CSOT behavior of including maxTimeMS on Find and Aggregate operations when timeoutMS is set limits the server-side cursor lifetime to timeoutMS. Users don't expect an operation-level timeout to apply to cursor lifetimes, and are surprised when it does. However, there are also cases where it's important to be able to set a server-side timeout for Find and Aggregate operations so that the "background reads" feature (see GODRIVER-3172) can help prevent connection churn for those operations.
Since some customers need one behavior and some need the other, we should add the timeoutMode option to Find and Aggregate operations so that users can pick the behavior they want.
Open questions:
Should timeoutMode be a Client-level config or only an operation-level config?
The way the CSOT spec describes timeoutMode suggests that it is configurable only at the operation level:
"If timeoutMode is set to ITERATION, drivers MUST raise a client-side error if the operation is an aggregate with a $out or $merge pipeline stage."
"Tailable cursors only support the ITERATION value for the timeoutMode option. This is the default value and drivers MUST error if the option is set to CURSOR_LIFETIME."
and about the Watch operation:
"These helpers MUST NOT support the timeoutMode option as change streams are an abstraction around tailable-awaitData cursors, so they implicitly use ITERATION mode."
If timeoutMode can be set at the Client-level, it would be possible to create a Client with timeoutMode=ITERATION that can never run an Aggregate with an $out or $merge stage, which seems like unexpected behavior. It seems like timeoutMode should only be configurable at the operation level.
Original description:
The CSOT spec describes a timeoutMode option for non-tailable cursors. With it, the timeout is not applied cumulatively across all commands resulting from an operation such as a Find, but individually to the initial command and to each follow-up getMore command: https://github.com/mongodb/specifications/blob/master/source/client-side-operations-timeout/client-side-operations-timeout.rst#non-tailable-cursor-behavior
Mongosync had a problem with this in HELP-47315, where it set the timeout to 5 minutes by default, which was too short for the whole Find operation (including the getMore commands) to finish. The TAR team currently has REP-3079 filed to mitigate the issue, but adhering to the CSOT spec would be preferable. The mongod server logs show the following failure:
{"t":{"$date":"2023-08-10T08:01:37.411+00:00"},"s":"W", "c":"QUERY", "id":20478, "ctx":"conn7899290","msg":"getMore command executor error","attr":{"error":{"code":50,"codeName":"MaxTimeMSExpired","errmsg":"operation exceeded time limit"},"stats":{"stage":"FETCH","filter":{"$and":[{"$expr":{"$gte":["$_id",{"$const":{"$oid":"60a5a031b46b1000131feef3"}}]}},{"$expr":{"$lte":["$_id",{"$const":{"$oid":"63c70228caa93700121126f7"}}]}}]},"nReturned":15402655,"works":15402655,"advanced":15402655,"needTime":0,"needYield":0,"saveState":32929,"restoreState":32928,"isEOF":0,"docsExamined":15402655,"alreadyHasObj":0,"inputStage":{"stage":"IXSCAN","nReturned":15402655,"works":15402655,"advanced":15402655,"needTime":0,"needYield":0,"saveState":32929,"restoreState":32928,"isEOF":0,"keyPattern":{"_id":1},"indexName":"_id_","isMultiKey":false,"multiKeyPaths":{"_id":[]},"isUnique":true,"isSparse":false,"isPartial":false,"indexVersion":2,"direction":"forward","indexBounds":{"_id":["[ObjectId('60a5a031b46b1000131feef3'), ObjectId('63c70228caa93700121126f7')]"]},"keysExamined":15402655,"seeks":1,"dupsTested":0,"dupsDropped":0}},"cmd":{"getMore":7958325070821658687,"collection":"assumptions"}}}
Definition of done
- Add a timeoutMode option for Find and Aggregate that can be either "Iteration" or "CursorLifetime". The default should be "Iteration". See the CSOT Cursors section for behavior details.
Pitfalls
The recommended implementation here does not align with the current CSOT spec, but aligns with the recommended changes in DRIVERS-2722. If we decide to keep the CSOT spec the way it is, the Go Driver will behave differently than the spec and possibly other drivers.
is related to
- GODRIVER-3172 Read responses in the background after an operation timeout (Closed)
- GODRIVER-3193 Don't use "background reads" when the CSOT doesn't send "maxTimeMS" (Closed)
related to
- GODRIVER-2622 Stabilize CSOT (Development Complete)
- GODRIVER-3217 Allow manually specifying maxTimeMS on commands when the auto-calculated value is omitted (Closed)
- DRIVERS-2722 Change CSOT default cursor timeout mode to ITERATION (Backlog)