- Type: Spec Change
- Resolution: Unresolved
- Priority: Unknown
- Component/s: CSOT
Summary
timeoutMS is refreshed for killCursors when cleaning up a timed-out cursor. However, this means that a cursor with LIFETIME=timeoutMS may actually live for up to 2 * timeoutMS. This seems like an acceptable experience for users who create cursors themselves: it's better to ensure the cursor is cleaned up than to leak cursors on the server. However, there are places in the driver where cursors are used internally (auto encryption, the client bulk write API, GridFS) and the cursors are considered implementation details. This means that, if the cursor must be killed, a driver operation may take up to 2 * timeoutMS. This will be surprising to users because they will have no idea cursors were involved.
I actually encountered this while writing a client bulk write test. The following test always fails because killCursors consistently takes ~500ms against my local, standalone server:
```typescript
await configureFailPoint(this.configuration, {
  configureFailPoint: 'failCommand',
  mode: { times: 1 },
  data: { blockConnection: true, blockTimeMS: 1400, failCommands: ['getMore'] }
});

const models = await makeMultiResponseBatchModelArray(this.configuration); // ensures that the bulkWrite command results in a two-batch cursor
const start = now();
const timeoutError = await client
  .bulkWrite(models, { verboseResults: true, timeoutMS: 1500 })
  .catch(e => e);
const end = now();

expect(timeoutError).to.be.instanceOf(MongoError);
console.error(inspect(commands, { depth: Infinity }));
expect(end - start).to.be.within(1500 - 100, 1500 + 100);
expect(commands).to.have.lengthOf(1);
```
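The arithmetic behind the failing assertion can be sketched as follows. This is a rough model, not driver code: it assumes the operation times out at roughly timeoutMS and that killCursors then runs with a refreshed timeoutMS budget before control returns to the caller. `observedDurationMS` is a hypothetical helper introduced only for illustration.

```typescript
// Hypothetical helper (not a driver API): models the wall-clock time the caller
// observes when cleanup is awaited after the operation has already timed out.
function observedDurationMS(timeoutMS: number, killCursorsMS: number): number {
  // killCursors gets its own refreshed budget, so it can add at most timeoutMS.
  const cleanupMS = Math.min(killCursorsMS, timeoutMS);
  return timeoutMS + cleanupMS;
}

// Numbers from the test above: timeoutMS = 1500, killCursors takes ~500ms.
const observed = observedDurationMS(1500, 500);

// The test expects the elapsed time to fall within 1500 +/- 100.
const withinTestWindow = observed >= 1500 - 100 && observed <= 1500 + 100;

// If killCursors itself used up its whole budget, the caller would wait 2 * timeoutMS.
const worstCase = observedDurationMS(1500, 10_000);

console.log({ observed, withinTestWindow, worstCase });
```

Under these assumptions the elapsed time lands near 2000ms, outside the 1400-1600ms window the test checks, and the upper bound is 2 * timeoutMS.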
We should consider making cleanup operations non-blocking, so that control returns to the user within timeoutMS (or as close to it as possible) while the driver still attempts to clean up server resources.
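One way to picture non-blocking cleanup is the minimal sketch below. It is not how any driver implements this; `withNonBlockingCleanup`, `sleep`, and `demo` are hypothetical names, the `operation` callback stands in for a blocked getMore, and the `cleanup` callback stands in for killCursors. On timeout, cleanup is started but not awaited, so the timeout error reaches the caller immediately.

```typescript
// Hypothetical sketch; none of these names are driver APIs.
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function withNonBlockingCleanup<T>(
  operation: () => Promise<T>,
  cleanup: () => Promise<void>,
  timeoutMS: number
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  let timedOut = false;
  // Rejects once timeoutMS has elapsed.
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      timedOut = true;
      reject(new Error(`operation exceeded timeoutMS=${timeoutMS}`));
    }, timeoutMS);
  });
  try {
    return await Promise.race([operation(), deadline]);
  } catch (error) {
    if (timedOut) {
      // Start cleanup without awaiting it; failures are swallowed (best effort).
      void cleanup().catch(() => undefined);
    }
    throw error;
  } finally {
    clearTimeout(timer);
  }
}

// Simulated run: the operation hangs for 500ms, cleanup takes 300ms, timeoutMS is 50.
async function demo(): Promise<{ elapsedMS: number; cleanupDoneAtReturn: boolean }> {
  let cleanupDone = false;
  const start = Date.now();
  await withNonBlockingCleanup(
    () => sleep(500),
    async () => {
      await sleep(300);
      cleanupDone = true;
    },
    50
  ).catch(() => undefined); // the timeout error is expected in this demo
  return { elapsedMS: Date.now() - start, cleanupDoneAtReturn: cleanupDone };
}

demo().then(result => console.log(result));
```

In this simulation the caller gets the timeout error after roughly timeoutMS while cleanup keeps running in the background. The trade-off is that cleanup failures are no longer visible to the caller, and a real implementation would still need to bound the background killCursors with its own timeout.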
Motivation
Who is the affected end user?
CSOT users.
Who are the stakeholders?
How does this affect the end user?
Operations that use cursors internally may unpredictably take significantly longer than timeoutMS to time out. This can cause confusion (I was confused by my own test at first, until I realized that killCursors was taking a long time and adding to the total elapsed time).
Are they blocked? Are they annoyed? Are they confused?
How likely is it that this problem or use case will occur?
Unsure.
Main path? Edge case?
If the problem does occur, what are the consequences and how severe are they?
Minor annoyance at a log message? Performance concern? Outage/unavailability? Failover can't complete?
Is this issue urgent?
No.
Does this ticket have a required timeline? What is it?
Is this ticket required by a downstream team?
No.
Needed by e.g. Atlas, Shell, Compass?
Is this ticket only for tests?
No.
Does this ticket have any functional impact, or is it just test improvements?
Acceptance Criteria
- Cursor cleanup should run in a non-blocking way, so that timeout errors can be propagated to the user's application as soon as possible.
What specific requirements must be met to consider the design phase complete?
- is related to DRIVERS-2990 Clarify that drivers should kill cursors after timeout errors
- Needs Triage