Type: Spec Change
Resolution: Unresolved
Priority: Unknown
Fix Version/s: None
Component/s: CSOT
Labels:
- need-secondary
- spec-design

Epic Link:
CSOT GA
Driver Changes:
Needed
Quarter:
- FY26Q4-candidate
Downstream Changes Summary:
Summary of necessary driver changes

Commits for syncing spec/prose tests
(and/or refer to an existing language POC if needed)

Context for other referenced/linked tickets

Summary

timeoutMS is refreshed for killCursors when cleaning up a timed-out cursor. However, this means that a cursor with LIFETIME=timeoutMS may actually live for up to 2 * timeoutMS. This seems like an acceptable experience for users who create cursors themselves: it's better to ensure the cursor is cleaned up than to leak cursors on the server. However, there are places in the driver where cursors are used internally (auto encryption, the client bulk write API, GridFS) and are considered implementation details. This means that, if the cursor must be killed, a driver operation may take up to 2 * timeoutMS. This will be surprising to users because they will have no idea that cursors were involved.

I encountered this while writing a client bulk write test. The following test always fails because killCursors consistently takes ~500ms against my local, standalone server:

  // Block the first getMore for 1400ms.
  await configureFailPoint(this.configuration, {
    configureFailPoint: 'failCommand',
    mode: { times: 1 },
    data: { blockConnection: true, blockTimeMS: 1400, failCommands: ['getMore'] }
  });
  // Ensures that the bulkWrite command results in a two-batch cursor.
  const models = await makeMultiResponseBatchModelArray(this.configuration);
  const start = now();
  const timeoutError = await client
    .bulkWrite(models, {
      verboseResults: true,
      timeoutMS: 1500
    })
    .catch(e => e);

  const end = now();
  expect(timeoutError).to.be.instanceOf(MongoError);

  // `commands` is assumed to be collected elsewhere in the test (not shown here).
  console.error(inspect(commands, { depth: Infinity }));
  // The operation should return within 100ms of timeoutMS.
  expect(end - start).to.be.within(1500 - 100, 1500 + 100);
  expect(commands).to.have.lengthOf(1);
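
As a rough check on the numbers in the test above, the arithmetic can be sketched as follows. This is a hypothetical model, not driver code: `worstCaseElapsedMS` and its parameters are names invented for illustration. It assumes that cleanup starts only after the operation has spent its full timeoutMS budget and that killCursors receives a fresh budget of timeoutMS.

```javascript
// Hypothetical model of the blocking behavior described in this ticket; not driver code.
// Assumes the operation consumes its full timeoutMS budget before cleanup begins,
// and that killCursors receives a refreshed budget of timeoutMS.
function worstCaseElapsedMS(timeoutMS, killCursorsMS) {
  // killCursors cannot run longer than its own refreshed budget.
  const cleanupMS = Math.min(killCursorsMS, timeoutMS);
  return timeoutMS + cleanupMS;
}

// Numbers from the test above: timeoutMS = 1500, killCursors observed at roughly 500ms.
const elapsed = worstCaseElapsedMS(1500, 500);
console.log(elapsed); // 2000
console.log(elapsed <= 1500 + 100); // false: outside the window the test asserts
console.log(worstCaseElapsedMS(1500, 5000)); // 3000, i.e. 2 * timeoutMS
```

Under these assumptions the call returns after about 2000ms, which falls outside the 1400ms to 1600ms window the test asserts, and the upper bound is 2 * timeoutMS.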

We should consider making cleanup operations non-blocking, so that control returns to the user within timeoutMS (or as close to it as possible) while the driver still attempts to clean up server resources.
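
One possible shape for this, as a minimal sketch rather than a proposed implementation: the error is thrown to the caller immediately, and cleanup is started with its own refreshed timeoutMS budget but is not awaited. All names here (`sleep`, `withTimeout`, `runWithBackgroundCleanup`, `demo`) are hypothetical helpers, not driver APIs, and the timers stand in for a blocked getMore and a slow killCursors.

```javascript
// Hypothetical sketch; none of these helpers are driver APIs.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Rejects with a timeout error if `promise` does not settle within `ms`.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Runs `operation` under timeoutMS. On failure, starts `cleanup` with a fresh
// timeoutMS budget but does not await it, so the error reaches the caller first.
async function runWithBackgroundCleanup(operation, cleanup, timeoutMS) {
  try {
    return await withTimeout(operation(), timeoutMS);
  } catch (error) {
    // Fire and forget: cleanup failures are swallowed so they never mask the
    // original error or surface as an unhandled rejection.
    withTimeout(Promise.resolve().then(() => cleanup()), timeoutMS).catch(() => {});
    throw error;
  }
}

// Usage: the operation times out after 100ms; cleanup needs another 50ms.
async function demo() {
  let cursorKilled = false;
  const start = Date.now();
  const error = await runWithBackgroundCleanup(
    () => sleep(1000), // stands in for a blocked getMore
    async () => {
      await sleep(50); // stands in for a slow killCursors
      cursorKilled = true;
    },
    100
  ).catch(e => e);
  const elapsedMS = Date.now() - start;
  const killedAtReturn = cursorKilled; // cleanup has not finished yet
  await sleep(100); // give the background cleanup time to complete
  return { error, elapsedMS, killedAtReturn, killedLater: cursorKilled };
}

demo().then(({ elapsedMS, killedAtReturn, killedLater }) =>
  console.log({ elapsedMS, killedAtReturn, killedLater })
);
```

In this sketch the caller sees the timeout error after roughly timeoutMS, while the cleanup completes shortly afterwards in the background. A real driver would also have to decide how such background work interacts with connection checkout and client close, which this sketch does not address.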

Motivation

Who is the affected end user?

CSOT users.

Who are the stakeholders?

How does this affect the end user?

Operations that use cursors internally may unpredictably take significantly longer than timeoutMS to time out. This can cause confusion: I was originally confused by my own test until I realized that killCursors was taking a long time and adding to the total elapsed time.

Are they blocked? Are they annoyed? Are they confused?

How likely is it that this problem or use case will occur?

Unsure.

Main path? Edge case?

If the problem does occur, what are the consequences and how severe are they?

Minor annoyance at a log message? Performance concern? Outage/unavailability? Failover can't complete?

Is this issue urgent?

No.

Does this ticket have a required timeline? What is it?

Is this ticket required by a downstream team?

No.

Needed by e.g. Atlas, Shell, Compass?

Is this ticket only for tests?

No.

Does this ticket have any functional impact, or is it just test improvements?

Acceptance Criteria

Cursor cleanup should run in a non-blocking way so that timeout errors are propagated to the user's application as soon as possible.

What specific requirements must be met to consider the design phase complete?

is related to

DRIVERS-2990 Clarify that drivers should kill cursors after timeout errors

Needs Triage

related to

DRIVERS-2347 Prevent conflating operation timeout with connection establishment timeout

Backlog

Assignee: Unassigned
Reporter: Bailey Pearson
Votes: 0
Watchers: 4

Created: Oct 10 2024 05:02:19 PM UTC
Updated: Aug 11 2025 07:00:53 PM UTC
