Type: Task
Resolution: Unresolved
Priority: Unknown
Component/s: Backpressure, Retryability
Summary
Implement server-side handling for retry metadata sent from drivers via the centralized telemetry channel (as defined in DRIVERS-3463), so the server can log and surface per-command retry behavior (e.g., backpressure-related retries) in a structured, queryable form.
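For orientation, here is a minimal sketch of what a telemetry-bearing command might carry, assuming a $retry document with a single attempt-count field r (as referenced in the acceptance criteria below). The real field names and payload shape are owned by DRIVERS-3463 and the spec work in this ticket; this is purely illustrative.

```python
# Hypothetical shape only -- the actual payload is defined by DRIVERS-3463 and
# the server-side spec produced by this ticket.
example_command_with_retry_metadata = {
    "find": "orders",
    "filter": {"status": "pending"},
    "$db": "app",
    # Driver-attached retry metadata via the centralized telemetry channel:
    "$retry": {
        "r": 2,  # assumed: number of prior attempts for this logical operation
    },
}
```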
Motivation
What is the problem or use case, what are we trying to achieve? Drivers will emit retry metadata (e.g., $retry or equivalent) through a centralized telemetry channel, but today the server does not interpret, validate, or log this information in a consistent way. We need the server to reliably consume this metadata so that retry storms and backpressure behavior are observable and actionable.
Who is the affected end user? Operators and SREs (Atlas and self-managed) and internal server/driver teams who need accurate, per-command retry data to tune backpressure, retry policies, and token bucket–style controls.
Who are the stakeholders? Core Server / Storage Engine, client-side backpressure architect, driver leads, and product/feature owners for backpressure, retryability, and telemetry.
How does this affect the end user? Without server-side handling, retry metadata sent by clients is effectively “dark data”: operators cannot see typical or worst-case retry depth, and teams can’t correlate retry behavior with overload incidents or configuration.
Are they blocked? Are they annoyed? Are they confused? Yes, blocked: this work blocks the execution of DRIVERS-3462, since we need retry telemetry in 9.0 to inform how token buckets should be implemented.
How likely is it that this problem or use case will occur? Main path? Edge case? Main path for any deployment using client backpressure and unified telemetry; every overload-related retry path is expected to emit this metadata once DRIVERS-3463 is implemented.
If the problem does occur, what are the consequences and how severe are they? We lose critical visibility into retry behavior, making it harder to safely roll out backpressure and token bucket designs, and to distinguish healthy retries from pathological retry storms.
Is this issue urgent? Yes. This is part of the Phase 2 fast-follow work for backpressure and is required to unblock downstream initiatives such as token bucket design that depend on reliable retry telemetry.
Does this ticket have a required timeline? What is it? Yes. This work should be completed in time for the 9.0 server release, alongside the telemetry channel work in DRIVERS-3463.
Is this ticket required by a downstream team? Needed by e.g. Atlas, Shell, Compass? Yes, for Atlas and server observability pipelines, and for any internal consumers that analyze retry patterns (e.g., backpressure tuning, token bucket design, incident response tooling).
Is this ticket only for tests? No. This is production server behavior for consuming and logging retry telemetry; tests are needed but are not the primary goal.
Does this ticket have any functional impact, or is it just test improvements? Functional impact: the server will parse, validate, and log retry metadata from telemetry-bearing commands and make it available to logging/metrics pipelines.
Acceptance Criteria
What specific requirements must be met to consider the design phase complete?
A server-side spec design and accompanying spec update are written and accepted that explicitly answer the following for retry telemetry:
- Input / schema handling
  - Define the server-side schema for retry metadata (e.g., $retry with fields like r) and how it aligns with the telemetry channel from DRIVERS-3463.
  - Specify validation rules (allowed fields, value ranges, and behavior on malformed or unexpected data); see the validation sketch following this list.
- Processing and logging behavior
  - Describe how retry metadata is ingested from the telemetry channel on each eligible command.
  - Describe how it is logged and/or exported to metrics/observability systems; see the logging sketch following this list.
  - Explain expected behavior for multiple retries of the same command (e.g., how repeated r values appear in logs/metrics and how they should be interpreted).
- Scope and gating
  - Define which commands are allowed to carry retry metadata (e.g., exclusions for monitoring/handshake commands, alignment with backpressure/retry policy).
  - Document wire version / feature flag gating so retry metadata is only processed when supported and when the telemetry channel from DRIVERS-3463 is enabled; see the gating sketch following this list.
- Integration and dependencies
  - State explicitly that this work assumes DRIVERS-3463 has defined the centralized telemetry channel and client-side emission of retry metadata.
  - Call out expected integration points with token bucket design, backpressure metrics, or other server subsystems (at a high level).
- Testing and observability
  - Outline a test approach that covers:
    - Commands with and without retry metadata.
    - Valid vs. invalid retry metadata.
    - Typical retry loops (e.g., increments of r) and how they appear in logs/metrics.
  - Provide example log/metric output formats so downstream consumers (Atlas, internal tooling) know what to expect; see the illustrative records following this list.
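The sketch below is illustrative only: it assumes a $retry document with a single integer field r and a placeholder log shape. The accepted spec from this ticket, together with DRIVERS-3463, defines the real schema, validation rules, and log/metric format; nothing here is decided.

```python
"""Illustrative sketch only: the field names ($retry, r), the limits, and the
log shape below are assumptions for discussion; the accepted spec from this
ticket and DRIVERS-3463 defines the real schema and behavior."""
import json
from typing import Any, Optional

ALLOWED_FIELDS = {"r"}        # assumed: only an attempt counter for now
MAX_RETRY_COUNT = 10_000      # assumed sanity bound on the counter value


def validate_retry_metadata(doc: Any) -> Optional[dict]:
    """Return a normalized {'r': int} document, or None if the metadata is
    malformed (wrong type, unknown fields, or out-of-range values).  Whether
    malformed metadata fails the command or is dropped and counted is a
    question for the spec; this sketch just drops it."""
    if not isinstance(doc, dict):
        return None
    if set(doc) - ALLOWED_FIELDS:
        return None
    r = doc.get("r")
    if isinstance(r, bool) or not isinstance(r, int):
        return None
    if r < 0 or r > MAX_RETRY_COUNT:
        return None
    return {"r": r}


def log_retry_metadata(command_name: str, retry_doc: dict) -> str:
    """Emit one structured, queryable record per telemetry-bearing command.
    Attribute names are placeholders, not an agreed log format."""
    record = {
        "msg": "Command carried retry telemetry",
        "attr": {"command": command_name, "retryAttempt": retry_doc["r"]},
    }
    line = json.dumps(record)
    print(line)
    return line


if __name__ == "__main__":
    # A typical retry loop: the same logical operation arrives three times with
    # r incrementing, so repeated records with a growing retryAttempt are the
    # expected (healthy) pattern in the logs/metrics.
    for attempt in range(3):
        validated = validate_retry_metadata({"r": attempt})
        if validated is not None:
            log_retry_metadata("find", validated)

    # Malformed metadata (unknown field, wrong type) is rejected, not logged.
    assert validate_retry_metadata({"r": 1, "bogus": True}) is None
    assert validate_retry_metadata({"r": "two"}) is None
```

The __main__ section illustrates the repeated-retry behavior called out above: each attempt produces one record with an incrementing retryAttempt, so a healthy retry loop shows up downstream as a short run of records rather than a single aggregated value.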
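Similarly, a rough sketch of the wire version / feature flag gating item, assuming a per-release feature flag plus a check that the DRIVERS-3463 telemetry channel has been negotiated; the flag, the negotiation signal, and the exclusion list are placeholders, not actual server parameters.

```python
"""Illustrative sketch of the gating check; the flag, the negotiation signal,
and the exclusion list are placeholders, not actual server parameters."""

# Assumed exclusions: monitoring/handshake traffic never carries retry metadata.
EXCLUDED_COMMANDS = {"hello", "isMaster", "ping"}


def should_process_retry_metadata(
    command_name: str,
    feature_flag_enabled: bool,       # e.g., a 9.0 feature flag gating this work
    telemetry_channel_enabled: bool,  # the DRIVERS-3463 channel was negotiated
) -> bool:
    """Process retry metadata only when the feature is enabled, the centralized
    telemetry channel is active, and the command is eligible to carry it."""
    if not feature_flag_enabled or not telemetry_channel_enabled:
        return False
    return command_name not in EXCLUDED_COMMANDS


if __name__ == "__main__":
    assert should_process_retry_metadata("find", True, True)
    assert not should_process_retry_metadata("hello", True, True)
    assert not should_process_retry_metadata("find", True, False)
```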
Issue Links
- blocks
  - DRIVERS-3465 Token Bucket retry per-server (Needs Triage)
- is blocked by
  - DRIVERS-3463 Implement Client-Side Telemetry Communication to the Server (Needs Triage)
- is related to
  - DRIVERS-3462 Implement Token bucket retry per-node (Needs Triage)