Type: Task
Resolution: Unresolved
Priority: Unknown
Component/s: Backpressure, Retryability
Summary
Implement server-side handling for retry metadata sent from drivers via the centralized telemetry channel (as defined in DRIVERS-3463), so the server can log and surface per-command retry behavior (e.g., backpressure-related retries) in a structured, queryable form.
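For orientation, here is a minimal sketch of what a telemetry-bearing command might carry, assuming a $retry document with a single attempt-count field r (as referenced in the acceptance criteria below). The real field names and payload shape are owned by DRIVERS-3463 and the spec work in this ticket; this is purely illustrative.

```python
# Hypothetical shape only -- the actual payload is defined by DRIVERS-3463 and
# the server-side spec produced by this ticket.
example_command_with_retry_metadata = {
    "find": "orders",
    "filter": {"status": "pending"},
    "$db": "app",
    # Driver-attached retry metadata via the centralized telemetry channel:
    "$retry": {
        "r": 2,  # assumed: number of prior attempts for this logical operation
    },
}
```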
Motivation
What is the problem or use case, what are we trying to achieve? Drivers will emit retry metadata (e.g., $retry or equivalent) through a centralized telemetry channel, but today the server does not interpret, validate, or log this information in a consistent way. We need the server to reliably consume this metadata so that retry storms and backpressure behavior are observable and actionable.
Who is the affected end user? Operators and SREs (Atlas and self-managed) and internal server/driver teams who need accurate, per-command retry data to tune backpressure, retry policies, and token bucket–style controls.
Who are the stakeholders? Core Server / Storage Engine, client-side backpressure architect, driver leads, and product/feature owners for backpressure, retryability, and telemetry.
How does this affect the end user? Without server-side handling, retry metadata sent by clients is effectively “dark data”: operators cannot see typical or worst-case retry depth, and teams can’t correlate retry behavior with overload incidents or configuration.
Are they blocked? Are they annoyed? Are they confused? Yes, blocked: this work blocks the execution of DRIVERS-3462, since we need retry telemetry in 9.0 to inform how token buckets should be implemented.
How likely is it that this problem or use case will occur? Main path? Edge case? Main path for any deployment using client backpressure and unified telemetry; every overload-related retry path is expected to emit this metadata once DRIVERS-3463 is implemented.
If the problem does occur, what are the consequences and how severe are they? We lose critical visibility into retry behavior, making it harder to safely roll out backpressure and token bucket designs, and to distinguish healthy retries from pathological retry storms.
Is this issue urgent? Yes. This is part of the Phase 2 fast-follow work for backpressure and is required to unblock downstream initiatives such as token bucket design that depend on reliable retry telemetry.
Does this ticket have a required timeline? What is it? Yes. This work should be completed in time for the 9.0 server release, alongside the telemetry channel work in DRIVERS-3463.
Is this ticket required by a downstream team? Needed by e.g. Atlas, Shell, Compass? Yes, for Atlas and server observability pipelines, and for any internal consumers that analyze retry patterns (e.g., backpressure tuning, token bucket design, incident response tooling).
Is this ticket only for tests? No. This is production server behavior for consuming and logging retry telemetry; tests are needed but are not the primary goal.
Does this ticket have any functional impact, or is it just test improvements? Functional impact: the server will parse, validate, and log retry metadata from telemetry-bearing commands and make it available to logging/metrics pipelines.
Acceptance Criteria
What specific requirements must be met to consider the design phase complete?
A server-side spec design and accompanying spec update are written and accepted that explicitly answer the following for retry telemetry:
- Input / schema handling
  - Define the server-side schema for retry metadata (e.g., $retry with fields like r) and how it aligns with the telemetry channel from DRIVERS-3463.
  - Specify validation rules (allowed fields, value ranges, and behavior on malformed or unexpected data); see the validation sketch following this list.
- Processing and logging behavior
  - Describe how retry metadata is ingested from the telemetry channel on each eligible command.
  - Describe how it is logged and/or exported to metrics/observability systems; see the logging sketch following this list.
  - Explain expected behavior for multiple retries of the same command (e.g., how repeated r values appear in logs/metrics and how they should be interpreted).
- Scope and gating
  - Define which commands are allowed to carry retry metadata (e.g., exclusions for monitoring/handshake commands, alignment with backpressure/retry policy).
  - Document wire version / feature flag gating so retry metadata is only processed when supported and when the telemetry channel from DRIVERS-3463 is enabled; see the gating sketch following this list.
- Integration and dependencies
  - State explicitly that this work assumes DRIVERS-3463 has defined the centralized telemetry channel and client-side emission of retry metadata.
  - Call out expected integration points with token bucket design, backpressure metrics, or other server subsystems (at a high level).
- Testing and observability
  - Outline a test approach that covers:
    - Commands with and without retry metadata.
    - Valid vs. invalid retry metadata.
    - Typical retry loops (e.g., increments of r) and how they appear in logs/metrics.
  - Provide example log/metric output formats so downstream consumers (Atlas, internal tooling) know what to expect; see the illustrative records following this list.
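The sketch below is illustrative only: it assumes a $retry document with a single integer field r and a placeholder log shape. The accepted spec from this ticket, together with DRIVERS-3463, defines the real schema, validation rules, and log/metric format; nothing here is decided.

```python
"""Illustrative sketch only: the field names ($retry, r), the limits, and the
log shape below are assumptions for discussion; the accepted spec from this
ticket and DRIVERS-3463 defines the real schema and behavior."""
import json
from typing import Any, Optional

ALLOWED_FIELDS = {"r"}        # assumed: only an attempt counter for now
MAX_RETRY_COUNT = 10_000      # assumed sanity bound on the counter value


def validate_retry_metadata(doc: Any) -> Optional[dict]:
    """Return a normalized {'r': int} document, or None if the metadata is
    malformed (wrong type, unknown fields, or out-of-range values).  Whether
    malformed metadata fails the command or is dropped and counted is a
    question for the spec; this sketch just drops it."""
    if not isinstance(doc, dict):
        return None
    if set(doc) - ALLOWED_FIELDS:
        return None
    r = doc.get("r")
    if isinstance(r, bool) or not isinstance(r, int):
        return None
    if r < 0 or r > MAX_RETRY_COUNT:
        return None
    return {"r": r}


def log_retry_metadata(command_name: str, retry_doc: dict) -> str:
    """Emit one structured, queryable record per telemetry-bearing command.
    Attribute names are placeholders, not an agreed log format."""
    record = {
        "msg": "Command carried retry telemetry",
        "attr": {"command": command_name, "retryAttempt": retry_doc["r"]},
    }
    line = json.dumps(record)
    print(line)
    return line


if __name__ == "__main__":
    # A typical retry loop: the same logical operation arrives three times with
    # r incrementing, so repeated records with a growing retryAttempt are the
    # expected (healthy) pattern in the logs/metrics.
    for attempt in range(3):
        validated = validate_retry_metadata({"r": attempt})
        if validated is not None:
            log_retry_metadata("find", validated)

    # Malformed metadata (unknown field, wrong type) is rejected, not logged.
    assert validate_retry_metadata({"r": 1, "bogus": True}) is None
    assert validate_retry_metadata({"r": "two"}) is None
```

The __main__ section illustrates the repeated-retry behavior called out above: each attempt produces one record with an incrementing retryAttempt, so a healthy retry loop shows up downstream as a short run of records rather than a single aggregated value.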
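Similarly, a rough sketch of the wire version / feature flag gating item, assuming a per-release feature flag plus a check that the DRIVERS-3463 telemetry channel has been negotiated; the flag, the negotiation signal, and the exclusion list are placeholders, not actual server parameters.

```python
"""Illustrative sketch of the gating check; the flag, the negotiation signal,
and the exclusion list are placeholders, not actual server parameters."""

# Assumed exclusions: monitoring/handshake traffic never carries retry metadata.
EXCLUDED_COMMANDS = {"hello", "isMaster", "ping"}


def should_process_retry_metadata(
    command_name: str,
    feature_flag_enabled: bool,       # e.g., a 9.0 feature flag gating this work
    telemetry_channel_enabled: bool,  # the DRIVERS-3463 channel was negotiated
) -> bool:
    """Process retry metadata only when the feature is enabled, the centralized
    telemetry channel is active, and the command is eligible to carry it."""
    if not feature_flag_enabled or not telemetry_channel_enabled:
        return False
    return command_name not in EXCLUDED_COMMANDS


if __name__ == "__main__":
    assert should_process_retry_metadata("find", True, True)
    assert not should_process_retry_metadata("hello", True, True)
    assert not should_process_retry_metadata("find", True, False)
```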
Issue Links
- blocks
  - DRIVERS-3465 Token Bucket retry per-server (Needs Triage)
- is blocked by
  - DRIVERS-3463 Implement Client-Side Telemetry Communication to the Server (Needs Triage)
- is related to
  - DRIVERS-3462 Implement Token bucket retry per-node (Needs Triage)