- Type: Task
- Resolution: Unresolved
- Priority: Unknown
- Component/s: Backpressure, Retryability
Summary
Define and document a standardized, centralized mechanism for drivers to send per-operation client-side telemetry (e.g., retry metadata, OpenTelemetry contexts, and client configuration data) to the server, with the current working assumption that this will be the new OP_MSG telemetry section. The primary goal is to ensure the server can ingest and process rich client-side telemetry about retries, feature usage, and tracing, regardless of whether OP_MSG or another centralized transport is ultimately chosen.
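The exact shape of the telemetry section is deliberately left to the design phase; purely as an illustration (not an agreed schema), a per-operation telemetry payload might carry retry metadata, an OpenTelemetry trace context, and a reference back to the client configuration. All field names below are hypothetical.

```python
# Hypothetical per-operation telemetry payload; every field name and the
# overall layout are illustrative assumptions -- defining the real schema
# is the work tracked by this ticket.
telemetry = {
    "retry": {  # retry attempt metadata (attempt number, prior error)
        "attempt": 2,
        "prevError": "NotWritablePrimary",
    },
    "otel": {  # W3C-style trace context for OpenTelemetry correlation
        "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    },
    "client": {  # stable reference to the handshake client metadata
        "configHash": "c0ffee",
    },
}
```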
Motivation
What is the problem or use case, what are we trying to achieve?
Today, client metadata is the only general-purpose mechanism drivers use to send telemetry to the server, but it is capped at 512 bytes and is not associated with individual operations, which prevents us from attaching richer, per-command telemetry such as retry counts, feature usage, and tracing context. We need a centralized, extensible telemetry channel (currently, the OP_MSG telemetry section is the most viable candidate) that lets the server observe phenomena like retry storms, correlate behavior with client configuration, and consume OpenTelemetry signals directly from drivers. If another centralized solution emerges during design that better meets these goals, using it instead of OP_MSG is acceptable as long as it still enables the same telemetry flow from client to server.
Who is the affected end user?
Primary external end users are Atlas and self-managed operators/SREs who rely on accurate telemetry to understand overload behavior, performance regressions, and feature usage; internally this also supports server, cloud, and driver teams making design decisions based on observed client behavior rather than conjecture.
Who are the stakeholders?
Key stakeholders include Core Server / Storage Engine, the client-side backpressure architect, driver team leads across languages, and the feature/product owner for client backpressure and telemetry, as well as teams working on OpenTelemetry support and observability pipelines.
How does this affect the end user?
Without a robust centralized telemetry path, we cannot reliably measure retry behavior (e.g., frequency and depth of retry storms), nor can we tie server-side symptoms back to specific client configurations or tracing context; this makes it harder for users to diagnose incidents and for MongoDB teams to tune behaviors like backpressure and retry policies.
Are they blocked? Are they annoyed? Are they confused?
Users are not hard-blocked, but they lack actionable visibility into how drivers behave under overload and how client-side retries and tracing correlate with observed server issues, which can lead to slower incident resolution and less confidence in tuning behaviors like backpressure and retry policies.
How likely is it that this problem or use case will occur? Main path? Edge case?
As we roll out client backpressure, richer retry telemetry, and OpenTelemetry-based tracing, this becomes a main-path use case: a centralized telemetry mechanism (with OP_MSG as the current design focus) is intended to unify client configuration metadata, retry attempt metadata, and OpenTelemetry contexts across all drivers.
If the problem does occur, what are the consequences and how severe are they?
The main consequences are: (1) inability for the server to observe and react to retry storms based on client-provided metadata, (2) reduced ability to make data-driven design decisions about backpressure and driver behavior, and (3) lost opportunity to consolidate telemetry (debug logging, OpenTelemetry, retry telemetry) behind a single, consistent API and transport, leading to fragmented observability.
Is this issue urgent?
Yes. This work is a Phase 2 fast-follow item from the backpressure decision and is blocking several dependent initiatives, including token bucket design, that rely on reliable client-side telemetry to make informed design and rollout decisions.
Does this ticket have a required timeline? What is it?
Yes. This work must be completed in time for the 9.0 server release, so that the chosen telemetry mechanism is available when 9.0 ships and can be used by dependent initiatives (including token bucket design) as they roll out.
Is this ticket required by a downstream team? Needed by e.g. Atlas, Shell, Compass?
This work is implicitly required by server and Atlas observability/telemetry pipelines, which need a standardized telemetry channel to ingest client configuration, retry metadata, and OpenTelemetry context, and by driver teams implementing backpressure retry telemetry and OpenTelemetry support in a unified way; explicit downstream consumers (e.g., specific Atlas features or tools) are not enumerated in the current docs.
Is this ticket only for tests?
No. The chosen telemetry channel (currently the OP_MSG telemetry section) must be implemented as a real, production path between drivers and server; individual telemetry projects (e.g., retry telemetry, OpenTelemetry contexts) will have their own tests, but the scope here is the design and standardization of the underlying transport and semantics, not test-only infrastructure.
Does this ticket have any functional impact, or is it just test improvements?
This has functional impact: drivers will attach telemetry via the agreed centralized mechanism (initially expected to be an OP_MSG telemetry section, plus $retry as a generic argument) to eligible commands, and the server must recognize, validate, and log this data, even if it does not alter command semantics beyond telemetry processing.
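The ticket names $retry as a generic argument but does not fix its contents. A minimal driver-side sketch, assuming $retry is a small document with an attempt counter and that `run_command` stands in for the driver's internal command execution path, might look like this:

```python
import time

MAX_ATTEMPTS = 3

def run_with_retry_telemetry(run_command, command, supports_telemetry):
    """Execute a command, attaching hypothetical $retry metadata on retries.

    The $retry shape (a bare attempt counter) is an assumption for
    illustration, not the schema this ticket will ultimately define.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        cmd = dict(command)
        if supports_telemetry and attempt > 1:
            # Only attach retry telemetry on actual retries, and only when the
            # connected server supports the centralized telemetry mechanism.
            cmd["$retry"] = {"attempt": attempt}
        try:
            return run_command(cmd)
        except ConnectionError:
            if attempt == MAX_ATTEMPTS:
                raise
            time.sleep(0.1 * attempt)  # simple backoff between attempts
```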
Acceptance Criteria
What specific requirements must be met to consider the design phase complete?
- Centralized telemetry channel semantics and schema defined
  - The design specifies a centralized, extensible container for driver-provided telemetry, with the OP_MSG telemetry section as the default design target but explicitly allowing an alternative mechanism if one emerges during ideation (e.g., a different envelope in the wire protocol), provided it can carry the same telemetry with comparable performance and safety.
  - That channel must support at minimum:
    - Retry attempt metadata
    - Client configuration metadata
    - OpenTelemetry contexts
  - The design spells out size limits, formatting requirements, and any security constraints for telemetry payloads (similar to existing client metadata), so the server can safely ingest and process them regardless of the specific centralized transport.
- Attachment points and eligibility rules documented
  - The design details wireVersion / server version gating, ensuring telemetry (including the $retry generic argument) is only attached when the connected server supports whichever centralized telemetry mechanism (OP_MSG section or alternative) we standardize on; see the gating sketch after this list.
- Server-side handling and logging behavior defined
  - The server's behavior for the centralized telemetry channel is specified, including:
    - Validation rules
    - How telemetry is logged and/or forwarded to observability pipelines
- Cross-team alignment and sign-off captured
  - The updated design has explicit LGTM/sign-off from:
    - Server stakeholders (e.g., Core Server / Storage Engine) on telemetry validation, logging, performance characteristics, and on whether OP_MSG or another centralized mechanism is preferred.
    - Driver leads for each language on feasibility and estimated implementation cost for plumbing telemetry through to the centralized channel (OP_MSG or alternative).
    - Product/feature owners for client backpressure and client-side tracing/telemetry on whether the defined telemetry is sufficient for Phase 2 goals and OpenTelemetry alignment.
- Initial POC / reference implementation identified
  - At least one driver (e.g., Python) is identified as the reference implementation, with:
    - A concrete plan to wire retry telemetry ($retry) and, where appropriate, OpenTelemetry or client usage metadata.
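As a rough illustration of the gating described under "Attachment points and eligibility rules documented" above: the minimum wire version and the field name used to carry the payload are placeholders, since both depend on the 9.0 server work this ticket feeds into.

```python
# Placeholder: the real minimum wire version is whatever the 9.0 server work
# assigns to the telemetry mechanism; 27 here is purely illustrative.
TELEMETRY_MIN_WIRE_VERSION = 27

def attach_telemetry(command, telemetry, max_wire_version):
    """Attach the per-operation telemetry payload only when supported.

    Servers that do not advertise support never see the new section/argument,
    so existing command semantics are unchanged for older deployments.
    """
    if telemetry and max_wire_version >= TELEMETRY_MIN_WIRE_VERSION:
        command = dict(command)
        command["$telemetry"] = telemetry  # field name is an assumption
    return command
```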
Issue Links
- blocks: DRIVERS-3464 Implement server-side handling for retry metadata sent from drivers (Needs Triage)
- is related to: DRIVERS-3337 Client Backpressure Improvements (Backlog)