- Type: Task
- Resolution: Unresolved
- Priority: Unknown
- None
- Component/s: Retryability
- Needed
Investigate AI-agent analysis report on the retryable-writes specification. This ticket tracks findings identified during a systematic review of the spec: missing test coverage, ambiguous normative language, and spec/test inconsistencies.
Missing Tests
- Error code 20 with errmsg starting with "Transaction numbers" MUST produce an actionable error message (prose test only, no YAML).
- PoolClearedError during connection checkout MUST get RetryableWriteError label (prose test 2 only, no unified test; PR #1804 / DRIVERS-1815 implementing).
- WriteConcernError with RetryableWriteError label + NoWritesPerformed on retry: MUST return original error (prose test 3 only).
- Sharded cluster: retry MUST target a different mongos (deprioritization). Prose tests 4 & 5 only; not expressible in YAML.
- Write commands inside transactions MUST NOT be retried and MUST NOT get RetryableWriteError labels.
- Network error during initial connection handshake MUST get RetryableWriteError label (handshakeError.yml exists but completeness unclear).
- CSOT-enabled retries: multiple attempts MUST be allowed (vs. single attempt without CSOT). No CSOT-specific retry count tests (DRIVERS-3247 ready for work).
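For reference while writing the missing tests, here is a minimal Python sketch of two of the behaviors listed above: the error code 20 message rewrite and the driver-side RetryableWriteError labeling. `ObservedError`, `actionable_message`, and `apply_driver_labels` are hypothetical names, not any driver's API. The replacement message text is an assumption about the spec wording and should be verified against the spec. The labeling rule follows the wording of this ticket only.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

# Assumed wording of the actionable message; verify against the spec text.
ACTIONABLE_MESSAGE = (
    "This MongoDB deployment does not support retryable writes. "
    "Please add retryWrites=false to your connection string."
)


@dataclass
class ObservedError:
    """Hypothetical, driver-agnostic view of a failed write attempt."""
    code: Optional[int] = None
    errmsg: str = ""
    is_network_error: bool = False
    is_pool_cleared: bool = False
    labels: Set[str] = field(default_factory=set)


def actionable_message(err: ObservedError) -> str:
    """Replace the server message only for code 20 + "Transaction numbers"."""
    if err.code == 20 and err.errmsg.startswith("Transaction numbers"):
        return ACTIONABLE_MESSAGE
    return err.errmsg


def apply_driver_labels(
    err: ObservedError, *, retry_writes: bool, in_transaction: bool
) -> ObservedError:
    """Add RetryableWriteError for driver-level failures (network errors,
    PoolClearedError), but never for writes inside a transaction."""
    if retry_writes and not in_transaction and (
        err.is_network_error or err.is_pool_cleared
    ):
        err.labels.add("RetryableWriteError")
    return err
```

A unified test for the first item would need to assert on the surfaced message rather than the raw server errmsg, which may be why it currently exists only as a prose test.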
Ambiguities
- Server selection deprioritization: "Failed server MUST be passed as deprioritized". Does this exclude it, rank it lower, or merely mark it? Prose test 4 acknowledges this cannot be reliably tested without external tools.
- bulkWrite eligibility granularity: Eligibility evaluated "after order and batch splitting individually". If UpdateMany and InsertOne are in different batches, is each batch evaluated independently?
- mongod vs. mongos error detection: writeConcernError codes are only valid on mongod responses. How do drivers distinguish in a sharded cluster where mongos forwards shard errors?
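To make the first two questions concrete, the sketch below encodes one possible reading of each. It is not the spec's resolution. For deprioritization it assumes "exclude, but fall back to the full list if nothing else is available". For bulkWrite it assumes each batch is evaluated independently. All names (`candidate_servers`, `split_ordered`, `batch_eligibility`) are hypothetical, and batching is simplified to grouping consecutive operations by command type, ignoring size limits.

```python
from itertools import groupby
from typing import List, Sequence, Set


def candidate_servers(available: Sequence[str], deprioritized: Set[str]) -> List[str]:
    """One reading of "deprioritized": drop those servers from the candidate
    list, unless that would leave no candidates at all."""
    preferred = [s for s in available if s not in deprioritized]
    return preferred if preferred else list(available)


# Simplified mapping from bulk write model to the command that carries it.
COMMAND_FOR = {
    "insertOne": "insert",
    "updateOne": "update",
    "updateMany": "update",
    "replaceOne": "update",
    "deleteOne": "delete",
    "deleteMany": "delete",
}
MULTI_DOCUMENT_OPS = {"updateMany", "deleteMany"}


def split_ordered(ops: Sequence[str]) -> List[List[str]]:
    """Group consecutive operations that map to the same command type."""
    return [list(group) for _, group in groupby(ops, key=COMMAND_FOR.__getitem__)]


def batch_eligibility(ops: Sequence[str]) -> List[bool]:
    """Per-batch reading: a batch is retryable iff it has no multi-document op."""
    return [
        not any(op in MULTI_DOCUMENT_OPS for op in batch)
        for batch in split_ordered(ops)
    ]
```

Under these assumptions, `batch_eligibility(["insertOne", "updateMany", "insertOne"])` evaluates to `[True, False, True]`: the UpdateMany batch is sent without a transaction ID while the two insert batches remain retryable. If the spec instead intends whole-operation evaluation, the entire bulk write would be ineligible, which is the distinction the second question asks the spec to settle.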
Inconsistencies
- updateMany.yml and deleteMany.yml: Both say "ignores retryWrites" and verify no txnNumber. The spec language ("MUST NOT add a transaction ID") is a firm enforcement requirement, not a behavioral choice, so the naming is misleading.
- PoolClearedError label requirement vs. test coverage: Spec says MUST add RetryableWriteError to pool errors, but this is driver-level (from CMAP), not server-level. The distinction in when to apply the label is not consistently tested.
- operationId guidance: Commands in a retry SHOULD share operationId but drivers SHOULD NOT use operationId to relay transaction ID info. This is somewhat contradictory for multi-command bulk writes.
Notes
- 35 unified test files with good coverage of single/multi-statement operations, error labels, and server errors.
- 5 prose tests (PoolClearedError, WriteConcernError, sharded cluster deprioritization); ~12% of critical coverage is manual.
- CSOT behavioral change (2022-10-18: "multiple retries allowed when CSOT enabled") has no corresponding test coverage.
- DRIVERS-3296 (Backlog): clarify expected behavior for command logging, retryable writes and write concern errors.
- DRIVERS-3352 (Backlog): add RetryableError labels to retryability eligibility.
- is related to
  - DRIVERS-3296 Clarify expected behavior for command logging, retryable writes and write concern errors (Backlog)
  - DRIVERS-3352 Add RetryableError labels to retryability eligibility for retryable reads/writes (Backlog)
  - DRIVERS-3247 Improve TimeOutException handling in retryable reads/writes (Ready for Work)
- related to
  - DRIVERS-3484 Spec gap analysis: missing tests, ambiguities, and inconsistencies across all 42 components (Closed)