Type: Task
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Query Execution
QE 2026-05-25
Overview
When many concurrent operations hit the same hot document, they can enter a write conflict retry loop that has no upper bound. SERVER-65418 introduces releasing the write ticket before sleeping on a write conflict, which allows other operations to proceed during the backoff. However, this also means each retry frees a ticket that is immediately taken by another operation hitting the same document, creating a positive feedback loop that can exhaust write throughput across the entire server. This ticket investigates whether we should detect this condition and reject retrying operations early rather than letting the storm grow unbounded.
Options to Evaluate
Option 1 - Reject based on ticket pool size (primary candidate)
Track the number of operations currently in the write conflict retry loop. If it exceeds a threshold proportional to the write ticket pool size (e.g. 50%), reject further retries with an error. The threshold scales automatically with pool size, which throughput probing adjusts dynamically. An activation gate also requires the write ticket queue depth to exceed a percentage of pool size, so the breaker stays off when the system has spare capacity.
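A minimal sketch of the Option 1 check, to make the two conditions concrete. Every name here (StormBreakerConfig, WriteConflictStormBreaker, tryEnterRetry, leaveRetry) is hypothetical and does not exist in the server code base. The 0.25 queue-depth fraction is an assumed placeholder, since this ticket does not yet fix a value, and the pool size and queue depth are passed in by the caller rather than read from the real ticket holder.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical tunables. 0.5 mirrors the "e.g. 50%" example in the ticket;
// 0.25 is an assumed value for the activation gate.
struct StormBreakerConfig {
    double retryFraction = 0.5;   // max retrying ops as a fraction of pool size
    double queueFraction = 0.25;  // queue depth fraction that arms the breaker
};

// Hypothetical server-wide counter of operations in the write conflict retry loop.
class WriteConflictStormBreaker {
public:
    explicit WriteConflictStormBreaker(StormBreakerConfig cfg = StormBreakerConfig())
        : _cfg(cfg) {}

    // Call before an operation backs off and retries. Returns false if the
    // retry should be rejected. poolSize and queueDepth are sampled by the
    // caller, so the limit follows whatever throughput probing has set.
    bool tryEnterRetry(std::int64_t poolSize, std::int64_t queueDepth) {
        assert(poolSize > 0);
        const double pool = static_cast<double>(poolSize);
        // Activation gate: stay off while the ticket queue is shallow.
        const bool gateOpen =
            static_cast<double>(queueDepth) > _cfg.queueFraction * pool;
        const auto limit = static_cast<std::int64_t>(_cfg.retryFraction * pool);

        const std::int64_t before = _retrying.fetch_add(1);
        if (gateOpen && before >= limit) {
            _retrying.fetch_sub(1);  // undo: this retry is rejected
            return false;
        }
        return true;
    }

    // Call when the operation leaves the retry loop (success or failure).
    void leaveRetry() {
        _retrying.fetch_sub(1);
    }

    std::int64_t retrying() const {
        return _retrying.load();
    }

private:
    StormBreakerConfig _cfg;
    std::atomic<std::int64_t> _retrying{0};
};
```

With a pool of 8 tickets and the defaults above, the breaker admits at most 4 concurrent retriers once the queue depth exceeds 2. While the gate is closed it still counts retriers, so the count is already accurate at the moment the gate opens.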
Option 2 - Per-operation retry cap
Introduce a configurable maximum number of write conflict retries per operation for user queries. Simple to implement but does not respond to system-wide pressure.
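For comparison, a per-operation cap could look roughly like the sketch below. The names (AttemptResult, RetryOutcome, runWithRetryCap) and the isUserOperation flag are hypothetical and exist only for illustration; the real retry loop and its backoff are not reproduced here.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical result types, for illustration only.
enum class AttemptResult { kDone, kWriteConflict };
enum class RetryOutcome { kSucceeded, kRetriesExhausted };

// Runs `attempt` until it succeeds. User operations give up after
// `maxRetries` retries; other operations keep retrying without a cap.
inline RetryOutcome runWithRetryCap(const std::function<AttemptResult()>& attempt,
                                    std::uint64_t maxRetries,
                                    bool isUserOperation) {
    assert(static_cast<bool>(attempt));
    std::uint64_t retries = 0;
    for (;;) {
        if (attempt() == AttemptResult::kDone) {
            return RetryOutcome::kSucceeded;
        }
        if (isUserOperation && retries >= maxRetries) {
            return RetryOutcome::kRetriesExhausted;
        }
        ++retries;
        // A real implementation would back off here before the next attempt.
    }
}
```

The cap is evaluated per operation, so a storm of many operations that each stay under the cap is not detected. That is the limitation noted above.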
Option 3 - Per-collection tracking
Same as Option 1 but scoped to the collection experiencing the storm. More precise but requires a sharded map keyed by collection UUID. Deferred to v2.
Scope
- User-facing queries only. Internal ops and secondaries retain unlimited retry.
- Integration point is PlanExecutorImpl::_handleNeedYield in src/mongo/db/query/plan_executor_impl.cpp, before the yield that releases the write ticket.
Decision Needed
Confirm which option to implement, the error code to return (TemporarilyUnavailable vs a new WriteConflictStorm code), and default threshold values.
is related to
- SERVER-65418 Query plan executor must release resources before backing off (In Progress)
related to
- SERVER-126694 jstest + design: write-conflict storm early-reject (SERVER-126462) (Needs Verification)