Type: Task
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Query Execution
Sprint: QE 2026-05-25, QE 2026-06-08
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Overview

When many concurrent operations hit the same hot document, they enter an indefinite write conflict retry loop. SERVER-65418 introduced releasing the write ticket before sleeping on a write conflict, which allows other operations to proceed during the backoff. However, this also means that each retry frees a ticket that is immediately grabbed by another operation hitting the same document, creating a positive feedback loop that can exhaust write throughput across the entire server. This ticket investigates whether we should detect this condition and reject retrying operations early rather than letting the storm grow unbounded.

Options to Evaluate

Option 1 - Reject based on ticket pool size (primary candidate)

Track the number of operations currently in the write conflict retry loop. If it exceeds a threshold proportional to the write ticket pool size (e.g. 50%), reject further retries with an error. The threshold scales automatically with pool size, which throughput probing adjusts dynamically. An activation gate also requires the write ticket queue depth to exceed a percentage of pool size, so the breaker stays off when the system has spare capacity.
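
For discussion, the check described above could look roughly like the following. This is a minimal sketch, not a proposed patch: `WriteConflictStormBreaker` and all of its members are hypothetical names that do not exist in the server today, the 50% retry fraction is the example figure from this ticket, and the queue-depth fraction in the usage note is an arbitrary placeholder. Pool size and queue depth are passed in on every call so that the thresholds follow whatever pool size throughput probing has currently chosen.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical sketch of Option 1. None of these names exist in the server.
class WriteConflictStormBreaker {
public:
    // retryFraction: share of the write ticket pool allowed in the retry loop.
    // queueFraction: queue depth, as a share of the pool, that arms the breaker.
    WriteConflictStormBreaker(double retryFraction, double queueFraction)
        : _retryFraction(retryFraction), _queueFraction(queueFraction) {
        assert(retryFraction > 0.0 && queueFraction >= 0.0);
    }

    // Called when an operation enters or leaves the write conflict retry loop.
    void onEnterRetryLoop() { _opsInRetryLoop.fetch_add(1, std::memory_order_relaxed); }
    void onExitRetryLoop() { _opsInRetryLoop.fetch_sub(1, std::memory_order_relaxed); }

    // Returns true if the next retry should be rejected with an error.
    bool shouldReject(int64_t poolSize, int64_t queueDepth) const {
        if (poolSize <= 0) {
            return false;
        }
        const double pool = static_cast<double>(poolSize);
        // Activation gate: stay off while the system has spare capacity.
        if (static_cast<double>(queueDepth) <= _queueFraction * pool) {
            return false;
        }
        const int64_t inLoop = _opsInRetryLoop.load(std::memory_order_relaxed);
        return static_cast<double>(inLoop) > _retryFraction * pool;
    }

private:
    std::atomic<int64_t> _opsInRetryLoop{0};
    const double _retryFraction;
    const double _queueFraction;
};
```

With `WriteConflictStormBreaker breaker(0.5, 0.25)` and a pool of 128 tickets, retries are rejected only when more than 64 operations are in the retry loop and more than 32 operations are queued for a write ticket. If probing shrinks the pool, both limits shrink with it.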

Option 2 - Per-operation retry cap

Introduce a configurable maximum number of write conflict retries per operation for user queries. Simple to implement but does not respond to system-wide pressure.
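
A per-operation cap amounts to a bounded retry loop. The sketch below is illustrative only: `runWithRetryCap`, `AttemptResult`, and `WriteOutcome` are hypothetical names, the write attempt is modelled as a plain callback, and the backoff between attempts is omitted.

```cpp
#include <cassert>
#include <functional>

// Hypothetical sketch of Option 2. None of these names exist in the server.
enum class AttemptResult { kCommitted, kWriteConflict };
enum class WriteOutcome { kCommitted, kRetriesExhausted };

// Runs `attempt` once, then retries it up to `maxRetries` more times while it
// keeps reporting a write conflict.
inline WriteOutcome runWithRetryCap(const std::function<AttemptResult()>& attempt,
                                    int maxRetries) {
    assert(maxRetries >= 0);
    for (int retries = 0;; ++retries) {
        if (attempt() == AttemptResult::kCommitted) {
            return WriteOutcome::kCommitted;
        }
        if (retries == maxRetries) {
            // Cap reached: surface an error instead of retrying again.
            return WriteOutcome::kRetriesExhausted;
        }
        // A real implementation would back off here before the next attempt.
    }
}
```

Note that the decision depends only on the operation's own retry count, which is why this option cannot react to system-wide pressure: an operation gives up after the same number of retries whether the server is idle or saturated.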

Option 3 - Per-collection tracking

Same as Option 1 but scoped to the collection experiencing the storm. More precise but requires a sharded map keyed by collection UUID. Deferred to v2.
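
To give a sense of the extra machinery Option 3 needs, here is a rough sketch of a sharded counter map. Everything in it is an assumption made for illustration: `PerCollectionRetryCounts` is a hypothetical name, the collection UUID is represented as a `std::string` so the example is self-contained, and the shard count of 16 is arbitrary.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch of the Option 3 bookkeeping. Each shard has its own
// mutex, so operations on different collections rarely contend on one lock.
class PerCollectionRetryCounts {
public:
    // Records one more operation retrying on this collection; returns the new count.
    int64_t increment(const std::string& collUuid) {
        Shard& shard = _shardFor(collUuid);
        std::lock_guard<std::mutex> lk(shard.mutex);
        return ++shard.counts[collUuid];
    }

    // Records that an operation left the retry loop; drops empty entries.
    void decrement(const std::string& collUuid) {
        Shard& shard = _shardFor(collUuid);
        std::lock_guard<std::mutex> lk(shard.mutex);
        auto it = shard.counts.find(collUuid);
        if (it != shard.counts.end() && --it->second <= 0) {
            shard.counts.erase(it);
        }
    }

    // Current number of operations retrying on this collection.
    int64_t count(const std::string& collUuid) {
        Shard& shard = _shardFor(collUuid);
        std::lock_guard<std::mutex> lk(shard.mutex);
        auto it = shard.counts.find(collUuid);
        return it == shard.counts.end() ? 0 : it->second;
    }

private:
    static constexpr std::size_t kNumShards = 16;  // arbitrary for this sketch

    struct Shard {
        std::mutex mutex;
        std::unordered_map<std::string, int64_t> counts;
    };

    Shard& _shardFor(const std::string& collUuid) {
        return _shards[std::hash<std::string>{}(collUuid) % kNumShards];
    }

    std::array<Shard, kNumShards> _shards;
};
```

The per-collection count would then take the place of the server-wide counter in the Option 1 threshold check.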

Scope

User-facing queries only. Internal ops and secondaries retain unlimited retry.
Integration point is PlanExecutorImpl::_handleNeedYield in src/mongo/db/query/plan_executor_impl.cpp, before the yield that releases the write ticket.

Decision Needed

Confirm which option to implement, the error code to return (TemporarilyUnavailable vs a new WriteConflictStorm code), and default threshold values.

is related to

SERVER-65418 Query plan executor must release resources before backing off (Closed)

Assignee: Zixuan Zhuang
Reporter: Zixuan Zhuang
Participants: Zixuan Zhuang
Votes: 0
Watchers: 3

Created: May 12 2026 06:11:08 PM UTC
Updated: Jun 04 2026 05:53:40 PM UTC
