Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: cs-impact-3
Assigned Teams: Cluster Scalability
Operating System: ALL
Sprint: Cluster Scalability Priorities
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Note: This issue only affects 8.0 and earlier. A change in 8.1 made the fetcher update the fetched count after inserting an oplog batch into the buffer, rather than before.

There is a resharding bug where the critical section can never be engaged because the in-memory oplogEntriesFetched count is larger than the actual number of oplog entries in the resharding oplog buffer (the two are expected to match). This happens through the following mechanism:

1. The fetcher increments the oplogEntriesFetched count before inserting an oplog entry batch into the buffer collection.
2. The insert into the buffer can fail for a variety of reasons, such as a WriteConflict exception.
3. The fetcher retries inserting the oplog entry batch, but does not roll back the count it already added for the failed attempt. This causes the oplogEntriesFetched count to be permanently higher than what is actually in the buffer.
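
The mechanism above can be modeled with a small simulation. This is a minimal sketch, not the server code: `BuggyFetcher`, `WriteConflict`, and `_insert_batch` are hypothetical names, and the model assumes that each retry repeats the increment before attempting the insert again.

```python
class WriteConflict(Exception):
    """Hypothetical stand-in for a transient insert failure."""


class BuggyFetcher:
    """Toy model of a fetcher that counts entries before the insert succeeds."""

    def __init__(self, fail_first_attempts):
        self.oplog_entries_fetched = 0  # in-memory metric
        self.buffer = []                # stand-in for the oplog buffer collection
        self._failures_left = fail_first_attempts

    def _insert_batch(self, batch):
        # Fail the first N attempts to imitate a transient write error.
        if self._failures_left > 0:
            self._failures_left -= 1
            raise WriteConflict()
        self.buffer.extend(batch)

    def fetch(self, batch):
        while True:
            # Modeled bug: the count is bumped before the insert and is
            # never rolled back when the insert throws.
            self.oplog_entries_fetched += len(batch)
            try:
                self._insert_batch(batch)
                return
            except WriteConflict:
                continue  # retry the same batch


fetcher = BuggyFetcher(fail_first_attempts=1)
fetcher.fetch(["op1", "op2", "op3"])
overcount = fetcher.oplog_entries_fetched - len(fetcher.buffer)
print(fetcher.oplog_entries_fetched, len(fetcher.buffer), overcount)
```

With one injected failure, the model counts the three-entry batch twice while the buffer holds it once, so the metric stays three entries ahead of the buffer for the rest of the run.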

If oplogEntriesFetched is permanently higher than what exists in the buffer, then oplogEntriesApplied can never catch up (since oplogEntriesApplied reflects how many buffered entries have been applied and is therefore bounded by the number of entries in the oplog buffer).

The estimated time to apply the remaining work is computed as estimatedTime = timeInApplying * (fetched / applied - 1). When the oplog application phase runs long enough, the incorrect "remaining work" (fetched - applied) can push this estimate above the critical section entry threshold, so the critical section is never engaged.
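
To see how a small, fixed overcount interacts with elapsed time, here is a worked sketch of the formula quoted above. The function name, the threshold value, and the sample numbers are assumptions chosen for illustration, not values from the server, and the sketch assumes `applied` is greater than zero.

```python
def estimated_remaining_millis(time_in_applying_millis, fetched, applied):
    """estimatedTime = timeInApplying * (fetched / applied - 1)."""
    return time_in_applying_millis * (fetched / applied - 1)


# Hypothetical threshold for entering the critical section, in milliseconds.
THRESHOLD_MILLIS = 500

# The applier has applied everything that is really in the buffer,
# but the fetched count carries a phantom overcount of 3 entries.
applied = 1000
fetched = applied + 3

for elapsed_millis in (60_000, 600_000, 6_000_000):
    estimate = estimated_remaining_millis(elapsed_millis, fetched, applied)
    print(elapsed_millis, round(estimate), estimate > THRESHOLD_MILLIS)
```

Under these assumptions the ratio fetched / applied is constant, so the estimate scales linearly with the time spent applying: an overcount that is harmless after one minute exceeds the assumed threshold once the phase has run for ten minutes.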

As part of fixing this we might also want to consider adding tests that inject random failures into the appliers and fetchers at key points (such as when inserting into the oplog buffer) to ensure that resharding is robust to these failures and doesn't miscount fetched/applied.
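
One possible shape for such a test is sketched below as a toy model rather than against the real fetcher. `run_fetcher` and its parameters are hypothetical; failures are injected with a seeded random generator at the buffer insert, and the invariant checked is that the fetched count equals the number of buffered entries.

```python
import random


def run_fetcher(batches, failure_rate, rng, count_before_insert):
    """Toy fetcher loop with failures injected at the buffer insert.

    Returns (fetched_count, buffered_count).
    """
    fetched = 0
    buffered = 0
    for batch in batches:
        while True:
            if count_before_insert:
                fetched += len(batch)  # ordering described in this ticket
            if rng.random() < failure_rate:
                continue               # injected insert failure, retry the batch
            buffered += len(batch)
            if not count_before_insert:
                fetched += len(batch)  # count only after a successful insert
            break
    return fetched, buffered


batches = [["op"] * 5 for _ in range(20)]

for seed in range(50):
    # Counting after the insert keeps the invariant under any failure pattern.
    fetched, buffered = run_fetcher(batches, 0.5, random.Random(seed), False)
    assert fetched == buffered

    # Counting before the insert can only overcount, never undercount.
    fetched, buffered = run_fetcher(batches, 0.5, random.Random(seed), True)
    assert fetched >= buffered
```

Seeding the generator keeps each failing pattern reproducible, which matters when a randomized test is used to track down a miscount.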

resharding_oplog_fetched_metric_overcount.js
6 kB
Feb 02 2026 08:09:50 PM UTC

is depended on by

SERVER-118722 Add coverage for random failures during oplog application and fetching in resharding

Backlog

related to

SERVER-119255 Refactor resharding appliers and fetchers so that we can inject failures in unit tests

Backlog

Assignee: Unassigned
Reporter: Wenqin Ye
Participants: Wenqin Ye
Votes: 0
Watchers: 8

Created: Feb 02 2026 08:06:28 PM UTC
Updated: Feb 25 2026 09:42:25 PM UTC
