- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- ALL
Note: This issue only affects 8.0 and earlier, because a change in 8.1 made it so that the fetched count is updated after an oplog batch is inserted into the buffer.
There is a resharding bug where the critical section can never be engaged because the in-memory oplogEntriesFetched count is larger than the actual number of oplog entries in the resharding oplog buffer (the two are expected to match). This happens through the following mechanism:
- The fetcher increments the oplogEntriesFetched count before inserting an oplog entry batch into the buffer collection.
- The insert into the buffer can fail for a variety of reasons, such as a WriteConflict exception.
- The fetcher retries inserting the oplog entry batch but does not undo the count it already added. As a result, the oplogEntriesFetched count stays permanently higher than the number of entries actually in the buffer.
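The mechanism can be illustrated with a small Python model. This is only a sketch, not the server code: `Buffer`, `WriteConflict`, and both fetch functions are hypothetical stand-ins, and the model assumes the count is added again on every retry attempt.

```python
class WriteConflict(Exception):
    """Hypothetical stand-in for a transient write conflict on insert."""


class Buffer:
    """Toy oplog buffer whose first `failures` insert attempts raise WriteConflict."""

    def __init__(self, failures):
        self.failures = failures
        self.entries = []

    def insert(self, batch):
        if self.failures > 0:
            self.failures -= 1
            raise WriteConflict()
        self.entries.extend(batch)


def fetch_count_before_insert(buffer, batch):
    """Ordering described in this ticket: count first, then insert with retry."""
    fetched = 0
    while True:
        fetched += len(batch)  # added on every attempt, including failed ones
        try:
            buffer.insert(batch)
            return fetched
        except WriteConflict:
            continue  # retry without undoing the increment above


def fetch_count_after_insert(buffer, batch):
    """Alternative ordering: count only once the insert has succeeded."""
    fetched = 0
    while True:
        try:
            buffer.insert(batch)
        except WriteConflict:
            continue  # nothing was counted, so nothing needs undoing
        fetched += len(batch)
        return fetched


if __name__ == "__main__":
    batch = ["op1", "op2", "op3"]

    buggy_buffer = Buffer(failures=2)
    buggy_count = fetch_count_before_insert(buggy_buffer, batch)
    print("count-before-insert:", buggy_count, "buffered:", len(buggy_buffer.entries))

    fixed_buffer = Buffer(failures=2)
    fixed_count = fetch_count_after_insert(fixed_buffer, batch)
    print("count-after-insert:", fixed_count, "buffered:", len(fixed_buffer.entries))
```

In this model, counting after a successful insert keeps the counter equal to the buffer size regardless of how many attempts fail, which matches the ordering the note above attributes to 8.1.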
If oplogEntriesFetched is permanently higher than what exists in the buffer, then oplogEntriesApplied can never catch up (since oplogEntriesApplied reflects how many buffered entries have been applied and is therefore bounded by the number of entries in the oplog buffer).
When the oplog application phase runs long enough, this incorrect "remaining work" (fetched - applied) can push the estimated time to apply the remaining work above the critical section entry threshold, since estimatedTime = timeInApplying * (fetched / applied - 1). The critical section is then never engaged.
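To make the effect of the formula concrete, the following Python sketch evaluates it for a recipient that has applied every buffered entry but whose fetched count is inflated. The function name and all numbers (1000 buffered entries, 10 phantom entries, the sampled times) are illustrative assumptions, not values from the server.

```python
def estimated_remaining_time(time_in_applying, fetched, applied):
    """Evaluate estimatedTime = timeInApplying * (fetched / applied - 1)."""
    return time_in_applying * (fetched / applied - 1)


if __name__ == "__main__":
    buffered = 1000            # entries actually in the buffer (assumed)
    applied = buffered         # applier has caught up with the buffer
    fetched = buffered + 10    # counter inflated by 10 phantom entries (assumed)

    # The estimate grows linearly with time spent applying, even though
    # no real work remains.
    for seconds in (60.0, 600.0, 3600.0):
        estimate = estimated_remaining_time(seconds, fetched, applied)
        print(f"timeInApplying={seconds:7.1f}s -> estimate={estimate:.3f}s")

    # With a correct counter the estimate is zero at any point in time.
    print(estimated_remaining_time(3600.0, buffered, applied))
```

Because fetched / applied stays above 1 once the applier has drained the buffer, the estimate can only increase with timeInApplying, so it eventually exceeds any fixed threshold and stays there.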
As part of fixing this, we might also want to add tests that inject random failures into the appliers and fetchers at key points (such as when inserting into the oplog buffer) to ensure that resharding is robust to these failures and doesn't miscount fetched/applied.
- is depended on by
  - SERVER-118722 Add coverage for random failures during oplog application and fetching in resharding (Backlog)
- related to
  - SERVER-119255 Refactor resharding appliers and fetchers so that we can inject failures in unit tests (Needs Scheduling)