Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Affects Version/s: None
Component/s: None
Cluster Scalability
The idea is to make the donor iterate over the documents instead of using the count command.
The advantages are:
- Ensures that only one scan runs at a time, regardless of how many failovers occur. This prevents the issue where the config server keeps resending the count command when it fails over (currently, the config server sends count to the donors).
- Each donor's progress is independent of the others (currently, the count commands to all donors must succeed at the same time).
- Since we are iterating over the documents, we get regular responses from the query layer, allowing us to store progress incrementally and thus resume after an interruption or failover (the edge case is a huge continuous range of orphans it has to iterate over).
- There is no network traffic, since everything executes locally.
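The resumable-scan idea above can be sketched as follows. This is a hypothetical illustration in Python, not server code; the `checkpointed_count` helper and its checkpoint shape are illustrative assumptions, standing in for the query layer's periodic batch responses.

```python
# Hypothetical sketch (not MongoDB server code): count documents by
# iterating in batches and checkpointing progress, so the scan can be
# resumed after an interruption or failover instead of restarting.

def checkpointed_count(docs, checkpoint=None, batch_size=2):
    """Iterate over `docs`, yielding a checkpoint (running count plus
    resume position) after every `batch_size` documents."""
    count = checkpoint["count"] if checkpoint else 0
    start = checkpoint["next_index"] if checkpoint else 0
    for i in range(start, len(docs), batch_size):
        count += len(docs[i:i + batch_size])
        yield {"count": count, "next_index": i + batch_size}

# Simulate a failover: take one checkpoint, then resume from it.
docs = list(range(7))
scan = checkpointed_count(docs)
first = next(scan)  # progress stored after the first batch
resumed = list(checkpointed_count(docs, checkpoint=first))
final = resumed[-1]
print(final["count"])  # 7 -- the resumed scan finishes the count
```

A plain count command, by contrast, is all-or-nothing: any interruption throws away all progress, which is exactly what the iteration approach avoids.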
Some minor details:
- The query must be properly versioned so that orphans are filtered out.
- We might still want to scan the compatible shard key index so the queries are covered. However, to make them covered, we have to project only the index fields, which is problematic if multiple documents share the same shard key. One solution may be to use the same $natural-order query + resumeToken used by the cloner (this means we will have to pin to a node unless recordIds are replicated).
- Must use a snapshot read at minFetchTimestamp.
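The duplicate-shard-key concern above can be illustrated with a small hypothetical example (plain Python, not server code; the record ids and helper names are invented): if a covered scan returns only shard-key values, a resume point expressed as "last key seen" cannot distinguish duplicates, whereas a position-based token (analogous to the cloner's $natural order + resumeToken) can.

```python
# Hypothetical illustration: resuming a scan by "last shard key seen"
# silently skips duplicate keys, while resuming by record position does not.

# (shard_key, record_id) pairs; two documents share shard key 5.
docs = [(1, "r1"), (5, "r2"), (5, "r3"), (9, "r4")]

def resume_by_key(docs, last_key):
    # A covered scan only yields shard keys, so the safest resume is
    # "strictly after last_key" -- which drops last_key's duplicates.
    return [d for d in docs if d[0] > last_key]

def resume_by_position(docs, last_record_id):
    # A position-based token identifies the exact document to resume after.
    ids = [rid for _, rid in docs]
    return docs[ids.index(last_record_id) + 1:]

# Scan interrupted right after seeing (5, "r2"):
print(len(resume_by_key(docs, 5)))          # 1 -- (5, "r3") was skipped
print(len(resume_by_position(docs, "r2")))  # 2 -- continues at (5, "r3")
```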
We will also need a new command for the coordinator to extract this info from the donor, e.g. extractReshardingDonorInitialDocCount, or perhaps make one of the transition commands from the shard-authoritative track (SPM-3978) return this value.
is related to: SERVER-112444 Resharding validation aggregation shouldn't hint the _id index (Backlog)
related to: SERVER-120482 Consider splitting resharding validator's count call into smaller ranges (Backlog)