Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.2.0-rc0
Affects Version/s: None
Component/s: None
Labels:
- resharding-success-rate-improvements

Assigned Teams:

Cluster Scalability
Backwards Compatibility:
Fully Compatible
Backport Requested:

v8.1, v8.0
Sprint:
ClusterScalability Apr14-Apr28
Linked BF Score:
200
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The aggregation command that the ReshardingCollectionCloner runs against each donor shard specifies readPreference "nearest" to avoid overloading the donor's primary. Currently, the command doesn't specify maxStalenessSeconds. As a result, the aggregate command could end up targeting a node that is so stale that it later transitions to the RECOVERING state, for example because the oplog entries it needs have been truncated. If the transition occurs while the read is waiting for the readConcern's afterClusterTime, the operation will, by design, not be interrupted. The resharding operation would then be stuck trying to start the cloning phase until the node is shut down.
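To illustrate the shape of the change described above, here is a minimal Python sketch that builds an aggregate command document carrying a "nearest" read preference with maxStalenessSeconds set. It is not the server's C++ implementation. The helper names (`build_read_preference`, `build_cloner_aggregate`), the parameters, and the example values are hypothetical, and the ticket does not state which staleness value the fix uses. The only outside fact relied on is that MongoDB documents a minimum of 90 seconds for maxStalenessSeconds.

```python
# Smallest maxStalenessSeconds value that MongoDB documents as valid.
MIN_MAX_STALENESS_SECONDS = 90


def build_read_preference(max_staleness_seconds):
    """Return a 'nearest' read preference document bounded by staleness.

    Hypothetical helper for illustration only.
    """
    if max_staleness_seconds < MIN_MAX_STALENESS_SECONDS:
        raise ValueError(
            "maxStalenessSeconds must be at least %d" % MIN_MAX_STALENESS_SECONDS
        )
    return {"mode": "nearest", "maxStalenessSeconds": max_staleness_seconds}


def build_cloner_aggregate(collection, pipeline, after_cluster_time,
                           max_staleness_seconds):
    """Assemble an aggregate command document (hypothetical shape).

    after_cluster_time is treated as an opaque value supplied by the caller.
    """
    return {
        "aggregate": collection,
        "pipeline": pipeline,
        "cursor": {},
        # The read waits until the targeted node has reached this cluster time.
        "readConcern": {"afterClusterTime": after_cluster_time},
        # Bounding staleness keeps the read away from badly lagging nodes.
        "$readPreference": build_read_preference(max_staleness_seconds),
    }


# Example values below are placeholders, not taken from the ticket.
example_cmd = build_cloner_aggregate(
    collection="coll",
    pipeline=[],
    after_cluster_time="<cluster time placeholder>",
    max_staleness_seconds=120,
)
```

With a staleness bound in place, a secondary that lags beyond the bound is no longer eligible for "nearest" targeting, which is the situation the description identifies as leading to the stuck read.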

According to the replication team, Atlas has a mechanism to detect that a node has transitioned to RECOVERING and to intervene to return the node to a healthy state (e.g. via initial sync). So this fix is presumably only necessary for non-Atlas use cases.

Assignee: Cheahuychou Mao
Reporter: Cheahuychou Mao
Participants: Cheahuychou Mao, Githook User
Votes: 0
Watchers: 2

Created: Apr 08 2025 06:26:00 PM UTC
Updated: May 20 2025 08:08:21 PM UTC
Resolved: Apr 17 2025 12:55:14 AM UTC
