- Type: Task
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- Fully Compatible
- ClusterScalability Dec8-Dec22
Background
The 100GB Locust resharding workload has historically failed to enter the critical section under intense write loads due to timeouts: the system was unable to meet the entry condition (remaining time estimate < 500 ms) within the 6-hour limit.
This issue was largely mitigated in SERVER-110169 by introducing short 10-second "dips" every 5 minutes during which the write load was halved. After this change, performance improved significantly, with successful runs completing the resharding phase in about 4.5 hours.
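For reference, a minimal sketch of the kind of periodic traffic dip SERVER-110169 introduced. The helper names and the way the rate is applied are illustrative assumptions, not the actual workload code:

```python
import time

# Illustrative constants for the SERVER-110169 mitigation: every 5 minutes,
# halve the effective write load for a 10-second "dip" so the resharding
# coordinator can catch up enough to enter the critical section.
DIP_PERIOD_SECS = 300   # dip every 5 minutes
DIP_LENGTH_SECS = 10    # each dip lasts 10 seconds


def in_dip(now=None):
    """Return True while the workload should run at a reduced write rate."""
    now = time.time() if now is None else now
    return (now % DIP_PERIOD_SECS) < DIP_LENGTH_SECS


def effective_write_rate(base_rate):
    """Halve the write rate during a dip, otherwise use the configured rate."""
    return base_rate / 2 if in_dip() else base_rate
```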
However, recent runs have begun failing again. In the latest failure, the last remaining catch‑up time estimate was 826,083 ms (~13.8 minutes), preventing the coordinator from entering the critical section. This estimate is unusually large and never dropped below the 500 ms threshold during the 6‑hour window.
Link to write‑up with additional examples.
Investigation Goal
Identify the cause of the unusually large remaining time estimates and the recurring timeout failures in the 100GB Locust resharding workload.
Note: Please use the log analysis script to investigate workload failures more effectively.
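A minimal sketch of how such an analysis could scan a mongod structured log for remaining-time estimates and check them against the 500 ms entry threshold. This is not the actual log analysis script, and the attribute name used below is an assumption that may differ from the real log schema:

```python
import json
import sys

THRESHOLD_MS = 500  # critical-section entry condition used by the workload


def scan_remaining_estimates(path):
    """Print every remaining-time estimate found in the log and report whether
    any of them ever dropped below the 500 ms entry threshold."""
    below_threshold = False
    with open(path) as log:
        for line in log:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines
            attrs = entry.get("attr", {})
            # Attribute name is an assumption; adjust to the actual log schema.
            estimate_ms = attrs.get("remainingOperationTimeEstimatedMillis")
            if estimate_ms is None:
                continue
            timestamp = entry.get("t", {}).get("$date", "?")
            print(f"{timestamp}  {estimate_ms} ms")
            below_threshold |= estimate_ms < THRESHOLD_MS
    print("dropped below threshold" if below_threshold
          else "never dropped below threshold")


if __name__ == "__main__":
    scan_remaining_estimates(sys.argv[1])
```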
- is depended on by
  - SERVER-110748 Enable featureFlagReshardingRemainingTimeEstimateBasedOnMovingAverage (Backlog)
- is related to
  - SERVER-106550 ShardRemote::runAggregation should only return postBatchResumeToken when the batch is empty (Closed)
  - SERVER-110169 Introduce short traffic drops in reshard_collection_10_indexes_100G_locust to allow resharding to commit (Closed)
  - SERVER-115274 Make ReshardingOplogFetcher also update the average time to fetch when the batch doesn't have postBatchResumeToken (Closed)