- Type: Task
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- Fully Compatible
- ClusterScalability Dec8-Dec22
Background
The 100GB Locust resharding workload has historically failed to enter the critical section under intense write loads due to timeouts: the system was unable to meet the entry condition (remaining time estimate < 500 ms) within the 6-hour limit.
This issue was largely mitigated in SERVER-110169 by introducing short 10-second "dips" every 5 minutes during which the write load was halved. After this change, performance improved significantly, with successful runs completing the resharding phase in about 4.5 hours.
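For reference, a minimal sketch of the kind of periodic traffic dip SERVER-110169 introduced. The helper names and the way the rate is applied are illustrative assumptions, not the actual workload code:

```python
import time

# Illustrative constants for the SERVER-110169 mitigation: every 5 minutes,
# halve the effective write load for a 10-second "dip" so the resharding
# coordinator can catch up enough to enter the critical section.
DIP_PERIOD_SECS = 300   # dip every 5 minutes
DIP_LENGTH_SECS = 10    # each dip lasts 10 seconds


def in_dip(now=None):
    """Return True while the workload should run at a reduced write rate."""
    now = time.time() if now is None else now
    return (now % DIP_PERIOD_SECS) < DIP_LENGTH_SECS


def effective_write_rate(base_rate):
    """Halve the write rate during a dip, otherwise use the configured rate."""
    return base_rate / 2 if in_dip() else base_rate
```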
However, recent runs have begun failing again. In the latest failure, the last remaining catch‑up time estimate was 826,083 ms (~13.8 minutes), preventing the coordinator from entering the critical section. This estimate is unusually large and never dropped below the 500 ms threshold during the 6‑hour window.
Link to write‑up with additional examples.
Investigation Goal
Identify the cause of the unusually large remaining time estimates and the recurring timeout failures in the 100GB Locust resharding workload.
Note: Please use the log analysis script to investigate workload failures more effectively.
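A minimal sketch of how such an analysis could scan a mongod structured log for remaining-time estimates and check them against the 500 ms entry threshold. This is not the actual log analysis script, and the attribute name used below is an assumption that may differ from the real log schema:

```python
import json
import sys

THRESHOLD_MS = 500  # critical-section entry condition used by the workload


def scan_remaining_estimates(path):
    """Print every remaining-time estimate found in the log and report whether
    any of them ever dropped below the 500 ms entry threshold."""
    below_threshold = False
    with open(path) as log:
        for line in log:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines
            attrs = entry.get("attr", {})
            # Attribute name is an assumption; adjust to the actual log schema.
            estimate_ms = attrs.get("remainingOperationTimeEstimatedMillis")
            if estimate_ms is None:
                continue
            timestamp = entry.get("t", {}).get("$date", "?")
            print(f"{timestamp}  {estimate_ms} ms")
            below_threshold |= estimate_ms < THRESHOLD_MS
    print("dropped below threshold" if below_threshold
          else "never dropped below threshold")


if __name__ == "__main__":
    scan_remaining_estimates(sys.argv[1])
```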
- is depended on by
  - SERVER-110748 Enable featureFlagReshardingRemainingTimeEstimateBasedOnMovingAverage (Backlog)
- is related to
  - SERVER-106550 ShardRemote::runAggregation should only return postBatchResumeToken when the batch is empty (Closed)
  - SERVER-110169 Introduce short traffic drops in reshard_collection_10_indexes_100G_locust to allow resharding to commit (Closed)
  - SERVER-115274 Make ReshardingOplogFetcher also update the average time to fetch when the batch doesn't have postBatchResumeToken (Closed)