- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- 2
Based on HELP-84458, it appears that the resharding cloner can easily produce more write load than replication can keep up with on certain hardware configurations. Today, the resharding cloner performs its writes locally and does not wait for majority replication before proceeding to the next write. After all documents are cloned, the first point at which the recipient waits for replication is just before building indexes (as of SERVER-103566).
This is good for overall throughput, but it maximizes the strain that resharding puts on replication and the rest of the system. We can already throttle resharding to some extent using parameters such as reshardingCollectionClonerBatchSizeCount and reshardingCollectionClonerWriteThreadCount, but these are not directly aware of the current level of replication lag. Until SPM-2935 and SPM-4263 can address this more completely, we may want to consider adding an option that forces the resharding cloner to await replication periodically as a stopgap solution.
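To make the proposal concrete, here is a minimal Python sketch of the control flow only. It is not the server's implementation: `clone_with_periodic_await`, `write_batch_locally`, `await_majority`, and `await_every_n_batches` are hypothetical names chosen for illustration, and the two callbacks stand in for the cloner's local write and for a wait on majority replication.

```python
from typing import Callable, Iterable, Sequence


def clone_with_periodic_await(
    batches: Iterable[Sequence[dict]],
    write_batch_locally: Callable[[Sequence[dict]], None],
    await_majority: Callable[[], None],
    await_every_n_batches: int = 0,
) -> int:
    """Write every batch locally, pausing for replication every N batches.

    All names here are hypothetical. A value of 0 for
    await_every_n_batches models today's behavior: the cloner never
    waits for replication while cloning. Returns the number of waits.
    """
    if await_every_n_batches < 0:
        raise ValueError("await_every_n_batches must be >= 0")

    waits = 0
    for batch_number, batch in enumerate(batches, start=1):
        # Local write only; no write concern is awaited here.
        write_batch_locally(batch)

        # Periodic backpressure: block until the writes so far are
        # majority-replicated before starting the next batch.
        if await_every_n_batches and batch_number % await_every_n_batches == 0:
            await_majority()
            waits += 1
    return waits
```

With ten batches and `await_every_n_batches=3`, this sketch waits after batches 3, 6, and 9. A smaller interval bounds how far the cloner can run ahead of replication at the cost of throughput, which is the trade-off the option would expose.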
- is related to
  - SERVER-100264 Resharding Natural Order Pipeline Does Not Respect reshardingCollectionClonerBatchSizeInBytes (Closed)
  - SERVER-103566 Make ReshardingRecipientService wait for replication lag across all nodes to be some threshold before building indexes (Closed)