- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
Consuming information about numReadsByRanges and numWritesByRanges is only valuable when a user can drill down into which ranges are read- or write-heavy. Without that functionality, the reads/writes-by-ranges output often raises more questions than it answers. Due to $sample issues, it can also block the percentage-of-reads and percentage-of-writes output that analyzeShardKey with readWriteDistribution: true provides.
When analyzeShardKeyNumRanges is 1, we skip calling getNext() on the $sample aggregation entirely, so we should default it to 1.
If users would like to see the ranges output, they can use setParameter to set analyzeShardKeyNumRanges back to 100.
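For illustration, a mongosh sketch of opting back in (the namespace testDb.testColl and shard key { userId: 1 } are made up; this assumes analyzeShardKeyNumRanges stays a runtime-settable server parameter):

// Restore the old behavior of sampling 100 ranges (run against each node
// that serves analyzeShardKey, as appropriate for the deployment).
db.adminCommand({ setParameter: 1, analyzeShardKeyNumRanges: 100 })

// Re-run the command with read/write distribution metrics enabled.
db.adminCommand({
    analyzeShardKey: "testDb.testColl",
    key: { userId: 1 },
    readWriteDistribution: true
})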
More context for posterity:
When calculating the keyCharacteristics metrics, the sampled documents are used to calculate the cardinality, frequency, and monotonicity. For that, we do want higher precision, which is why the calculation samples a large number of documents (sampleRate defaults to 1 and sampleSize defaults to 1 million) and why we purposely use an index scan instead of $sample, even though performance testing shows the cost of an index scan is non-trivial for large collections.
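As a sketch, those sampling knobs are exposed on the command itself; the namespace and key below are placeholders, and only one of sampleRate/sampleSize is set:

// Cap the keyCharacteristics calculation at 100k sampled documents instead
// of the 1 million default; sampleRate is the alternative, proportional knob.
db.adminCommand({
    analyzeShardKey: "testDb.testColl",
    key: { userId: 1 },
    keyCharacteristics: true,
    readWriteDistribution: false,
    sampleSize: 100000
})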
When calculating the readWriteDistribution metrics, the sampled documents are only used to define the chunk boundaries, so a rough estimate is acceptable. The use of $sample stems from it being the approach resharding uses to find chunk boundaries. Also, we expected $sample to return representative random samples when the shard key has good cardinality and frequency; we later discovered that this assumption doesn't always hold due to the randomness issue in $sample (see the linked WT-8003).
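For intuition only, a rough mongosh sketch of deriving chunk boundaries from $sample; this is not the server's actual implementation, and the collection and shard key field are placeholders:

// Sample numRanges - 1 shard key values and treat the sorted values as split
// points. With a skewed or low-cardinality key, or with the random-cursor
// duplicate-key issue tracked in WT-8003, the samples (and therefore the
// boundaries) can be unrepresentative.
const numRanges = 100;
const splitPoints = db.testColl.aggregate([
    { $sample: { size: numRanges - 1 } },
    { $project: { _id: 0, userId: 1 } },
    { $sort: { userId: 1 } }
]).toArray();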
- is related to: WT-8003 Fix frequent duplicate keys returned by random cursor in resharding test (Closed)