[SERVER-69646] [Optimization] Consider making analyzeShardKey command calculate correlation coefficient in batches Created: 13/Sep/22 Updated: 12/Dec/23 |
|
| Status: | Open |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Cheahuychou Mao | Assignee: | Backlog - Cluster Scalability |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Cluster Scalability
|
||||||||
| Participants: | |||||||||
| Description |
|
The check for monotonicity in the analyzeShardKey command currently relies on calculating the correlation coefficient between the WiredTiger RecordIds (i.e. y) in the index store and 1, ..., N (i.e. x), where N is the number of RecordIds. Given that each RecordId is 8 bytes, for a collection that has 10 million unique shard key values, the check would involve storing ~2 * 80MB of x and y in memory. While the check should be fast, the memory usage can still have a non-negligible impact on the server. To this end, we should consider calculating the correlation coefficient in batches of size N', where N' is more manageable. This paper describes a way to average correlation coefficients (r), specifically by transforming the r values using Fisher's z transformation, taking the average of the z values, and converting that average back to an r value. |
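As a rough illustration of the batching idea above, here is a minimal Python sketch, not the server implementation. It computes r for each batch of N' RecordIds against 1, ..., N', applies Fisher's z transformation (z = atanh(r)), averages the z values, and converts the average back with tanh. The function names, the weighting of each batch by (n - 3), the clamping of r away from +/-1, and the skipping of batches with fewer than 4 points are all assumptions made for this sketch. The result approximates the full-collection coefficient and is not guaranteed to equal it.

```python
import math
from typing import Iterable, List, Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Plain Pearson correlation coefficient for one in-memory batch."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # constant batch: treat as uncorrelated (assumption)
    return sxy / math.sqrt(sxx * syy)


def batched_correlation(record_ids: Iterable[int], batch_size: int) -> float:
    """Average per-batch r values via Fisher's z (hypothetical helper).

    Only one batch of at most `batch_size` RecordIds is held in memory
    at a time.
    """
    eps = 1e-12  # keeps atanh finite when a batch is perfectly monotonic
    z_sum = 0.0
    weight_sum = 0.0

    def consume(batch: List[int]) -> None:
        nonlocal z_sum, weight_sum
        n = len(batch)
        if n < 4:
            return  # weight (n - 3) would not be positive; skip (assumption)
        # Correlation is invariant to shifting x, so 1..n stands in for
        # the batch's global positions.
        r = pearson_r(range(1, n + 1), batch)
        r = max(-1.0 + eps, min(1.0 - eps, r))
        z_sum += (n - 3) * math.atanh(r)
        weight_sum += n - 3

    batch: List[int] = []
    for rid in record_ids:
        batch.append(rid)
        if len(batch) == batch_size:
            consume(batch)
            batch = []
    consume(batch)  # trailing partial batch, if any

    if weight_sum == 0:
        raise ValueError("need at least one batch with 4 or more RecordIds")
    return math.tanh(z_sum / weight_sum)
```

With 8-byte RecordIds and a batch size of, say, 100,000, the working set for y would be on the order of 800KB per batch instead of ~80MB for the whole collection. The batch size here is an arbitrary example value.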
| Comments |
| Comment by Adi Zaimi [ 06/Jun/23 ] |
|
I have lost the original reference I found, but I could find META-ANALYSIS OF CORRELATION and [A note on combining correlations|https://link.springer.com/content/pdf/10.3758/BF03334158.pdf], which describe a few ways this can be accomplished. Note that the Z-transform method mentioned in the description probably refers to the Silver & Dunlap (1987) paper.
|