Core Server / SERVER-69646

[Optimization] Consider making analyzeShardKey command calculate correlation coefficient in batches

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Assigned Teams: Cluster Scalability

      The check for monotonicity in the analyzeShardKey command currently relies on calculating the correlation coefficient between the WiredTiger RecordIds in the index store (i.e. y) and 1, ..., N (i.e. x), where N is the number of RecordIds. Given that each RecordId is 8 bytes, for a collection with 10 million unique shard key values the check would involve storing ~2 * 80MB of x and y values in memory. While the check itself should be fast, that much memory usage can still have a non-negligible impact on the server. To that end, we should consider calculating the correlation coefficient in batches of size N', where N' is more manageable. This paper describes a way to average correlation coefficients (r): transform each r value using Fisher's z-transformation (z = arctanh(r)), take the average of the z values, and convert the result back to an r value.
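
      As a rough illustration, below is a minimal Python sketch of the batched calculation. The function names, the batch size, the clamping of r away from ±1 (where arctanh is infinite), and the (n - 3) weighting of the per-batch z values (the conventional inverse-variance weight for Fisher's z) are all assumptions made for illustration; the actual implementation would live in the server's C++ code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes both sequences have nonzero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def batched_monotonicity_r(record_ids, batch_size=10_000):
    """Estimate the correlation between index position (x = 1..N) and
    RecordId (y) by averaging per-batch correlations in Fisher z-space."""
    zs, weights = [], []
    for start in range(0, len(record_ids), batch_size):
        batch = record_ids[start:start + batch_size]
        if len(batch) < 4:
            continue  # too few points for a meaningful r (weight n - 3 <= 0)
        xs = list(range(start + 1, start + 1 + len(batch)))
        r = pearson_r(xs, batch)
        r = max(-0.999999, min(0.999999, r))  # keep arctanh finite at |r| = 1
        zs.append(math.atanh(r))              # Fisher z = arctanh(r)
        weights.append(len(batch) - 3)        # inverse-variance weight for z
    if not weights:
        raise ValueError("not enough RecordIds to estimate correlation")
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(mean_z)                  # back-transform: r = tanh(z)
```

      With batching, only one batch of x and y values (plus the per-batch z values) needs to be held in memory at a time, e.g. ~2 * 80KB for a batch of 10,000 RecordIds instead of ~2 * 80MB for the whole collection.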

            Assignee:
            Backlog - Cluster Scalability
            Reporter:
            Cheahuychou Mao
            Votes:
            0
            Watchers:
            3
