- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
- None
- 3
- TBD
Users currently have no direct way to find out how much data will live on each shard after they shard a collection. A workaround is to run an aggregation with $bucketAuto, using one bucket per shard. This reports the number of documents in each range of the candidate shard key, which can then be used to estimate the data size per shard. In addition, users have to run a $facet aggregation to find the number of distinct shard key values in each $bucketAuto range.
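As a rough illustration of the size calculation in this workaround, the sketch below multiplies each bucket's document count by an average document size. The function name `estimateShardSizes`, the `shardN` labels, and the 1024-byte average are assumptions made for the example, not values from this ticket. In practice the average could come from the `avgObjSize` field reported by collection stats. The estimate is only as good as the assumption that documents are about the same size across ranges.

```javascript
// Sketch only: estimate per-shard data size from $bucketAuto output.
// avgObjSizeBytes is an assumed input (for example, avgObjSize from collection stats).
function estimateShardSizes(buckets, avgObjSizeBytes) {
  return buckets.map((bucket, i) => ({
    shard: `shard${i + 1}`, // hypothetical label: one bucket per shard
    min: bucket._id.min,
    max: bucket._id.max,
    docs: bucket.count,
    estimatedBytes: bucket.count * avgObjSizeBytes,
  }));
}

// Bucket counts as reported by the $bucketAuto run in this ticket.
const buckets = [
  { _id: { min: '0000010055728383', max: '3333511084616141' }, count: 333333333 },
  { _id: { min: '3333511084616141', max: '6666504640041919' }, count: 333333333 },
  { _id: { min: '6666504640041919', max: '9999999995035362' }, count: 333333334 },
];

// 1024 bytes is a made-up average document size for illustration.
const estimates = estimateShardSizes(buckets, 1024);
console.log(estimates);
```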
We should add this information to analyzeShardKey() because we already scan the index to generate split points.
Atlas [mongos] testdata> db.bigCollection.aggregate([
  {$sort: {card_number: 1}},
  {
    $bucketAuto: {
      groupBy: "$card_number", // shard key
      buckets: 3, // since you have 3 shards
      output: {
        count: { $sum: 1 } // count docs per bucket
      }
    }
  }
])
[
  { _id: { min: '0000010055728383', max: '3333511084616141' }, count: 333333333 },
  { _id: { min: '3333511084616141', max: '6666504640041919' }, count: 333333333 },
  { _id: { min: '6666504640041919', max: '9999999995035362' }, count: 333333334 }
]

db.bigCollection.aggregate([
  {$sort: {card_number: 1}},
  {$facet: {
    shard1: [{$match: {card_number: {$gte: '0000010055728383', $lt: '3333511084616141'}}}, {$group: {_id: "$card_number"}}, {$count: "count"}],
    shard2: [{$match: {card_number: {$gte: '3333511084616141', $lt: '6666504640041919'}}}, {$group: {_id: "$card_number"}}, {$count: "count"}],
    shard3: [{$match: {card_number: {$gte: '6666504640041919', $lt: '9999999995035362'}}}, {$group: {_id: "$card_number"}}, {$count: "count"}]
  }}
])
[
  {
    shard1: [ { count: 333333306 } ],
    shard2: [ { count: 333333320 } ],
    shard3: [ { count: 333333317 } ]
  }
]
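The hand-written $facet ranges above could instead be generated from the $bucketAuto output. The following is a minimal sketch, assuming a single-field shard key. The helper name `buildDistinctCountFacet` and the `shardN` facet names are hypothetical. It differs from the transcript in one respect: it uses `$lte` for the last range, because $bucketAuto treats the `max` of the final bucket as inclusive.

```javascript
// Sketch only: build a $facet pipeline that counts distinct shard key values
// per $bucketAuto range. Assumes a single-field shard key.
function buildDistinctCountFacet(shardKeyField, buckets) {
  const facets = {};
  buckets.forEach((bucket, i) => {
    // $bucketAuto upper bounds are exclusive, except for the final bucket.
    const upperOp = i === buckets.length - 1 ? '$lte' : '$lt';
    facets[`shard${i + 1}`] = [
      { $match: { [shardKeyField]: { $gte: bucket._id.min, [upperOp]: bucket._id.max } } },
      { $group: { _id: `$${shardKeyField}` } }, // one group per distinct key value
      { $count: 'count' },
    ];
  });
  return [{ $facet: facets }];
}

// Bucket boundaries as reported by the $bucketAuto run above.
const bucketAutoResult = [
  { _id: { min: '0000010055728383', max: '3333511084616141' }, count: 333333333 },
  { _id: { min: '3333511084616141', max: '6666504640041919' }, count: 333333333 },
  { _id: { min: '6666504640041919', max: '9999999995035362' }, count: 333333334 },
];

const facetPipeline = buildDistinctCountFacet('card_number', bucketAutoResult);
console.log(JSON.stringify(facetPipeline, null, 2));
```

In mongosh, the resulting array could be passed to `db.bigCollection.aggregate(facetPipeline)`. It still scans the collection once per run, so it does not remove the cost that motivates this request.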