- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Cluster Scalability
Users currently have no direct way to find out how much data will live on each shard after they shard a collection. A workaround is to run an aggregation with $bucketAuto, using one bucket per shard and grouping by the candidate shard key. This returns the number of documents in each shard key range, which can then be used to estimate the data size per shard. In addition, users have to run $facet to find the number of distinct shard key values in each range returned by $bucketAuto.
We should add this information to analyzeShardKey() because we already scan the index to generate split points.
Atlas [mongos] testdata> db.bigCollection.aggregate([
...   { $sort: { card_number: 1 } },
...   {
...     $bucketAuto: {
...       groupBy: "$card_number", // shard key
...       buckets: 3, // one bucket per shard (3 shards here)
...       output: {
...         count: { $sum: 1 } // count docs per bucket
...       }
...     }
...   }
... ])
[
{
_id: { min: '0000010055728383', max: '3333511084616141' },
count: 333333333
},
{
_id: { min: '3333511084616141', max: '6666504640041919' },
count: 333333333
},
{
_id: { min: '6666504640041919', max: '9999999995035362' },
count: 333333334
}
]
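To turn the bucket counts above into a rough data size per shard, each count can be multiplied by the collection's average document size. The snippet below is a minimal sketch of that arithmetic in plain JavaScript. The helper name `estimateBucketSizes` and the 200-byte average are assumptions for illustration only; in practice the average would come from the collection statistics (for example the `avgObjSize` field), and the estimate is only meaningful if document sizes are roughly uniform across the key range.

```javascript
// Hypothetical helper (not part of any MongoDB API): estimate the data size
// of each $bucketAuto bucket as docCount * average document size.
function estimateBucketSizes(buckets, avgObjSizeBytes) {
  return buckets.map((b) => ({
    min: b._id.min,
    max: b._id.max,
    count: b.count,
    estimatedBytes: b.count * avgObjSizeBytes,
  }));
}

// Bucket output copied from the $bucketAuto result in this ticket.
const buckets = [
  { _id: { min: '0000010055728383', max: '3333511084616141' }, count: 333333333 },
  { _id: { min: '3333511084616141', max: '6666504640041919' }, count: 333333333 },
  { _id: { min: '6666504640041919', max: '9999999995035362' }, count: 333333334 },
];

// ASSUMPTION: illustrative value only, not measured on the real collection.
const ASSUMED_AVG_OBJ_SIZE_BYTES = 200;

const estimates = estimateBucketSizes(buckets, ASSUMED_AVG_OBJ_SIZE_BYTES);
for (const e of estimates) {
  const gib = (e.estimatedBytes / 1024 ** 3).toFixed(1);
  console.log(`[${e.min}, ${e.max}]: ${e.count} docs, ~${gib} GiB`);
}
```

This is the manual calculation the ticket asks analyzeShardKey() to report directly.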
db.bigCollection.aggregate([
{$sort: {card_number:1}},
{$facet: {
shard1: [{$match: {card_number:{$gte:'0000010055728383',$lt:'3333511084616141' }}},{$group: {_id: "$card_number"}},{$count: "count"}],
shard2: [{$match: {card_number:{$gte:'3333511084616141',$lt:'6666504640041919' }}},{$group: {_id: "$card_number"}},{$count: "count"}],
shard3: [{$match: {card_number:{$gte:'6666504640041919',$lt:'9999999995035362' }}},{$group: {_id: "$card_number"}},{$count: "count"}]
}}
])
[
{
shard1: [ { count: 333333306 } ],
shard2: [ { count: 333333320 } ],
shard3: [ { count: 333333317 } ]
}
]
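The two outputs can be combined into a documents-per-distinct-key ratio for each range, which shows how close the candidate shard key is to unique. The following is a minimal sketch in plain JavaScript using the numbers from this ticket; the helper name `docsPerDistinctKey` is hypothetical. One caveat: the last bucket returned by $bucketAuto includes its max value, while the $facet query above uses $lt for every range, so the shard3 distinct count may exclude the maximum key value.

```javascript
// Hypothetical helper: for each range, divide the document count by the
// number of distinct shard key values found in that range.
function docsPerDistinctKey(docCounts, distinctCounts) {
  if (docCounts.length !== distinctCounts.length) {
    throw new Error('both arrays must describe the same ranges');
  }
  return docCounts.map((docs, i) => docs / distinctCounts[i]);
}

// Document counts from the $bucketAuto output in this ticket.
const docCounts = [333333333, 333333333, 333333334];
// Distinct shard key counts from the $facet output in this ticket.
const distinctCounts = [333333306, 333333320, 333333317];

const ratios = docsPerDistinctKey(docCounts, distinctCounts);
ratios.forEach((r, i) => {
  // A ratio close to 1 means almost every document has its own key value.
  console.log(`shard${i + 1}: ${r.toFixed(8)} docs per distinct key`);
});
```

A ratio far above 1 in any range would indicate frequently repeated key values, which can lead to chunks that cannot be split further.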