Details
Description
ISSUE SUMMARY
The mongoS routing of scatter/gather operations iterates over the boundaries for the collection’s chunks sorted in ascending order. As the list is iterated, shards that own a chunk for the collection are discovered. If all shards in the cluster that own a chunk for the collection are discovered partway through the iteration, the iteration early-exits and the mongoS targets all of these shards.
This logic is suboptimal if a large number of the sorted chunks are initially owned by a subset of shards. This can result in a large number of iterations needed to discover all shards. As an example, on a 2-shard cluster with a collection where the first 100k sorted chunks are owned on shard 1, the mongoS iterates over the boundaries for 100k chunks when routing a scatter/gather operation.
USER IMPACT
On a sharded collection where a large number of the sorted chunks (e.g. 100k) are initially owned by a subset of shards, a scatter/gather operation can display increased latency (e.g. milliseconds) or increased mongoS CPU utilization.
This chunk distribution can result from adding a shard to the cluster, as contiguous low chunk ranges get migrated into the new shard.
WORKAROUND
Rearrange the chunks in a collection so that the initial chunks in the sorted list discover all the shards. The script below can be used to verify the number of iterations required to route a scatter/gather operation.
var ns = 'mydb.mycoll'
|
var nShards = db.getSiblingDB('config').shards.count();
|
var count = 0;
|
var shards = [];
|
print('Iterating ' + ns + ' chunks sorted in ascending order...')
|
var cursor = db.getSiblingDB('config').chunks.aggregate([{$match : {ns: ns}}, {$sort : { min : 1 }}]);
|
while (cursor.hasNext() && shards.length != nShards) {
|
let nextShard = cursor.next().shard;
|
if (!shards.includes(nextShard)) {
|
shards.push(nextShard);
|
print(" discovered shard " + nextShard + " at chunk iteration " + count)
|
};
|
++count;
|
}
|
print('Done. ' + count + ' chunk iterations to route a scatter/gather operation.')
|
Attachments
Issue Links
- is duplicated by
-
SERVER-47222 Mongos high cpu usage on getShardIdsForRange while dealing shard key range query
-
- Closed
-