[SERVER-46487] The mongos routing for scatter/gather ops can have unbounded latency Created: 28/Feb/20  Updated: 29/Oct/23  Resolved: 16/Apr/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.0, 4.0.0, 4.2.0
Fix Version/s: 4.0.19, 4.2.7, 3.6.19, 4.4.0-rc3, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Josef Ahmad Assignee: Gregory Noma
Resolution: Fixed Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongos_routing_opt.patch
Issue Links:
Duplicate
is duplicated by SERVER-47222 Mongos high cpu usage on getShardIdsF... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4, v4.2, v4.0, v3.6
Sprint: Sharding 2020-04-20

Description
Issue Status as of April 2020

ISSUE SUMMARY

The mongoS routing of scatter/gather operations iterates over the boundaries for the collection’s chunks sorted in ascending order. As the list is iterated, shards that own a chunk for the collection are discovered. If all shards in the cluster that own a chunk for the collection are discovered partway through the iteration, the iteration early-exits and the mongoS targets all of these shards.

This logic is suboptimal if a large number of the sorted chunks are initially owned by a subset of shards. This can result in a large number of iterations needed to discover all shards. As an example, on a 2-shard cluster with a collection where the first 100k sorted chunks are owned by shard 1, the mongoS iterates over the boundaries for 100k chunks when routing a scatter/gather operation.
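
For illustration, here is a minimal, self-contained C++ sketch of the discovery loop and its worst case. This is not the server code: the chunk map, shard names, and variable names are all mocked; only the iterate-and-early-exit shape comes from the summary above.

#include <cstddef>
#include <iostream>
#include <map>
#include <set>
#include <string>

int main() {
    // Mocked routing table: sorted chunk lower bound -> owning shard.
    // The first 100k chunks belong to shard1 and the last chunk to
    // shard2, mirroring the 2-shard example above.
    std::map<long, std::string> chunkMap;
    for (long i = 0; i < 100000; ++i)
        chunkMap[i] = "shard1";
    chunkMap[100000] = "shard2";

    const std::size_t nShardsInCluster = 2;
    std::set<std::string> shardIds;
    std::size_t iterations = 0;

    // Iterate chunks in ascending order; early-exit once every shard
    // owning a chunk for the collection has been discovered.
    for (const auto& kv : chunkMap) {
        ++iterations;
        shardIds.insert(kv.second);
        if (shardIds.size() == nShardsInCluster)
            break;
    }

    // Prints 100001: the loop cannot stop before reaching shard2's chunk.
    std::cout << iterations << " iterations to discover all shards\n";
}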

USER IMPACT

On a sharded collection where a large number of the sorted chunks (e.g. 100k) are initially owned by a subset of shards, a scatter/gather operation can exhibit increased latency (e.g. added milliseconds per operation) or increased mongoS CPU utilization.
This chunk distribution can arise after adding a shard to the cluster, as contiguous low-key chunk ranges are migrated to the new shard.

WORKAROUND

Rearrange the chunks in the collection so that the first chunks in the sorted list are spread across all the shards. The script below can be used to verify the number of iterations required to route a scatter/gather operation.

// Run against a mongos; counts how many sorted-chunk iterations are needed
// before every shard in the cluster has been discovered for this namespace.
var ns = 'mydb.mycoll';  // set to the sharded collection to inspect
var nShards = db.getSiblingDB('config').shards.count();
var count = 0;
var shards = [];
print('Iterating ' + ns + ' chunks sorted in ascending order...');
var cursor = db.getSiblingDB('config').chunks.aggregate(
    [{$match: {ns: ns}}, {$sort: {min: 1}}]);
while (cursor.hasNext() && shards.length != nShards) {
    let nextShard = cursor.next().shard;
    if (!shards.includes(nextShard)) {
        shards.push(nextShard);
        print('  discovered shard ' + nextShard + ' at chunk iteration ' + count);
    }
    ++count;
}
print('Done. ' + count + ' chunk iterations to route a scatter/gather operation.');
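
As a point of reference, in the 2-shard example from the summary above (first 100k sorted chunks owned by shard 1), the script would report the second shard being discovered only around chunk iteration 100000, confirming the pathological chunk layout.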



Comments
Comment by Githook User [ 24/Jun/20 ]

Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)

Message: SERVER-46487 Add fast path to getShardIdsForRange() when chunk range is [MinKey, MaxKey]

(cherry picked from commit d9d92a15561dc84223d302475e09d0fa91309550)
Branch: v3.6
https://github.com/mongodb/mongo/commit/1dc3198b115c4c0aaee87e2cec56b65c5c4ac620

Comment by Githook User [ 23/Apr/20 ]

Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)

Message: SERVER-46487 Add fast path to getShardIdsForRange() when chunk range is [MinKey, MaxKey]

(cherry picked from commit d9d92a15561dc84223d302475e09d0fa91309550)
Branch: v4.0
https://github.com/mongodb/mongo/commit/2124aa0b6673f4316a99af28c2eacc8c3d8b1441

Comment by Githook User [ 23/Apr/20 ]

Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)

Message: SERVER-46487 Add fast path to getShardIdsForRange() when chunk range is [MinKey, MaxKey]

(cherry picked from commit d9d92a15561dc84223d302475e09d0fa91309550)
Branch: v4.4
https://github.com/mongodb/mongo/commit/ad478267c27b2b5f36cb39ad8c150081eaec9644

Comment by Githook User [ 23/Apr/20 ]

Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)

Message: SERVER-46487 Add fast path to getShardIdsForRange() when chunk range is [MinKey, MaxKey]

(cherry picked from commit d9d92a15561dc84223d302475e09d0fa91309550)
Branch: v4.2
https://github.com/mongodb/mongo/commit/96d26475d7fb4ff6937c75b822b8ab573b63f860

Comment by Githook User [ 16/Apr/20 ]

Author: Gregory Noma <gregory.noma@gmail.com> (gregorynoma)

Message: SERVER-46487 Add fast path to getShardIdsForRange() when chunk range is [MinKey, MaxKey]
Branch: master
https://github.com/mongodb/mongo/commit/d9d92a15561dc84223d302475e09d0fa91309550

Comment by Cen Zheng [ 14/Apr/20 ]

Hi Max,

"Queries that aren't targeted by the shard key will be broadcast to all shards which own chunks for the collection. The list of shards which own chunks for the collection corresponds to the keys of the RoutingTableHistory::_shardVersions map. We're looking to introduce another optimization (again when the atClusterTime read concern option hasn't been specified) such that if _rt->overlappingRanges(min, max, true) were going to cause all chunks to be iterated over, then we could instead populate the shardIds set with the keys of the RoutingTableHistory::_shardVersions map and skip doing the iteration itself."

This is exactly what I meant! I will upload my patch here for your reference. In fact, I didn't realize there was an existing checkAllElementsAreOfType(MinKey, ...) function, so I had added one to the BSONObj class; now I can use the existing one instead!

Thank you, and please forward my regards to Esha!

Comment by Max Hirschhorn [ 14/Apr/20 ]

Assigning this one to you, gregory.noma.

This logic iterates over the chunks sorted by their string representation. As the sorted list is iterated, shards that own a chunk for the collection are discovered. If all shards in the cluster are discovered partway through the iteration, the logic early-exits and targets all shards.

The early exiting refers to this part of ChunkManager::getShardIdsForRange(). As noted in Cen Zheng's comment above, the !_clusterTime part of that condition means this optimization only applies when using the node's view of the latest routing information, and not when the atClusterTime read concern option has been specified.

Queries that aren't targeted by the shard key will be broadcast to all shards which own chunks for the collection. The list of shards which own chunks for the collection corresponds to the keys of the RoutingTableHistory::_shardVersions map. We're looking to introduce another optimization (again when the atClusterTime read concern option hasn't been specified) such that if _rt->overlappingRanges(min, max, true) were going to cause all chunks to be iterated over, then we could instead populate the shardIds set with the keys of the RoutingTableHistory::_shardVersions map and skip doing the iteration itself. You could think of it as if RoutingTableHistory::overlappingRanges(min, max, true) returned {RoutingTableHistory::_chunkMap.begin(), RoutingTableHistory::_chunkMap.end()}.
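
A minimal sketch of that idea, with RoutingTableHistory mocked down to just the _shardVersions map (only the map name and the keys-as-shard-list idea come from the comment above; everything else is illustrative):

#include <iostream>
#include <map>
#include <set>
#include <string>

// Mocked-down RoutingTableHistory: only the map whose keys are the
// shards owning chunks for the collection (versions elided to int).
struct RoutingTableHistorySketch {
    std::map<std::string, int> _shardVersions;

    // Proposed shortcut: the target shard set is exactly the keys of
    // _shardVersions, so no chunk-range iteration is needed at all.
    void getAllShardIds(std::set<std::string>* shardIds) const {
        for (const auto& kv : _shardVersions)
            shardIds->insert(kv.first);  // O(#shards), not O(#chunks)
    }
};

int main() {
    RoutingTableHistorySketch rt{{{"shard1", 5}, {"shard2", 3}}};
    std::set<std::string> shardIds;
    rt.getAllShardIds(&shardIds);
    std::cout << shardIds.size() << " shards targeted\n";  // prints 2
}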

I'll leave whether we want to express the optimization that way for you to figure out. Esha had also pointed me to the checkAllElementsAreOfType(MinKey, ...) and checkAllElementsAreOfType(MaxKey, ...) calls that we do here as a technique we could use to avoid calling RoutingTableHistory::overlappingRanges() by comparing the min and max BSONObjs directly. We'll likely want to backport these changes, so their simplicity (e.g. ease of reasoning about their correctness) is also an important factor.
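
For illustration, here is a minimal, self-contained version of that bound check, with BSON element types mocked as an enum. The real checkAllElementsAreOfType() operates on BSONObj elements; this toy only demonstrates the "compare the bounds directly" idea:

#include <algorithm>
#include <cassert>
#include <vector>

// Mocked BSON element types; the server's checkAllElementsAreOfType()
// performs the same "are all elements this type?" test over real BSON.
enum class BSONType { MinKey, MaxKey, Other };

static bool checkAllElementsAreOfType(BSONType t,
                                      const std::vector<BSONType>& obj) {
    return std::all_of(obj.begin(), obj.end(),
                       [t](BSONType e) { return e == t; });
}

// The range spans the whole collection iff every field of the lower
// bound is MinKey and every field of the upper bound is MaxKey.
static bool isWholeCollectionRange(const std::vector<BSONType>& min,
                                   const std::vector<BSONType>& max) {
    return checkAllElementsAreOfType(BSONType::MinKey, min) &&
           checkAllElementsAreOfType(BSONType::MaxKey, max);
}

int main() {
    assert(isWholeCollectionRange({BSONType::MinKey, BSONType::MinKey},
                                  {BSONType::MaxKey, BSONType::MaxKey}));
    assert(!isWholeCollectionRange({BSONType::MinKey, BSONType::Other},
                                   {BSONType::MaxKey, BSONType::MaxKey}));
}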

As part of this work, we should add a microbenchmark to chunk_manager_refresh_bm.cpp in order to demonstrate the performance improvement being made. I don't believe we have an existing case where ChunkManager::getShardIdsForRange() is called for the range [MinKey, MaxKey]. The optimalShardSelector() chunk distribution should see the most benefit, because the case where each shard owns a contiguous, in-order series of ranges is when it currently takes the longest to discover all shards by iterating over the ranges.
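
As a sketch of what such a microbenchmark could measure, here is a self-contained Google Benchmark toy over a mocked chunk map. It is not the real chunk_manager_refresh_bm.cpp fixture; the benchmark name, the two-shard layout, and the chunk counts are all assumptions chosen to reproduce the contiguous-ranges worst case:

#include <benchmark/benchmark.h>

#include <map>
#include <set>
#include <string>

static void BM_DiscoverShardsByIteration(benchmark::State& state) {
    const long chunksPerShard = state.range(0);
    std::map<long, std::string> chunkMap;  // chunk lower bound -> shard
    for (long i = 0; i < chunksPerShard; ++i)
        chunkMap[i] = "shard1";
    for (long i = 0; i < chunksPerShard; ++i)
        chunkMap[chunksPerShard + i] = "shard2";

    for (auto _ : state) {
        std::set<std::string> shardIds;
        for (const auto& kv : chunkMap) {
            shardIds.insert(kv.second);
            if (shardIds.size() == 2)
                break;  // early exit only fires at the halfway point
        }
        benchmark::DoNotOptimize(shardIds);
    }
}
// Contiguous ranges per shard: the worst case described above.
BENCHMARK(BM_DiscoverShardsByIteration)->Arg(1000)->Arg(100000);
BENCHMARK_MAIN();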

Comment by Cen Zheng [ 13/Apr/20 ]

Hi, 

As I mentioned in SERVER-47222:

"To resolve this issue, I think we can have a fast path for getShardIdsForRange() when we are not reading from a snapshot and the query range (all shard key fields) is [MinKey, MaxKey]; we only need to return all ShardIds through getAllShardIds()."

What do you think of this optimization? Is it possible? Looking forward to your feedback.

Thanks!

Comment by KV Prasad [ 02/Mar/20 ]

We're looking for a prompt resolution of this and hope to see it implemented in an upcoming patch release.
