The sharding balancer currently issues listDatabases against every single shard in order to access the totalSize value. This value is used for ensuring that a shard's storage maxSize is not exceeded for customers which have that value set.
The listDatabases call is quite heavy, especially for nodes with large number of databases/collections since it will fstat every single file under the instance.
There are a number of optimizations we can make in order to make this statistics gathering less expensive (listed in order of preference):
- Only gather storage statistics for shards which have maxSize set (implemented by this ticket)
- Issue the listDatabases call in parallel against all shards so it doesn't take so different shards' execution overlaps
- Cache the per-shard statistics so that they are not collected on every single round/moveChunk invocation
- Collect the per-shard statistics asynchronously so that multiple concurrent moveChunk requests can benefit
- Add a parameter to listDatabases to allow it to return cached data size instead of every time {{fstat}}ing all the files
- related to
-
SERVER-34819 Optimize the sharding balancer's cluster statistics gathering
- Closed