[SERVER-34819] Optimize the sharding balancer's cluster statistics gathering Created: 03/May/18  Updated: 06/Feb/24

Status: Open
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Tommaso Tocci
Resolution: Unresolved Votes: 0
Labels: QWB
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-30060 Make the balancer gather storage stat... Closed
Assigned Teams:
Sharding EMEA
Sprint: Sharding EMEA 2023-10-16, Sharding EMEA 2023-10-30, CAR Team 2023-11-13, CAR Team 2023-11-27, CAR Team 2023-12-11, CAR Team 2023-12-25, CAR Team 2024-01-08, CAR Team 2024-01-22, CAR Team 2024-02-05, CAR Team 2024-02-19

 Description   

The sharding balancer currently issues listDatabases against every single shard in order to obtain the totalSize value. This value is used to ensure that a shard's storage does not exceed maxSize for users who have that setting configured.

The listDatabases call is quite heavy, especially for nodes with a large number of databases/collections, since it fstats every single file under the instance.

There are a number of optimizations we could make to render this statistics gathering less expensive in the presence of maxSize (listed in order of preference):

  • Add a parameter to listDatabases which allows it to return the cached data size instead of {{fstat}}ing all the files on every call
  • Issue the listDatabases call in parallel against all shards so that the different shards' executions overlap
  • Cache the per-shard statistics so that they are not collected on every single balancer round/moveChunk invocation
  • Collect the per-shard statistics asynchronously so that multiple concurrent moveChunk requests can benefit
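The last three ideas combine naturally into one scheme: fan the listDatabases calls out in parallel, and keep the results in a TTL cache so concurrent moveChunk requests and back-to-back balancer rounds reuse one collection pass. A rough sketch of that shape, in Python with a made-up ShardStatsCache type (the real balancer is C++ inside the config server, and fetch_total_size stands in for the remote listDatabases round trip):

```python
import time
from concurrent.futures import ThreadPoolExecutor

class ShardStatsCache:
    """Gathers per-shard totalSize in parallel and caches each result
    for ttl seconds (illustrative; not the server's actual code)."""

    def __init__(self, fetch_total_size, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch_total_size   # shard name -> totalSize bytes
        self._ttl = ttl
        self._clock = clock
        self._cache = {}                 # shard -> (timestamp, totalSize)

    def total_sizes(self, shards):
        now = self._clock()
        # Only shards with missing or expired entries need a new call.
        stale = [s for s in shards
                 if s not in self._cache or now - self._cache[s][0] > self._ttl]
        if stale:
            # Overlap the expensive listDatabases calls instead of
            # issuing them one shard at a time.
            with ThreadPoolExecutor(max_workers=len(stale)) as pool:
                for shard, size in zip(stale, pool.map(self._fetch, stale)):
                    self._cache[shard] = (now, size)
        return {s: self._cache[s][1] for s in shards}
```

The TTL is the tunable trade-off: a longer one makes balancer rounds cheaper but lets the maxSize check act on staler sizes.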


 Comments   
Comment by Kaloian Manassiev [ 03/May/18 ]

I can think of multiple ways to do that, but I am not sure which one is realistic and/or independent of the storage engine.

For example - can the sizes be lazily updated as WT writes to its files, and only for the files that it writes to/garbage collects? I'm not sure this is possible in real time, but perhaps it can be combined with an asynchronous thread which runs periodically and only {{fstat}}s the files which were written to.
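The dirty-file idea above could look roughly like this: the write path only marks files dirty, and a periodic background pass re-stats just that subset instead of every file under the dbpath. A sketch under those assumptions (names are illustrative; whether WT can cheaply mark files dirty from its write/eviction path is exactly the open question):

```python
import os

class LazySizeTracker:
    """Caches one size per file and re-stats only the files marked
    dirty since the last refresh (illustrative sketch)."""

    def __init__(self):
        self._sizes = {}   # path -> cached size in bytes
        self._dirty = set()

    def mark_dirty(self, path):
        # Called from the write path; O(1), no I/O.
        self._dirty.add(path)

    def refresh(self):
        # Called periodically by a background thread; stats only the
        # dirty subset rather than every file on the instance.
        for path in self._dirty:
            self._sizes[path] = os.stat(path).st_size
        self._dirty.clear()

    def total_size(self):
        return sum(self._sizes.values())
```

The returned total is stale by at most one refresh interval, which seems acceptable for a maxSize heuristic.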

Alternatively, can we use the same mechanism which is used for the fast count, combined with some multiplier of the average data size? This is what we use for chunk size estimation.

Comment by Eric Milkie [ 03/May/18 ]

I'm not sure how a cache would work. The calculation today locks each database and then iterates through every collection, adding its size plus all the collection's index sizes together. The expense here probably comes from iterating through all collections and indexes on the entire system. If this value were cached, when would it get updated?

Comment by Kaloian Manassiev [ 03/May/18 ]

milkie, the sharding balancer uses the totalSize value from listDatabases in order to estimate the data usage on a shard, and running this command in the presence of a large number of collections/indexes is very heavy. What is the possibility of the storage team exposing a cachedTotalSize value which doesn't fstat every single file on the instance?

Generated at Thu Feb 08 04:37:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.