[SERVER-68855] Optimize $collStats for $shardedDataDistribution. Created: 16/Aug/22  Updated: 29/Oct/23  Resolved: 11/Nov/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 6.2.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Pol Castuera (Inactive) Assignee: Pol Pinol
Resolution: Fixed Votes: 0
Labels: shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-70859 Optimize collStats to not retrieve al... Closed
is depended on by SERVER-70859 Optimize collStats to not retrieve al... Closed
Gantt Dependency
Related
Backwards Compatibility: Fully Compatible
Sprint: Sharding EMEA 2022-10-31, Sharding EMEA 2022-11-14
Participants:
Story Points: 3

 Description   

Motivation: Performance Improvement.

Description: Two candidate designs:

  1. Add a new input parameter 'dataDistribution' to $collStats that returns only the data needed by $shardedDataDistribution and similar callers: count, avgObjSize and numOrphanDocuments. Nothing else is retrieved.
  2. Analyze the entire pipeline (via the existing analysis) to determine which paths it references. If it references only a handful of paths and omitting everything else cannot affect the results, $collStats can skip computing the rest. For example, when a request with $collStats or $allCollectionStats also includes a $project, inspect whether the $collStats stage is followed by a $project and, from that projection, determine which output $collStats needs to produce.
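The second design can be illustrated with a minimal sketch in plain JavaScript. It is not taken from the server code base: the helper name neededCollStatsPaths is hypothetical, the field paths in the usage lines are illustrative, and the real analysis would run inside the server's pipeline optimizer rather than on a shell-side array. The sketch only shows the decision rule: narrow the $collStats output when the next stage is a pure inclusion $project, and otherwise fall back to producing everything.

```javascript
// Hypothetical helper (an assumption for illustration, not a server API).
// Returns the list of paths that $collStats has to produce, or null when
// no narrowing is safe and the full output must be computed.
function neededCollStatsPaths(pipeline) {
    // The rule only applies when $collStats is the first stage and has a successor.
    if (pipeline.length < 2 || !("$collStats" in pipeline[0])) {
        return null;
    }
    const project = pipeline[1].$project;
    if (project === undefined) {
        // The next stage is not a $project: assume every field may be used.
        return null;
    }
    const paths = [];
    for (const [path, spec] of Object.entries(project)) {
        if (path === "_id") {
            continue;  // _id is not part of the statistics output.
        }
        // Only plain inclusions (1 or true) are understood by this sketch;
        // exclusions and computed fields disable the optimization.
        if (spec !== 1 && spec !== true) {
            return null;
        }
        paths.push(path);
    }
    return paths.length > 0 ? paths : null;
}

// Usage: a pipeline that only needs two storage statistics.
const examplePipeline = [
    {$collStats: {storageStats: {}}},
    {$project: {"storageStats.count": 1, "storageStats.avgObjSize": 1}},
];
console.log(neededCollStatsPaths(examplePipeline));
```

Returning null for anything the sketch does not recognize keeps the rule conservative: a pipeline is only narrowed when the projection makes the required output unambiguous.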

 

Current performance without the $collStats optimization (sharded collections; each line shows the number of collections followed by the elapsed time as hh:mm:ss):

[js_test:all_collection_stats] performance test $shardedDataDistribution: 1000 00:00:05
[js_test:all_collection_stats] performance test $shardedDataDistribution: 2000 00:00:10
[js_test:all_collection_stats] performance test $shardedDataDistribution: 3000 00:00:16
[js_test:all_collection_stats] performance test $shardedDataDistribution: 4000 00:00:23
[js_test:all_collection_stats] performance test $shardedDataDistribution: 5000 00:00:33
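To make the trend in these measurements easier to read, the following small sketch converts each log line above into an average cost per collection. The numbers are copied from the log; the helper name msPerCollection is only an illustration.

```javascript
// Measurements copied from the log lines above: [collections, elapsed seconds].
const runs = [[1000, 5], [2000, 10], [3000, 16], [4000, 23], [5000, 33]];

// Average latency per reported collection, in milliseconds.
function msPerCollection([collections, seconds]) {
    return (seconds * 1000) / collections;
}

for (const run of runs) {
    console.log(`${run[0]} collections: ${msPerCollection(run).toFixed(2)} ms per collection`);
}
```

The per-collection cost goes from 5.00 ms at 1000 collections to 6.60 ms at 5000, so the total time grows slightly faster than linearly with the number of collections.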

 

Performance test used:

(function() {
'use strict';

// Total number of sharded collections to create over the whole test.
const numberOfCollections = 5000;

// Configure the initial sharded cluster.
const st = new ShardingTest({shards: 3});
const mongos = st.s;
const dbName = "test";

// Number of collections added before each timed measurement.
const batchSize = 1000;
let total = 0;
while (total < numberOfCollections) {
    // Shard another batch of empty collections so the stage has more entries to report.
    for (let i = 0; i < batchSize; i++) {
        const coll = "coll" + total;
        assert(st.adminCommand({shardcollection: dbName + "." + coll, key: {skey: 1}}));
        total++;
    }

    // Time a full drain of the $shardedDataDistribution cursor.
    let numDocs = 0;
    const start = new Date();
    const cursor = mongos.getDB("admin").aggregate([{$shardedDataDistribution: {}}]);
    while (cursor.hasNext()) {
        cursor.next();
        numDocs++;
    }
    const end = new Date();

    // Format the elapsed milliseconds as hh:mm:ss.
    const time = new Date(end - start).toISOString().slice(11, 19);
    print(`performance test $shardedDataDistribution: ` + total + ` ` + time);

    // One document per test collection plus one extra, presumably for a sharded
    // system collection such as config.system.sessions.
    assert.eq(numDocs, total + 1);
}

st.stop();
})();



 Comments   
Comment by Githook User [ 11/Nov/22 ]

Author: Pol Piñol Castuera (username: PolPinol, email: 67922619+PolPinol@users.noreply.github.com)

Message: SERVER-68855 Optimize $collStats for $shardedDataDistribution
Branch: master
https://github.com/mongodb/mongo/commit/b4414b6651c8c815d8629f4655e606d4d2046537

Comment by Garaudy Etienne [ 19/Aug/22 ]

Cloud InTel thinks this would help greatly with the PM-2323 changes we've made.
