-
Type: Improvement
-
Resolution: Unresolved
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
Query Optimization
Specifying a sampleRate < 1 does not result in any noticeable performance improvement. The reason is that the command:
db.foo.runCommand({analyze: "foo", key: "a", sampleRate: 0.001});
Results in an internal pipeline that looks like this:
[{"$project":{"val":"$a"}}, {"$group":{"_id":"a","statistics":{ "$_internalConstructStats":{ "val":"$$ROOT", "sampleRate":Double(0.001), "numberBuckets":100 } }}}]
This plan is a COLLSCAN plus a $group over the entire collection, so every document is still read regardless of the sample rate.
Instead, the internal pipeline should look like this:
[{$sample: {size: 0.001 * db.foo.estimatedDocumentCount()}}, {"$project":{"val":"$a"}}, {"$group":{"_id":"a","statistics":{ "$_internalConstructStats":{ "val":"$$ROOT", "sampleRate":Double(0.001), "numberBuckets":100 } }}}]
This pipeline has a plan with MULTI_ITERATOR and sampleFromRandomCursor before the $project and $group, and it runs substantially faster.
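As a rough illustration of the proposed pipeline shape, the sketch below builds it in plain JavaScript. The helper name buildSampledAnalyzePipeline, its parameters, the (0, 1] range check, and the rounding of the sample size are all assumptions made for this example, not the server's implementation. In the shell, the estimatedCount argument would come from db.foo.estimatedDocumentCount().

```javascript
// Hypothetical helper (not server code): builds the sampled pipeline proposed above.
// key            - field to analyze, e.g. "a"
// sampleRate     - fraction of documents to sample; assumed to lie in (0, 1]
// estimatedCount - e.g. the result of db.foo.estimatedDocumentCount()
// numberBuckets  - bucket count passed through to the stats accumulator
function buildSampledAnalyzePipeline(key, sampleRate, estimatedCount, numberBuckets = 100) {
  if (!(sampleRate > 0 && sampleRate <= 1)) {
    throw new RangeError("sampleRate must be in (0, 1]");
  }
  // $sample needs a positive integer size, so round and clamp to at least 1
  // (the rounding rule is an assumption of this sketch).
  const size = Math.max(1, Math.round(sampleRate * estimatedCount));
  return [
    { $sample: { size } },
    { $project: { val: "$" + key } },
    {
      $group: {
        _id: key,
        statistics: {
          $_internalConstructStats: { val: "$$ROOT", sampleRate, numberBuckets },
        },
      },
    },
  ];
}

// Example: a 0.1% sample of a collection with roughly one million documents.
const pipeline = buildSampledAnalyzePipeline("a", 0.001, 1000000);
console.log(JSON.stringify(pipeline, null, 2));
```

Placing $sample first is what lets the planner use a random cursor instead of scanning the whole collection; the $project and $group stages are unchanged from the current pipeline.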
- related to
-
SERVER-71951 Identify best approach to implement sampling for analyze command
- Closed