Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- cbr_ce_sources

Assigned Teams: Query Optimization
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Specifying a sampleRate < 1 does not result in any noticeable performance improvement. The reason is that the command:

 db.foo.runCommand({analyze: "foo", key: "a", sampleRate: 0.001});

results in an internal pipeline that looks like this:

[{"$project":{"val":"$a"}},
{"$group":{"_id":"a","statistics":{
    "$_internalConstructStats":{
        "val":"$$ROOT",
        "sampleRate":Double(0.001),
        "numberBuckets":100
    }
}}}]

This is a COLLSCAN plus $group over the entire collection.

Instead, the internal pipeline should look like this:

[{"$sample":{"size": 0.001 * db.foo.estimatedDocumentCount()}},
{"$project":{"val":"$a"}},
{"$group":{"_id":"a","statistics":{
    "$_internalConstructStats":{
        "val":"$$ROOT",
        "sampleRate":Double(0.001),
        "numberBuckets":100
    }
}}}]

This pipeline produces a plan with MULTI_ITERATOR and sampleFromRandomCursor before the $project and $group, and it runs substantially faster.
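
As a rough illustration of the proposed pipeline shape, here is a minimal sketch in plain JavaScript. The helper name buildAnalyzePipeline and its parameters are hypothetical and are not part of the server code; estimatedCount stands in for db.foo.estimatedDocumentCount(). The sketch also assumes the sample size is rounded to an integer and clamped to at least 1, since $sample expects a positive integer size; the ticket itself does not say how the product should be rounded.

```javascript
// Hypothetical helper (not server code): builds the pipeline shape proposed in this ticket.
// `estimatedCount` is assumed to come from db.foo.estimatedDocumentCount().
function buildAnalyzePipeline(key, sampleRate, estimatedCount, numberBuckets = 100) {
  // Stages that analyze already generates today, per the ticket description.
  const statsStages = [
    { $project: { val: "$" + key } },
    {
      $group: {
        _id: key,
        statistics: {
          $_internalConstructStats: {
            val: "$$ROOT",
            sampleRate: sampleRate,
            numberBuckets: numberBuckets,
          },
        },
      },
    },
  ];

  // Only prepend $sample when a fractional rate was requested;
  // otherwise keep the full scan.
  if (!(sampleRate > 0 && sampleRate < 1)) {
    return statsStages;
  }

  // Assumption: round to an integer and never go below 1,
  // because $sample requires a positive integer size.
  const size = Math.max(1, Math.round(sampleRate * estimatedCount));
  return [{ $sample: { size: size } }, ...statsStages];
}
```

With a 0.001 rate on a collection of roughly one million documents, this would prepend a $sample stage of size 1000, so the $project and $group only see the sampled documents instead of the whole collection.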

Related to: SERVER-71951 Identify best approach to implement sampling for analyze command (Closed)

Assignee: Unassigned
Reporter: Philip Stoev
Participants: Philip Stoev
Votes: 0
Watchers: 2

Created: Jan 21 2025 08:35:00 AM UTC
Updated: Jun 25 2025 06:47:03 PM UTC
