[SERVER-71951] Identify best approach to implement sampling for analyze command Created: 07/Dec/22  Updated: 29/Oct/23  Resolved: 09/Jan/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.3.0-rc0

Type: Task Priority: Major - P3
Reporter: Misha Tyulenev Assignee: Ben Shteinfeld
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-72614 Implement sampling for histogram com... Closed
Backwards Compatibility: Fully Compatible
Sprint: QO 2022-12-12, QO 2022-12-26, QO 2023-01-09, QO 2023-01-23
Participants:

 Description   

Currently the analyze command runs a full collection scan and keeps all values for the analyzed path in memory. For large collections this will exceed the 100 MB memory limit. To allow analyze to run on collections of any size, sampling needs to be introduced.

There are several approaches to adding sampling to the analyze pipeline:

  1. Use the $sample stage. This approach works, but it needs memory for an in-memory sort if the sample size is more than 5% of the collection size. Because the current memory limit for the pipeline is 100 MB, this leaves less memory for storing the values used to build histograms.
  2. Use { $match: { $expr: { $gt: [<sample ratio>, { $rand: {} }] } } }. This approach does not use extra memory.
  3. Implement a custom $sample stage that keeps track of used memory and therefore will not generate an out-of-memory error.
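
For comparison, options 1 and 2 can be sketched as pipeline documents in Python (dict form, as a driver such as PyMongo would accept them). This is only an illustration, not the analyze command's actual pipeline: the function names are hypothetical, the comparison in option 2 is written as $gt (keep the document when the ratio exceeds the random draw), and simulate_rand_match is a local stand-in for what the $rand-based $match does, so the sketch runs without a server.

```python
import random


def sample_stage_pipeline(sample_size):
    """Option 1 (hypothetical helper): built-in $sample with a fixed size."""
    return [{"$sample": {"size": sample_size}}]


def rand_match_pipeline(sample_ratio):
    """Option 2 (hypothetical helper): keep each document independently
    with probability sample_ratio by comparing it to a $rand draw."""
    return [{"$match": {"$expr": {"$gt": [sample_ratio, {"$rand": {}}]}}}]


def simulate_rand_match(docs, sample_ratio, seed=None):
    """Local simulation of option 2: one uniform draw in [0, 1) per
    document, no buffering of the input, so no extra memory is needed
    beyond the documents that pass the filter."""
    rng = random.Random(seed)
    return [doc for doc in docs if sample_ratio > rng.random()]


if __name__ == "__main__":
    docs = [{"_id": i} for i in range(10_000)]
    kept = simulate_rand_match(docs, 0.1, seed=42)
    print(rand_match_pipeline(0.1))
    print(f"kept {len(kept)} of {len(docs)} documents")
```

One consequence of option 2 visible in the sketch: the sample size is not fixed. It varies around ratio × collection size from run to run, whereas $sample returns the requested number of documents.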

The objective of this ticket is to experiment with approaches 1, 2, and 3 (and possibly others) to find the best way to implement sampling.
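
As a starting point for experimenting with option 3 outside the server, the idea of a sampler that tracks its own memory use can be sketched as reservoir sampling with a byte budget. Everything here is an assumption made for illustration: the class name BoundedReservoir, its parameters, and the eviction policy are hypothetical and do not describe the stage that was eventually implemented. Note also that when value sizes vary, shrinking the reservoir based on size makes the sample only approximately uniform.

```python
import random
import sys


class BoundedReservoir:
    """Reservoir sampler (Algorithm R) that also tracks the bytes it holds
    and shrinks itself instead of exceeding max_bytes."""

    def __init__(self, capacity, max_bytes, seed=None, size_of=sys.getsizeof):
        self.capacity = capacity    # current target sample size
        self.max_bytes = max_bytes  # memory budget for sampled values
        self.size_of = size_of      # size estimator, replaceable for tests
        self.rng = random.Random(seed)
        self.items = []
        self.used_bytes = 0
        self.seen = 0

    def _evict_random(self):
        # Swap-remove a uniformly chosen item and release its bytes.
        idx = self.rng.randrange(len(self.items))
        self.items[idx], self.items[-1] = self.items[-1], self.items[idx]
        self.used_bytes -= self.size_of(self.items.pop())

    def add(self, value):
        self.seen += 1
        if len(self.items) < self.capacity:
            # Fill phase: keep everything until the reservoir is full.
            self.items.append(value)
            self.used_bytes += self.size_of(value)
        else:
            # Replacement phase: keep the new value with probability
            # capacity / seen, overwriting a random slot.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.used_bytes += self.size_of(value) - self.size_of(self.items[j])
                self.items[j] = value
        # Enforce the budget by shrinking the reservoir. A single value
        # larger than the budget drains it completely; a real stage would
        # need an explicit policy for that case.
        while self.items and self.used_bytes > self.max_bytes:
            self._evict_random()
            self.capacity = len(self.items)


if __name__ == "__main__":
    sampler = BoundedReservoir(capacity=100, max_bytes=50, seed=1, size_of=len)
    for i in range(1000):
        sampler.add(f"{i:05d}")  # every value is 5 characters long
    print(len(sampler.items), sampler.used_bytes)
```

The trade-off relative to option 2 is that the sampler never runs out of memory regardless of the chosen ratio, at the cost of a sample size that may end up smaller than requested.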



 Comments   
Comment by Ben Shteinfeld [ 09/Jan/23 ]

Decided to move forward with the prototype. Will continue the discussion on the PR and commit under SERVER-72614.

Generated at Thu Feb 08 06:20:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.