  1. Core Server
  2. SERVER-71951

Identify best approach to implement sampling for analyze command

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 6.3.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • Fully Compatible
    • QO 2022-12-12, QO 2022-12-26, QO 2023-01-09, QO 2023-01-23

      Currently the analyze command runs a full collection scan and keeps all the fields for the analyzed path in memory. For large collections this will exceed the 100 MB limit. To allow analyze to run on collections of any size, sampling needs to be introduced.

      There are several approaches to adding sampling to the analyze pipeline:

      1. Use the $sample stage. This approach will work, but it needs memory for an in-memory sort if the sample size is > 5% of the collection size. Because the current memory limit for the pipeline is 100 MB, this leaves less memory for storing the values to build histograms on.
      2. Use { $match: { $expr: { $gt: [<sample ratio>, { $rand: {} }] } } }. This approach will not use extra memory.
      3. Implement a custom $sample stage that keeps track of used memory and therefore will not generate an out-of-memory error.
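      As a rough illustration of approaches 1 and 2, the sketch below builds the two pipeline shapes and simulates the per-document coin flip of approach 2 locally. The helper names (samplePipeline, randMatchPipeline, bernoulliSample) are hypothetical and do not come from the server code. The $match form is written with $gt on the assumption that a document should be kept when the sample ratio exceeds the random draw, so the resulting sample size is only approximately ratio × collection size.

```javascript
// Hypothetical helpers for illustration; not part of the server code.

// Approach 1: the built-in $sample stage takes an absolute document count.
function samplePipeline(sampleSize) {
  return [{ $sample: { size: sampleSize } }];
}

// Approach 2: Bernoulli sampling. $rand yields a float in [0, 1), so each
// document passes the $match with probability sampleRatio.
function randMatchPipeline(sampleRatio) {
  if (!(sampleRatio > 0 && sampleRatio <= 1)) {
    throw new RangeError("sampleRatio must be in (0, 1]");
  }
  return [{ $match: { $expr: { $gt: [sampleRatio, { $rand: {} }] } } }];
}

// Local simulation of approach 2: one independent draw per document and no
// buffering, which is why this approach needs no extra memory.
function bernoulliSample(docs, sampleRatio, rand = Math.random) {
  return docs.filter(() => sampleRatio > rand());
}
```

      In mongosh, either pipeline could be passed to db.<collection>.aggregate(...) ahead of the stages that collect values for the histogram.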

      The objective of this ticket is to experiment with approaches 1, 2, and 3 (and possibly others) to find the best way to implement sampling.
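      For experimenting with approach 3, one possible shape of a memory-tracking sampler is reservoir sampling (Algorithm R) with a byte budget, sketched below. This is an assumption about how such a stage might behave, not the design chosen in this ticket. The class name BoundedReservoir is hypothetical, and the size estimate uses the JSON string length as a stand-in for the BSON size a real stage would track. Because evictions are triggered by value sizes, the result is only approximately a uniform sample.

```javascript
// Hypothetical sketch of a memory-bounded sampler; not server code.
class BoundedReservoir {
  constructor(maxBytes, rand = Math.random) {
    this.maxBytes = maxBytes;   // memory budget for the sample
    this.rand = rand;           // source of floats in [0, 1)
    this.items = [];            // current sample
    this.bytes = 0;             // estimated bytes held by the sample
    this.seen = 0;              // number of values considered so far
    this.capacity = Infinity;   // shrinks once the budget is first hit
  }

  // Crude size estimate; assumes JSON-serializable values.
  static sizeOf(value) {
    return JSON.stringify(value).length;
  }

  add(value) {
    const size = BoundedReservoir.sizeOf(value);
    if (size > this.maxBytes) return; // a single oversized value is skipped
    this.seen += 1;
    if (this.items.length < this.capacity) {
      // Still filling: keep every value.
      this.items.push(value);
      this.bytes += size;
    } else {
      // Algorithm R: keep the new value with probability capacity / seen.
      const j = Math.floor(this.rand() * this.seen);
      if (j < this.capacity) {
        this.bytes += size - BoundedReservoir.sizeOf(this.items[j]);
        this.items[j] = value;
      }
    }
    // Over budget: evict random values and lower the capacity to match.
    while (this.bytes > this.maxBytes && this.items.length > 0) {
      const k = Math.floor(this.rand() * this.items.length);
      this.bytes -= BoundedReservoir.sizeOf(this.items[k]);
      this.items[k] = this.items[this.items.length - 1];
      this.items.pop();
      this.capacity = this.items.length;
    }
  }
}
```

      The point of the design is that the sample never grows past the budget, so the sampler degrades to a smaller sample instead of failing with an out-of-memory error.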

            Assignee:
            ben.shteinfeld@mongodb.com Ben Shteinfeld
            Reporter:
            misha.tyulenev@mongodb.com Misha Tyulenev (Inactive)
            Votes:
            0
            Watchers:
            5

              Created:
              Updated:
              Resolved: