Core Server / SERVER-71951

Identify best approach to implement sampling for analyze command

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 6.3.0-rc0
    • None
    • None
    • None
    • Fully Compatible
    • QO 2022-12-12, QO 2022-12-26, QO 2023-01-09, QO 2023-01-23

    Description

      Currently the analyze command runs a full collection scan and keeps all the fields for the analyzed path in memory. This will exceed the 100 MB limit for large collections. To allow analyze to run on collections of all sizes, sampling needs to be introduced.
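As a rough illustration of the limit described above, the sketch below estimates the memory needed to hold one value per document for the analyzed path, and the largest sampling ratio that would still fit. The function names, the average value size, and the assumption that memory grows linearly with the number of documents are all hypothetical. The actual memory accounting inside analyze is not described in this ticket.

```javascript
// Hypothetical back-of-the-envelope estimate; not the server's actual accounting.
// Assumption: the 100 MB limit is treated here as 100 * 1024 * 1024 bytes.
const MEMORY_LIMIT_BYTES = 100 * 1024 * 1024;

// Estimated bytes needed to hold one value per document for the analyzed path.
function estimateBytes(numDocs, avgValueBytes) {
  return numDocs * avgValueBytes;
}

// Largest sampling ratio (capped at 1) whose sampled values fit within the limit.
function maxSampleRatio(numDocs, avgValueBytes, limitBytes = MEMORY_LIMIT_BYTES) {
  const full = estimateBytes(numDocs, avgValueBytes);
  return full <= limitBytes ? 1 : limitBytes / full;
}
```

Under these assumptions, a collection whose values would occupy twice the limit could be sampled at a ratio of at most 0.5, while a collection that already fits needs no sampling at all.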

      There are several approaches to adding sampling to the analyze pipeline:

      1. Use the $sample stage. This approach works but needs memory for the in-memory sort if the sample size is more than 5% of the collection size. Because the current memory limit for the pipeline is 100 MB, this leaves less memory for storing the values used to build histograms.
      2. Use { $match: { $expr: { $gt: [<sample ratio>, { $rand: {} }] } } }. This approach will not use extra memory.
      3. Implement a custom $sample stage that keeps track of used memory and therefore will not generate an out-of-memory error.
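The first two approaches can be sketched as pipeline builders, shown here as a minimal illustration rather than the implementation chosen for analyze. The helper names samplePipeline, randMatchPipeline, and bernoulliSample are hypothetical. The comparison operator $gt is an assumption, chosen so that each document is kept with probability equal to the sample ratio. The last helper simulates approach 2 in plain JavaScript, with Math.random standing in for $rand.

```javascript
// Approach 1: the built-in $sample stage with a fixed number of documents.
function samplePipeline(sampleSize) {
  return [{ $sample: { size: sampleSize } }];
}

// Approach 2: keep each document independently with probability `ratio`.
// $rand yields a random float between 0 and 1, so `ratio > $rand`
// is true for roughly a `ratio` fraction of the documents.
function randMatchPipeline(ratio) {
  return [{ $match: { $expr: { $gt: [ratio, { $rand: {} }] } } }];
}

// Plain-JavaScript simulation of approach 2 (hypothetical helper).
// `rand` defaults to Math.random and can be replaced for deterministic runs.
function bernoulliSample(docs, ratio, rand = Math.random) {
  return docs.filter(() => ratio > rand());
}
```

In mongosh, either pipeline could be passed to db.<collection>.aggregate(...). Note that approach 2 does not return a fixed number of documents: the sample size varies around ratio × collection size, which is the trade-off for not needing a sort.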

      The objective of this ticket is to experiment with approaches 1, 2, and 3 (and possibly others) to find the best way to implement sampling.

          People

            ben.shteinfeld@mongodb.com Ben Shteinfeld
            misha.tyulenev@mongodb.com Misha Tyulenev
