Core Server / SERVER-22815

Investigate if there is a better cutoff for an optimized $sample than 5% of the collection



    • Type: Task
    • Status: Open
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: Backlog
    • Component/s: Aggregation Framework
    • Labels:
    • Backwards Compatibility: Fully Compatible


When SERVER-19182 was implemented, we chose 5% as the cutoff at which we switch from the optimized $sampleFromRandomCursor stage to the normal $sample implementation.

The $sampleFromRandomCursor implementation performs repeated random walks over a tree (in currently supported storage engines), whereas the $sample implementation performs a full collection scan followed by a top-k sort on an injected random value.
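The scan-based plan above can be sketched as "assign each document a uniform random key, keep the k smallest": a standard way to draw a uniform sample without replacement in a single pass. This is an illustrative Python stand-in, not the server's C++ code; the name sample_topk is ours.

```python
import heapq
import random

def sample_topk(collection, k):
    # Full scan: pair each document with a uniform random key,
    # then keep the k documents with the smallest keys.
    # This yields a uniform sample of k documents without replacement.
    pairs = ((random.random(), doc) for doc in collection)
    smallest = heapq.nsmallest(k, pairs, key=lambda pair: pair[0])
    return [doc for _, doc in smallest]
```

Because the heap holds only k entries, the cost is one sequential pass plus an O(n log k) sort, which is why its access pattern is dominated by large sequential reads.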

It is thought that the $sample implementation will be faster beyond a certain threshold percentage. A collection scan likely has a data access pattern of large sequential reads, whereas the random tree walks issue many random point accesses. Especially on spinning disks, the former becomes more appealing as the sample covers a larger and larger percentage of the collection.

We should benchmark to determine whether 5% is a good cutoff across a variety of setups. The answer will likely depend on at least the following factors:

      • Storage engine
      • Type of disk
      • Amount of memory
      • Number of documents in the collection
      • Size of documents

It may be hard to find a single number suited to all combinations, but there may still be a better choice than 5%.



