Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-87134

Consider using roaring bitmaps for RecordId deduplication in IndexScan stages

    • Type: Icon: New Feature New Feature
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 8.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Optimization
    • Fully Compatible
    • QO 2024-03-04, QO 2024-03-18, QO 2024-04-01

      Deduplication of record ids in index scans can use unbounded memory in the query subsystem. While we consider longer term solutions to this problem, we should consider whether we can gain some immediate benefit from choosing a different data structure for deduplication.

      From a related slack thread on using roaring bitmaps:

      I read the first half or so of  this paper, and it looks encouraging. Briefly, the space of 32-bit integers is divided into 2^16 groups of consecutive integers. If more than 1/4 of the integers in any given group are “present”, then the bitmap will represent membership of the set in that group more efficiently than a hashtable. Otherwise, roaring uses a sorted array of integers to represent that group, which is about as much storage as a hash table. A further optimization involves not having to represent groups at all when they are empty or full (I think?) and a further optimization lets you represent long runs of empty or full using run-length encoding.

      So, if I read this right, roaring bitmaps would save us space if, in the cases where record id sets get very large, it’s common for the record ids present in the set to be densely packed as indicated above. Otherwise, the roaring set will be (approximately) no worse than an idealized hash table for RAM consumption.

            Assignee:
            alexander.ignatyev@mongodb.com Alexander Ignatyev
            Reporter:
            matt.broadstone@mongodb.com Matt Broadstone
            Votes:
            1 Vote for this issue
            Watchers:
            22 Start watching this issue

              Created:
              Updated:
              Resolved: