-
Type: New Feature
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
None
-
Query Optimization
-
Fully Compatible
-
QO 2024-03-04, QO 2024-03-18, QO 2024-04-01
Deduplication of record ids in index scans can use unbounded memory in the query subsystem. While we consider longer term solutions to this problem, we should consider whether we can gain some immediate benefit from choosing a different data structure for deduplication.
From a related slack thread on using roaring bitmaps:
I read the first half or so of this paper, and it looks encouraging. Briefly, the space of 32-bit integers is divided into 2^16 groups of consecutive integers. If more than 1/4 of the integers in any given group are “present”, then the bitmap will represent membership of the set in that group more efficiently than a hashtable. Otherwise, roaring uses a sorted array of integers to represent that group, which is about as much storage as a hash table. A further optimization involves not having to represent groups at all when they are empty or full (I think?) and a further optimization lets you represent long runs of empty or full using run-length encoding.
So, if I read this right, roaring bitmaps would save us space if, in the cases where record id sets get very large, it’s common for the record ids present in the set to be densely packed as indicated above. Otherwise, the roaring set will be (approximately) no worse than an idealized hash table for RAM consumption.
- is related to
-
SERVER-24375 Deduping in OR, SORT_MERGE, and IXSCAN (multikey case) uses unbounded memory
- Backlog
- split to
-
SERVER-87976 Use roaring bitmaps for RecordId deduplication in IndexScan stage
- Closed