Details
Task
Resolution: Fixed
Major - P3
Fully Compatible
Description
Add a method to the ValueBlock interface which "tokenizes" the input. That is, it identifies the set of unique items in the block. It should return:
- An array of tokens in an sbe::Array
- A vector of N integers, where N is the size of the input block. Each integer is an index into the token array.
For example:
Input ValueBlock:
["foo", "bar", "baz", "bar", "bar", 999, "foo"]
ValueBlock->tokenize() returns:
tokens: ["foo", "bar", "baz", 999]
// 0 corresponds to foo, 1 corresponds to bar, etc.
values: [0, 1, 2, 1, 1, 3, 0]
The default implementation can use a basic hashing algorithm (make sure to use the same hasher that the HashAgg stage uses).
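The hashing approach described above could look roughly like the following. This is only a sketch under stated assumptions: `Value`, `TokenizedBlock`, and the free function `tokenize` are hypothetical stand-ins (a `std::variant` replaces real SBE values, a `std::vector` replaces sbe::Array), and `std::hash` is used in place of the HashAgg hasher, which the real implementation should share.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Hypothetical stand-in for an SBE value: a string or a 64-bit integer.
using Value = std::variant<std::string, long long>;

// Hypothetical result type: the unique tokens plus one index per input value.
struct TokenizedBlock {
    std::vector<Value> tokens;         // unique values, in first-seen order
    std::vector<std::size_t> indices;  // indices[i] is the token index of input[i]
};

// Assigns each distinct value the next free token index the first time it is
// seen, then records that index for every occurrence.
// Assumption: std::hash stands in for the hasher used by the HashAgg stage.
TokenizedBlock tokenize(const std::vector<Value>& input) {
    TokenizedBlock out;
    out.indices.reserve(input.size());

    std::unordered_map<Value, std::size_t> seen;
    for (const Value& v : input) {
        // try_emplace inserts only if v is new; either way `it` points at its index.
        auto [it, inserted] = seen.try_emplace(v, out.tokens.size());
        if (inserted) {
            out.tokens.push_back(v);
        }
        out.indices.push_back(it->second);
    }

    assert(out.indices.size() == input.size());
    return out;
}
```

Tokens are kept in first-seen order here so that the result matches the example in the description; the ticket itself does not state whether that ordering is required.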
We should also have an optimized special implementation for MonoBlock. Eventually we will add optimized versions for Homogeneous blocks and possibly RLE-compressed blocks.
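A MonoBlock fast path might be sketched as below. It assumes a MonoBlock represents a single value logically repeated N times, which the ticket does not spell out; `MonoBlockSketch`, `Value`, `TokenizedBlock`, and `tokenizeMono` are hypothetical names used only for illustration. Under that assumption no hashing is needed: there is at most one token, and every index is 0.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for an SBE value: a string or a 64-bit integer.
using Value = std::variant<std::string, long long>;

// Hypothetical result type: the unique tokens plus one index per input value.
struct TokenizedBlock {
    std::vector<Value> tokens;
    std::vector<std::size_t> indices;
};

// Hypothetical simplification of a MonoBlock: one value repeated `count` times.
struct MonoBlockSketch {
    Value value;
    std::size_t count;
};

// Skips hashing entirely: a non-empty block yields exactly one token,
// and all `count` indices point at it.
TokenizedBlock tokenizeMono(const MonoBlockSketch& block) {
    TokenizedBlock out;
    if (block.count == 0) {
        return out;  // empty block: no tokens, no indices
    }
    out.tokens.push_back(block.value);
    out.indices.assign(block.count, 0);
    return out;
}
```

For a block holding "foo" five times, this would produce tokens ["foo"] and values [0, 0, 0, 0, 0].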