-
Type: Task
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: None
-
Fully Compatible
Add a method to the ValueBlock interface which "tokenizes" the input. That is, it identifies the set of unique items in the block. It should return:
- An array of tokens in an sbe::Array
- A vector of N integers, where N is the size of the input block. Each integer is an index into the token array.
For example:
Input ValueBlock
["foo", "bar", "baz", "bar", "bar", 999, "foo"]
ValueBlock->tokenize() returns:
tokens: ["foo", "bar", "baz", 999] // 0 corresponds to foo, 1 corresponds to bar, etc.
values: [0, 1, 2, 1, 1, 3, 0]
The default implementation can use a basic hashing algorithm (make sure to use the same hasher that the HashAgg stage uses).
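A minimal sketch of the hash-based default, for illustration only. It does not use the real SBE types: `Value` is a hypothetical stand-in built on `std::variant`, `TokenizedBlock` and the free function `tokenize` are made-up names, and `std::hash` is a placeholder for whatever hasher the HashAgg stage uses. The index vector that the example above labels "values" is called `indexes` here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Hypothetical stand-in for an SBE value; the real code works on tag/value pairs.
using Value = std::variant<long long, std::string>;

// Hypothetical result type: the real method would return an sbe::Array of tokens.
struct TokenizedBlock {
    std::vector<Value> tokens;         // unique values, in first-seen order
    std::vector<std::size_t> indexes;  // indexes[i] is the position of input[i] in tokens
};

// Hypothetical free function; the ticket asks for a method on ValueBlock.
TokenizedBlock tokenize(const std::vector<Value>& input) {
    TokenizedBlock out;
    out.indexes.reserve(input.size());

    // Maps each distinct value to its token index.
    // std::hash<Value> is a placeholder for the hasher shared with HashAgg.
    std::unordered_map<Value, std::size_t> seen;

    for (const Value& v : input) {
        // If v is new, it is assigned the next free token index.
        auto [it, inserted] = seen.try_emplace(v, out.tokens.size());
        if (inserted) {
            out.tokens.push_back(v);
        }
        out.indexes.push_back(it->second);
    }

    // One index per input element, as the ticket requires.
    assert(out.indexes.size() == input.size());
    return out;
}
```

Assigning token indexes in first-seen order is what yields the `[0, 1, 2, 1, 1, 3, 0]` result in the example above. The ticket does not say whether that order is required.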
We should also add an optimized implementation for MonoBlock. Eventually we will add optimized versions for Homogeneous blocks and possibly RLE-compressed blocks.
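As a rough illustration of why a MonoBlock specialization can skip hashing, here is a sketch that assumes a MonoBlock is a single value logically repeated a fixed number of times. `MonoBlockSketch`, `Value`, and `TokenizedBlock` are hypothetical names rather than the real SBE classes, and the handling of an empty block is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for an SBE value.
using Value = std::variant<long long, std::string>;

// Hypothetical result type, mirroring the two outputs described in the ticket.
struct TokenizedBlock {
    std::vector<Value> tokens;         // unique values
    std::vector<std::size_t> indexes;  // one token index per input element
};

// Hypothetical model of a MonoBlock: one value repeated `count` times.
struct MonoBlockSketch {
    Value value;
    std::size_t count;

    TokenizedBlock tokenize() const {
        TokenizedBlock out;
        if (count == 0) {
            // Assumption: an empty block yields no tokens and no indexes.
            return out;
        }
        // Every element is the same value, so there is exactly one token
        // and every index is 0. No hashing or comparison is needed.
        out.tokens.push_back(value);
        out.indexes.assign(count, 0);
        assert(out.tokens.size() == 1);
        return out;
    }
};
```

Under these assumptions, the only per-element work is filling the index vector with zeros.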