Add tokenize() function on the ValueBlock interface


    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 7.2.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Fully Compatible

      Add a method to the ValueBlock interface that "tokenizes" the input, that is, identifies the set of unique items in the block. It should return:
      - An array of the unique tokens, stored in an sbe::Array
      - A vector of N integers, where N is the size of the input block. Each integer is the index into the token array of the value at that input position.

      For example:

      Input ValueBlock

      ["foo", "bar", "baz", "bar", "bar", 999, "foo"]
      

      ValueBlock->tokenize() returns:

      tokens: ["foo", "bar", "baz", 999]
      // 0 corresponds to foo, 1 corresponds to bar, etc
      values: [0, 1, 2, 1, 1, 3, 0]
      

      The default implementation can use a basic hash-based algorithm. It must use the same hasher that the HashAgg stage uses.
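
A minimal sketch of the hash-based approach, to make the expected output concrete. It is not the SBE implementation: `Value`, `TokenizedBlock`, and the free function `tokenize` are hypothetical stand-ins, the tokens live in a `std::vector` rather than an sbe::Array, and `std::hash` is used in place of the HashAgg hasher.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Hypothetical stand-in for an SBE value: a string or a 64-bit integer.
using Value = std::variant<std::string, long long>;

// Hypothetical result type: the unique tokens plus one token index per input value.
struct TokenizedBlock {
    std::vector<Value> tokens;
    std::vector<std::size_t> indices;
};

// Assigns each distinct value a token index in first-seen order.
// Assumption: std::hash replaces the hasher that HashAgg uses.
TokenizedBlock tokenize(const std::vector<Value>& block) {
    TokenizedBlock out;
    out.indices.reserve(block.size());

    // Maps each value seen so far to its position in out.tokens.
    std::unordered_map<Value, std::size_t> seen;
    for (const Value& v : block) {
        // try_emplace only inserts when v is new, so the candidate index
        // (the current token count) is used exactly once per distinct value.
        auto [it, inserted] = seen.try_emplace(v, out.tokens.size());
        if (inserted) {
            out.tokens.push_back(v);
        }
        out.indices.push_back(it->second);
    }
    return out;
}
```

Called on the example input above, this sketch yields four tokens ("foo", "bar", "baz", 999) and the indices [0, 1, 2, 1, 1, 3, 0]. Assigning indices in first-seen order is a choice made here for readability; the ticket does not require any particular token order.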

      We should also add an optimized implementation for MonoBlock. Eventually we will add optimized versions for Homogeneous blocks and possibly RLE-compressed blocks.
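
A sketch of why MonoBlock admits a fast path, assuming a MonoBlock represents a single value logically repeated N times (the ticket does not define it, so treat this as an assumption). `MonoBlockSketch`, `MonoTokens`, and `tokenizeMono` are hypothetical names, and a string stands in for the SBE value.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a MonoBlock: one value logically repeated `count` times.
struct MonoBlockSketch {
    std::string value;
    std::size_t count;
};

// Hypothetical result type, mirroring the tokens/indices pair described in the ticket.
struct MonoTokens {
    std::vector<std::string> tokens;
    std::vector<std::size_t> indices;
};

// No hashing is needed: there is at most one distinct value, so the result is
// a single token and `count` copies of index 0.
MonoTokens tokenizeMono(const MonoBlockSketch& block) {
    MonoTokens out;
    if (block.count == 0) {
        return out;  // Empty block: no tokens and no indices.
    }
    out.tokens.push_back(block.value);
    out.indices.assign(block.count, 0);
    return out;
}
```

Under that assumption, tokenizing skips the hash table entirely and costs only the fill of the index vector.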

            Assignee:
            Parker Felix
            Reporter:
            Ian Boros