  1. Core Server
  2. SERVER-81515

Add tokenize() function on the ValueBlock interface

Details

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 7.2.0-rc0
    • None
    • None
    • None
    • Fully Compatible

    Description

      Add a method to the ValueBlock interface that "tokenizes" the input, that is, identifies the set of unique items in the block. It should return:
      - An array of the unique tokens, in an sbe::Array
      - A vector of N integers, where N is the size of the input block. Each integer is an index into the token array.

      For example:

      Input ValueBlock

      ["foo", "bar", "baz", "bar", "bar", 999, "foo"]
      

      ValueBlock->tokenize() returns:

      tokens: ["foo", "bar", "baz", 999]
      // each entry is an index into tokens: 0 corresponds to "foo", 1 to "bar", etc.
      values: [0, 1, 2, 1, 1, 3, 0]
      

      The default implementation can use a basic hashing algorithm (make sure to use the same hasher that the HashAgg stage uses).
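
The hash-based default described above could look roughly like the following. This is only a sketch under stated assumptions, not the server's implementation: `Value`, `Tokenized`, and the free function `tokenize` are hypothetical stand-ins for the SBE types and the ValueBlock method, and `std::hash` on a `std::variant` is a placeholder for the HashAgg hasher. It requires C++17. Tokens are emitted in first-seen order, which matches the example in this ticket.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Hypothetical stand-in for an SBE value: a string or a 64-bit integer.
using Value = std::variant<std::string, long long>;

// Hypothetical result type: the unique tokens plus one token index per input element.
struct Tokenized {
    std::vector<Value> tokens;
    std::vector<std::size_t> indices;
};

// Hash-based tokenization. Tokens appear in the order they are first seen.
Tokenized tokenize(const std::vector<Value>& block) {
    Tokenized out;
    out.indices.reserve(block.size());

    // Maps each distinct value to its position in out.tokens.
    // Assumption: std::hash<Value> stands in for the HashAgg hasher here.
    std::unordered_map<Value, std::size_t> seen;

    for (const Value& v : block) {
        // try_emplace inserts only if v is new; otherwise it returns the existing entry.
        auto [it, inserted] = seen.try_emplace(v, out.tokens.size());
        if (inserted) {
            out.tokens.push_back(v);
        }
        out.indices.push_back(it->second);
    }

    // Postcondition from the ticket: one index per input element.
    assert(out.indices.size() == block.size());
    return out;
}
```

The original block can be reconstructed by looking up `tokens[indices[i]]` for each position `i`, which is a convenient property to check in unit tests.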

      We should also have a special, optimized implementation for MonoBlock. Eventually we will add optimized versions for Homogeneous blocks and possibly RLE-compressed blocks.
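
To illustrate why a MonoBlock specialization can skip hashing entirely, here is a minimal sketch. It assumes a MonoBlock is a single value logically repeated N times; `MonoBlockSketch`, `MonoValue`, `MonoTokenized`, and `tokenizeMono` are hypothetical names, not the server's types. Returning no tokens for an empty block is also an assumption, since the ticket does not specify that case.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for an SBE value.
using MonoValue = std::variant<std::string, long long>;

// Hypothetical model of a MonoBlock: one value logically repeated `count` times.
struct MonoBlockSketch {
    MonoValue value;
    std::size_t count;
};

// Hypothetical result type: the unique tokens plus one token index per element.
struct MonoTokenized {
    std::vector<MonoValue> tokens;
    std::vector<std::size_t> indices;
};

// No hashing needed: there is at most one distinct value, so every index is 0.
MonoTokenized tokenizeMono(const MonoBlockSketch& block) {
    MonoTokenized out;
    if (block.count > 0) {
        out.tokens.push_back(block.value);
    }
    // assign(n, 0) fills the vector with `count` zeros.
    out.indices.assign(block.count, 0);
    assert(out.tokens.size() <= 1);
    return out;
}
```

The cost here is a single value copy plus filling the index vector, with no hash computations or comparisons per element.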

          People

            parker.felix@mongodb.com Parker Felix
            ian.boros@mongodb.com Ian Boros
