[SERVER-81515] Add tokenize() function on the ValueBlock interface Created: 27/Sep/23  Updated: 29/Oct/23  Resolved: 12/Oct/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.2.0-rc0

Type: Task
Priority: Major - P3
Reporter: Ian Boros
Assignee: Parker Felix
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Participants:

Description

Add a method to the ValueBlock interface which "tokenizes" the input. That is, it identifies the set of unique items in the block. It should return:
- An array of tokens in an sbe::Array
- A vector of N integers, where N is the size of the input block. Each integer is an index into the token array.

For example:

Input ValueBlock

["foo", "bar", "baz", "bar", "bar", 999, "foo"]

ValueBlock->tokenize() returns:

tokens: ["foo", "bar", "baz", 999]
// 0 corresponds to foo, 1 corresponds to bar, etc
values: [0, 1, 2, 1, 1, 3, 0]

The default implementation can use a basic hashing algorithm (make sure to use the same hasher that the HashAgg stage uses).
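A minimal sketch of the hashing approach, reproducing the example above. The types here are hypothetical stand-ins, not the server's API: `Value` substitutes for an SBE value, `TokenizedBlock` for the returned pair (its `indices` field corresponds to `values` in the example), and `std::hash` substitutes for the HashAgg hasher that the real implementation should reuse.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Hypothetical stand-in for an SBE value: a string or a 64-bit integer.
using Value = std::variant<std::string, long long>;

// Hypothetical result type: the unique tokens plus one index per input element.
struct TokenizedBlock {
    std::vector<Value> tokens;         // unique values, in order of first appearance
    std::vector<std::size_t> indices;  // indices[i] is the position of block[i] in tokens
};

// Assigns each distinct value a token index the first time it is seen.
// std::hash is an assumption for this sketch; the ticket asks for the HashAgg hasher.
TokenizedBlock tokenize(const std::vector<Value>& block) {
    TokenizedBlock out;
    out.indices.reserve(block.size());
    std::unordered_map<Value, std::size_t> seen;  // value -> token index
    for (const Value& v : block) {
        auto [it, inserted] = seen.try_emplace(v, out.tokens.size());
        if (inserted) {
            out.tokens.push_back(v);
        }
        out.indices.push_back(it->second);
    }
    return out;
}
```

Numbering tokens by first appearance gives the ordering shown in the example, and the hash map keeps the pass at expected linear time in the block size.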

We should also have a special, optimized implementation for MonoBlock. Eventually we will add optimized versions for Homogeneous blocks and possibly RLE-compressed blocks.
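A rough sketch of why a MonoBlock specialization can skip hashing entirely, assuming a MonoBlock represents a single value logically repeated N times. The `MonoBlock`, `Value`, and `TokenizedBlock` types below are hypothetical models for illustration, and the empty-block behavior is an assumption, not something the ticket specifies.

```cpp
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// Hypothetical stand-in for an SBE value: a string or a 64-bit integer.
using Value = std::variant<std::string, long long>;

// Hypothetical result type: the unique tokens plus one index per input element.
struct TokenizedBlock {
    std::vector<Value> tokens;
    std::vector<std::size_t> indices;
};

// Hypothetical model of a MonoBlock: one value logically repeated `count` times.
struct MonoBlock {
    Value value;
    std::size_t count;

    // With only one distinct value, there is exactly one token and every
    // index is 0, so no hash table or comparisons are needed.
    TokenizedBlock tokenize() const {
        TokenizedBlock out;
        if (count == 0) {
            return out;  // assumption: an empty block yields no tokens
        }
        out.tokens.push_back(value);
        out.indices.assign(count, 0);
        return out;
    }
};
```

For example, a block of five copies of "foo" would tokenize to a single token with indices `[0, 0, 0, 0, 0]`.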



Comments
Comment by Githook User [ 12/Oct/23 ]

Author: Parker Felix <parker.felix@mongodb.com> (parker-felix)

Message: SERVER-81515 Add tokenize() to the ValueBlock interface
Branch: master
https://github.com/mongodb/mongo/commit/88a742c5169970d9b7fdd1a236198c5132ce7fe6

Comment by Ian Boros [ 27/Sep/23 ]

Assigning this to Parker for whenever ongoing work finishes. There is no rush on this one either.
