Priority: Major - P3
Affects Version/s: None
Fix Version/s: WT2.7.0
Trying to capture the fundamental requirements here in generic terms here, would be happy to do even more in-detail practical examples etc. as required.
One common pattern we have is to have one primary table with ‘business objects’ (BO) that uses e.g. an ‘UUID’ as the primary key (tweaked to be temporally sortable with a custom collator), and a number of secondary indices (SI-1..SI-n) of various types (UUIDs, custom DateTimes, integers, strings, …).
We need to serve a complex query of data (arbitrary complexity referencing any number of attributes of the BO, range checks, equality checks, with typical logical operators etc.).
Our current implementation makes an analysis of the query to find ‘candidate indexes’ and takes the one with the ‘highest priority’ (as pre-defined by a human at compile time) and uses that to make a rough initial filtering on either the range or equality test relevant for that index, then we post-process the results with our own custom decision diagram filtering (which is also later used for filtering live updates going forward) before returning data to the querying client.
Some of the queries would benefit tremendously of applying multiple indexes simultaneously, as it can e.g. be a combination of date-time range and the user owning a given BO. A simple example would be ’give me all my BO modified in the last week which are owned by user X’, where we have both a date-time and user index available - the individual indexes are both half-bad, but in combination they would yield a very good rough initial filtering.
Essentially what we are looking for is inner joins / multi-index searching capabilities, which would allow us to apply multiple indexes to our read cursors, both with range and equality checks using custom comparators/collators. We are primarily looking at AND:ing the secondary index queries.
We discussed one way to implement this ourselves previously (for AND with multiple indexes). The idea was to create a bloom filter based on the primary key for each secondary index except one, merge all of those bloom filters together, and then iterate over the final remaining secondary index and check vs. the aggregated bloom filter for existence. As the Bloom would guarantee no false negatives, we would be able to discard any record returned from the final index scan that did not exist in the Bloom. A false positive we can easily handle in our final post-filter stage anyhow (as we usually would have additional filter criteria on things that are not indexed, we need to do this regardless).
We have a business object type called ‘Trade’, which has the following attributes (it is simplified in this example):
Primary key: trade identifier
Secondary keys: portfolio, venue created datetime, host modified datetime, instrument, root strategy
Typical queries could be similar to:
In 1., we would like to use the index for venue created datetime in conjunction with the Portfolio.
In 2., we would like to use the index for venue created datetime in conjunction with the Instrument.
It could be even more indexes that matched, it is not necessarily limited to two.