-
Type:
Bug
-
Resolution: Unresolved
-
Priority:
Minor - P4
-
None
-
Affects Version/s: None
-
Component/s: None
-
Query Integration
-
None
-
3
-
TBD
-
None
-
None
-
None
-
None
-
None
-
None
-
None
During performance testing for the FLE Prefix/suffix/substring initiative https://jira.mongodb.org/browse/SPM-3012 , a critical performance issue was found in the substring use case. The root cause was due to the fact that encrypted text search predicates were only added to the aggregation language, not find (i.e $match expressions). Support for encrypted text predicates relied on an existing implementation for making a tag disjunction with the generated FLE tags (see EncryptedPredicate::makeTagDisjunction ) . This code was implemented during the development of encrypted Equality and encrypted Range predicates. Unfortunately, the tag disjunction that is generated in the aggregation case was not being optimized correctly, which led to the index on {}safeContent{} not being used at all.
This problem exists not only for text predicates, but also for aggregation expressions which use encrypted equality or range predicates.
An initial solution in the query optimizer was provided (see https://jira.mongodb.org/browse/SERVER-107804), which allows the rewriting of $in clauses in a $expr to a match equivalent such that the index scan could be leveraged.
However, in cases where the number of tags we are searching for is very high (i.e $encStrContains: {input: "$email", substring: "@gma"}, the residual predicate evaluation turns out to be far too slow for the acceptance testing. A performance comparison between a possible solution in the FLE rewrite step when compared to the QO solution showed the performance of the QO rewrite solution was twice as slow as that of the FLE rewrite step.
To overcome this issue, we implemented a special rewrite logic in query rewriter, which tries to convert an aggregate expression inside a $expr, by pushing down the $expr as much as possible to sub-expressions and converting the expressions to a match equivalent. Then, instead of using the aggregate disjunction for the tags, we generate a tag disjunction in the match language, which performs better than aggregate one using $in (see https://github.com/10gen/mongo/blob/master/src/mongo/db/query/fle/query_rewriter.cpp#L233). This was implemented under https://jira.mongodb.org/browse/SERVER-108179. This fix is only applicable to encrypted text search predicates, however, we should attempt the same for equality and range, as it is possible that in large collections with a high cardinality filter that we would see a similar issue for equality and range.
Proposed solution:
1) The current workaround introduced a new method TextSearchPredicate::rewriteToTagDisjunctionAsMatch. If we want to apply this same logic to equality and range, we could use the same approach as our usual rewrite path -> that is, creating a map of rewriters, mapping an Expression to a match rewriter. See relevant code in EncryptedPredicate.
// accepts an aggregation Expression* but returns a MatchExpression*
using ExpressionRewriteAsMatchFunction = std::function<std::unique_ptr<MatchExpression>(QueryRewriterInterface*, Expression*)>;
using ExpressionToRewriteAsMatchMap = stdx::unordered_map<std::type_index, std::vector<ExpressionRewriteAsMatchFunction>>;
extern ExpressionToRewriteAsMatchMap aggPredicateRewriteAsMatchMap;
Above, we create a new type of mapping for an Expression*, one that will return a unique_ptr<MatchExpression>.
We then need to modify the encrypted predicate registration code, to register the rewriters for the new agg to match rewrite map (see registration code), but in this case, call the new method rewriteToTagDisjunctionAsMatch().
/** * Register an agg rewrite if a condition is true at startup time. */ #define REGISTER_ENCRYPTED_AGG_PREDICATE_REWRITE_GUARDED(className, rewriteClass, isEnabledExpr) \ MONGO_INITIALIZER(encryptedAggPredicateRewriteFor_##className##_##rewriteClass) \ (InitializerContext*) { \ if (aggPredicateRewriteMap.find(typeid(className)) == aggPredicateRewriteMap.end()) { \ aggPredicateRewriteMap[typeid(className)] = std::vector<ExpressionRewriteFunction>(); \ } \ aggPredicateRewriteMap[typeid(className)].push_back([](auto* rewriter, auto* expr) { \ if (isEnabledExpr) { \ return rewriteClass{rewriter}.rewrite(expr); \ } else { \ return std::unique_ptr<Expression>(nullptr); \ } \ }); \ // Here, we need to register the rewriter for agg to match: if (aggPredicateRewriteAsMatchMap.find(typeid(className)) == aggPredicateRewriteAsMatchMap.end()) { \ aggPredicateRewriteAsMatchMap[typeid(className)] = std::vector<ExpressionRewriteAsMatchFunction>(); \ } \ aggPredicateRewriteAsMatchMap[typeid(className)].push_back([](auto* rewriter, auto* expr) { \ if (isEnabledExpr) { \ return rewriteClass{rewriter}.rewriteToTagDisjunctionAsMatch(expr); \ } else { \ return std::unique_ptr<Expression>(nullptr); \ } \ }); \ };
2) We must then move the method rewriteToTagDisjunctionAsMatch() so that it is a pure virtual method on the EncryptedPredicate base class.
3) 1 & 2 would allow us to refactor the code in QueryRewriter::_rewriteExprMatchExpression, instead of relying on a dynamic cast, we could use the rewriter map to find a rewriter for agg to match, and use that instead.
4) Optional: The special rewrite logic, which pushes $expr down in $and/$or trees is guarded by a check, which detects if we have an encrypted text search predicate in the expression tree. We may want to expand this logic, so that we also perform the special rewrites for equality and range.
This detection is currently handled by EncryptedTextSearchExpressionDetector. This check is very basic, traversing until it finds an encrypted text search node. We could extend coverage so that it finds encrypted equality or encrypted range expressions as well.
Note, we have not yet had any reports of performance issues with equality and range, but that may be because customers favour using the find language, and would not typically encounter this bug. We want to do steps 1-3 so that queries with encrypted text search predicates AND encrypted equality/range can also create match style disjunctions for the equality/range predicates of the query.
However, it should be safe to perform the special rewrite for equality and range in the absence of a text predicate.
Option 1)
Our unit tests rely on a virtual call that was introduced to QueryRewriter,[ _initializeTextSearchPredicateInternal|https://github.com/10gen/mongo/blob/master/src/mongo/db/query/fle/query_rewriter.cpp#L359]. This is required so that we can provide our custom QueryRewriter in query_rewriter_test.cpp with an test mock of a TextSearchPredicate that generates tags in a specific way, and detects payloads correctly in our unit tests.
I would recommend adding a new member to QueryRewriter, an unordered_map that maps <Expression*, std::unique_ptr<EncryptedPredicate>>. We could add a new virtual method that initializes the mapping with the encrypted predicates, so we can provide our own mocked predicates in our unit tests.
We could then provide a mapping of the Expressions to their corresponding EncryptedPredicate type, and use it to check if we have an expression that has an encrypted predicate:
std::unique_ptr<Expression> postVisit(Expression* exp) { // This detector only cares if we find at least one encrypted expression in the tree. // If we've already found one, we can return early. if (_detected) { return nullptr; } const auto& predicateMap = _queryRewriter.getEncryptedPredicateMap(); if (auto encryptedPredicateIter = exprRewrites.find(typeid(*exp)); encryptedPredicateIter != encryptedPredicateIter.end()) { if(encryptedPredicateIter->hasValidPayload(exp)) { _detected = true; } } return nullptr; }
Note: We could also improve the detector by being a bit smarter about how we traverse the tree. Currently, we detect encrypted predicates even if they are below an expression that can't be converted to a match expression. If we wanted to be really smart about it, we could traverse, and keep a stack with a context. We would need to implement
EncryptedTextSearchExpressionDetector: : preVisit(), push a context on the stack, which informs the traversal about whether or not the current subtree we are in allows conversion to a match expression. For some inspo, look at the subtree stack in the aggregation intender classes.
- is related to
-
SERVER-108179 FLE queries with encrypted text predicates fail to generate an index scan
-
- Closed
-
-
SERVER-107804 Investigate feasibility of using indexes for indexed fields in nested aggregate expressions
-
- Closed
-