Investigate feasibility of using indexes for indexed fields in nested aggregate expressions

XMLWordPrintableJSON

    • Query Optimization
    • None
    • 3
    • TBD
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      During performance testing for the FLE Prefix/suffix/substring initiative https://jira.mongodb.org/browse/SPM-3012 , a critical performance issue was found in the substring use case.

      The root cause is due to the way that the server rewrite of the text predicates is done. In particular, when generating a tag disjunction for aggregate expressions, the following aggregate expression is generated:

      {
         "$expr" : {
             "$or" : [
      		{
      		    "$in" : [
      			      1,
      			      "$a"
      			     ]
      		},
      		{
      		    "$in" : [
      			      2,
      			     "$a"
      			    ]
      		}
      		]
      	}
      }, 

      Where "$a" corresponds to the {}safeContent{} array, on which there is an index. This is a simplified example. In the actual substring case, we may have 1000+ nested $in expressions on "$a". In the DSI workload, we see a simple substring query take up to 14 minutes to return. 

      Upon investigation, the performance issue is due the query planner choosing a collection scan instead of leveraging the index scan. 

      It is currently not possible to express the query using match expressions, because during syntax review for the new FLE expressions, it was decided that we do not want to introduce new expressions into the match language.

      The FLE query rewrite code is designed in a way in which it is only possible to rewrite an aggregation expression with an encrypted payload into another aggregation expression with tags for the encrypted predicate, or a match expression into a match expression with the tags for the encrypted predicate. 

      It is currently not possible to express the query as :

      { a: $in [1,2,..]}

      Because that would require that $encStrContains have a match expression implementation, i.e :

      {$encStrContains: { input: "$a", substring: "eve"}}

      It is also not possible to express the existing query with a condensed $in aggregate expression, as it would require something like this :

      // This does not work, $in aggregate expression semantics mean that 
      // this expression would check for the existence of the array [1,2,3,4] //as an array element of the array $a 
      {"$expr" : { "$in" : [ [1,2,3,4] , "$a" ] } }   

      Request:

      We would like to understand what the feasibility of leveraging the index on "a" (i.e '_safeContent_') is for a nested aggregate expression like in the examples above.

      Note, this may be related to indexing on array fields. The following also does not use the index:

      db.coll.find({$expr: {$in: [1, "$a"]}})

       

            Assignee:
            Alya Berciu
            Reporter:
            Santiago Roche
            Votes:
            0 Vote for this issue
            Watchers:
            14 Start watching this issue

              Created:
              Updated: