Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-47925

Add a new $elemMatch-like operator which matches both arrays and scalars

    • Type: Icon: New Feature New Feature
    • Resolution: Unresolved
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 4.0.16
    • Component/s: Querying
    • Labels:
    • Query Execution

      Imagine a collection where the field "a" can store either an integer or an array of integers:

      > db.c.find();
      {_id: 1, a: 3}
      {_id: 2, a: [4, 11]}
      {_id: 3, a: [-1, 10]}
      

      Now, suppose that a user wishes to find documents where either "a" is a number in the interval [0, 10], or "a" is an array containing at least one number in the interval [0, 10]. This query should match documents _id:1 and _id:2, but not _id:3. A naive way to write this query would be like so:

      > db.c.find({a: {$gte: 0, $lte: 10}});
      {_id: 1, a: 3}
      {_id: 2, a: [4, 11]}
      {_id: 3, a: [-1, 10]}
      

      As you can see, the semantics of this query are to match all three documents, which was not our intention. This is because there is no requirement that an individual array element satisfy both the $gte and the $lte predicate simultaneously. In order to express this simultaneity for array elements, the match language offers $elemMatch:

      > db.c.find({a: {$elemMatch: {$gte: 0, $lte: 10}}});
      {_id: 2, a: [4, 11]}
      

      Unfortunately, as you can see above, this query is not quite right either. _id:3 no longer matches, which is good, but _id:1 also no longer matches. Recall that the query which we want to express should match _id:1 and _id:2. However, the $elemMatch predicate filters out documents where "a" is not an array. This scenario motivates a feature request. We should have an operator in the match language which is akin to $elemMatch, but also matches scalars:

      > db.c.find({a: {$inventedNewKindOfElemMatch: {$gte: 0, $lte: 10}}});
      {_id: 1, a: 3}
      {_id: 2, a: [4, 11]}
      

      This hypothetical new operator would also allow the query planner to use tight index bounds over multikey indexes, which should avoid any slow plans with loose index bounds such as those discussed in the original issue report below.

      Original description

      We have a collection of 28 million records that have a IP field that is sometimes a NumberInt (for IPv4), sometimes BinData (for IPv6), and sometimes an array of two elements of either type (we sometimes need to track two addresses).

      Historically this was a regular index – IP was just a scalar – and way back in those days we had scripts that would run range queries to find people associated with a CIDR block or an ASN routing block. Before we went multi-key these range queries were super fast however today I found that now a range query is so slow I'm not sure it's actually doing anything:

      db.accounts.count({IP:{$gte:NumberInt(-656117760),$lte:NumberInt(-656116737)}})

      I waited for over 5 minutes before aborting it.

      If I convert this into a dumb loop of 1023 scalar queries it's super fast:

      var count = 0; for(var i = -656117760; i <= -656116737; ++i) { count += db.accounts.count(

      {IP:NumberInt(i)}

      ); } print(count);

      This took about 2 seconds, if that.

      I double checked with query-explain and it looks like it's still doing an IXSCAN so I can't understand why the smart query should take so long (see attached explain.txt). The only difference seems to be that when it's scanning a multi-key index the index bounds are infinite on one side.

      I did a quick skim of the 4.2 release notes and didn't see any mention of this being improved - sorry if I missed it.

        1. explain.txt
          5 kB
          shaggie76

            Assignee:
            backlog-query-execution [DO NOT USE] Backlog - Query Execution
            Reporter:
            glen.miner Glen Miner
            Votes:
            1 Vote for this issue
            Watchers:
            12 Start watching this issue

              Created:
              Updated: