Core Server / SERVER-4180

Allow multiple constraints (on same field) within a $elemMatch clause to be used with a multikey index

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.1.2
    • Component/s: Indexing, Querying
    • Labels:
    • Environment:
      linux ubuntu
    • Backport:
      No
    • Backward Breaking:
      No
    • # Replies:
      17
    • Last comment by Customer:
      true

      Description

      While upgrading from 1.8 to mongo 2.0.1, we noticed that some query optimization no longer seemed to work. After some laborious digging in our data, we tracked it down to a single document that shows the issue.

      Please open the attached .tgz. The tc.sh script will restore the document into a core.picture_set_pictures collection on a localhost:27017 mongo, set up an index, and explain a range query on a date embedded in an array of objects. The relevant section of the output shows:

      "indexBounds" :

      { "votes.created_at" : [ [ ISODate("2010-10-31T13:37:20Z"), ISODate("292278995-01--2147483647T07:12:56.808Z") ] ] }

      The upper bound is completely aberrant, and as far as I can tell, nothing in the document justifies such a value. As a consequence, on the real database, this scans half of the collection for every request. By contrast, 1.8.3 showed the correct upper bound. Please also note that, as far as we can tell, our collections contain other documents that trigger the problem (that is, a collection with just this document will exhibit the issue, but removing this one document from our collection is not enough to make the issue disappear).
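      For reference, here is a minimal sketch of the kind of index setup and range query the attached script explains. This is an illustrative reconstruction, not the attached tc.sh itself; the variable name picts is made up, and the date bounds are taken from the indexBounds output above and from the example later in this thread:

      // Illustrative reconstruction -- the actual commands are in the attached tc.sh.
      picts = db.getSiblingDB("core").picture_set_pictures;
      picts.ensureIndex( { 'votes.created_at':1 } );
      printjson( picts.find( { 'votes.created_at':{ $gte: ISODate("2010-10-31T13:37:20Z"),
                                                    $lte: ISODate("2010-10-31T13:37:35Z") } } ).explain() );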

      1. testcase.date_broken_range.tgz
        0.8 kB
        Mathieu Poumeyrol


          Activity

          auto
          added a comment -

          Author: {u'login': u'astaple', u'name': u'Aaron', u'email': u'aaron@10gen.com'}

          Message: SERVER-4180 Document FieldRange and FieldRangeSet constructors.
          Branch: master
          https://github.com/mongodb/mongo/commit/af775374ea8b2184cd4c311468c2f86338876d9c

          Aaron Staple (Inactive)
          added a comment -

          We've implemented an optimization allowing the $elemMatch operator to utilize multiple bounds on a multikey index. So if you run the query { a:{ $elemMatch:{ $gte:0, $lte:10 } } } against the index { a:1 }, only the a-values 0 through 10 will be scanned.

          This optimization doesn't apply to nested $elemMatch queries, for example { a:{ $elemMatch:{ b:{ $gte:0, $lte:10 } } } } on the index { 'a.b':1 }. Since the $elemMatch operator is applied to a and not 'a.b', only one bound constraint can be applied to the values of 'a.b', and in this case all 'a.b' values >= 0 would be scanned.

          In that situation you might consider a query like { 'a.b':{ $elemMatch:{ $gte:0, $lte:10 } } } instead. Keep in mind, though, that (pending SERVER-6050) $elemMatch can only be applied to arrays. So the query { 'a.b':{ $elemMatch:{ $gte:0, $lte:10 } } } would match the documents { _id:0, a:{ b:[ 5 ] } } and { _id:1, a:[ { b:[ 5 ] } ] }, but not { _id:2, a:[ { b:5 } ] }.
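          A minimal shell sketch of the matching behavior described above, mirroring the example documents in this comment (the collection name is hypothetical and the values are illustrative only):

          t = db.elemmatch_example;                   // hypothetical collection, for illustration
          t.drop();
          t.save( { _id:0, a:{ b:[ 5 ] } } );         // 'a.b' is an array: matched
          t.save( { _id:1, a:[ { b:[ 5 ] } ] } );     // 'a.b' is an array inside an array of subdocuments: matched
          t.save( { _id:2, a:[ { b:5 } ] } );         // 'a.b' is a scalar: not matched (pending SERVER-6050)
          t.ensureIndex( { 'a.b':1 } );
          // Returns _id:0 and _id:1 only; on the { 'a.b':1 } index the scan should be bounded to values 0 through 10.
          t.find( { 'a.b':{ $elemMatch:{ $gte:0, $lte:10 } } } ).forEach( printjson );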

          Mathieu Poumeyrol
          added a comment -

          I'm sorry to be a pain here, but this fix is completely irrelevant to the issue.

          Worse, since the implementation of support for scalar values in $elemMatch was the root cause of this regression appearing in the first place, I fear these changes have taken us even further from an actual resolution.

          I insist: this is a sneaky but major regression from 1.8 (a few other people have mentioned difficulties with the same problem on the IRC channel). I sincerely hope we can have it fixed for 2.2. In any case, we definitely need to reopen it.

          Aaron Staple (Inactive)
          added a comment -

          Hi Mathieu - I suggest you change your schema slightly and add an elemMatch to your original query, per the following example. If SERVER-6050 is implemented you will not have to change the schema from your original. Feel free to comment on that issue.

          c = db.c;
          c.drop();
          c.save( {
              "_id" : "--ITkd13HZw:c9o5i05059t3t-Xw-EIPkVE90",
              "votes" : [
                  {
                      "browser_id" : "fwAAAU5IAJajhSIyVZiGAg==",
                      "kind" : "nominate",
                      "pending" : { "review" : true },
                      "created_at" : [ ISODate("2011-10-29T15:49:46.592Z") ],
                      "sql_id" : 3285089,
                      "_id" : 3285089,
                      "user_ip" : "189.82.222.127",
                      "user" : "c9o5i05059t3t",
                      "score" : 1
                  },
                  {
                      "browser_id" : "fwAAAU6r166/BhUvNXcCAg==",
                      "comment" : "",
                      "kind" : "score",
                      "created_at" : [ ISODate("2011-10-29T16:31:30.014Z") ],
                      "sql_id" : 3285177,
                      "_id" : 3285177,
                      "relevance" : 1,
                      "origin" : "albums/fotopedia-en-Chaise_longue",
                      "user_ip" : "93.16.21.63",
                      "user" : "oGrgsfdBDZM",
                      "score" : 2,
                      "quality" : 1
                  }
              ]
          } );
          c.ensureIndex( { 'votes.created_at':1 } );
          printjson( c.find( { 'votes.created_at':{ $elemMatch:{ "$gte" : ISODate("2010-10-31T13:37:20Z"),
                                                                 "$lte" : ISODate("2010-10-31T13:37:35") } } }
                           ).hint("votes.created_at_1").explain() );
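          With the optimization described above, the explain output for this query should report indexBounds for 'votes.created_at' that are bounded on both ends by the requested $gte/$lte dates, rather than the open-ended upper bound shown in the description.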
          
          Mathieu Poumeyrol
          added a comment - edited

          This collection is big, and several collections are actually impacted.

          I asked formally what was going to happen. Once I understood what was happening, I gave you a testable case, and I put together a proof-of-concept patch. You told me there would be a fix for 2.2.

          Now, eight months later, I'm stuck with a choice between a "maybe" and a huge, complicated data migration on several collections, including the biggest and most critical one in my dataset.

          Please schedule SERVER-6050 for 2.2.

          I understand the rush for a 2.2 release... but come on. In its current state, seeing this qualified as "fixed" feels like a very bad joke on me.


            People

            • Votes:
              2
            • Watchers:
              8

              Dates

              • Days since reply:
                1 year, 44 weeks, 3 days ago