[SERVER-16505] .explain on query with $gt never finish Created: 11/Dec/14  Updated: 15/May/15  Resolved: 15/May/15

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Querying
Affects Version/s: 2.6.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Hendrik Swanepoel Assignee: Ramon Fernandez Marina
Resolution: Cannot Reproduce Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File explained.txt     File query_plan.json    
Operating System: ALL
Participants:

 Description   

We have a collection with approximately 5 million records.
One of the fields is called last_updated, it's a date field and we created an index for it.

When we query the collection with

{last_updated: {$gt: new Date(2012,5,1)}}

we get the cursor back very quickly. Similar with a count.

When we add a .explain() to the query, it never completes (we waited 30 minutes) and our disks go berserk with read activity.

When we change the query to

{last_updated: new Date(2012,5,1)}

and run it with .explain(), it behaves as expected.

As an aside:

We had issues creating that index in production, where we've got replica sets and sharding (and about 10 million records). We pulled some of our data into a standalone instance so we could experiment with the index without affecting production. The production issue had similar symptoms (disks slammed with read activity). We didn't run .explain() to trigger it, but it seems spookily similar to this issue. Simply deleting the index would make the system operate normally again.



 Comments   
Comment by Ramon Fernandez Marina [ 20/Mar/15 ]

The "explain" output with limit(1) is not very useful as it shows that the query is doing a collection scan (which is fast enough for one document). As for the "query_plan.json" output, it shows:

    "cursor" : "BtreeCursor last_updated_1",
    "n" : 5906203,
    "nscannedObjects" : 5906203,
    "nscanned" : 5906203,
    "nYields" : 350922,
    ...
    "yields" : 350922,
    "needFetch" : 332825,

This looks like a large index scan followed by a lot of fetching from disk; in other words, the data was "cold", and the server yielded many times while waiting for documents to be read from disk.

The part that I haven't seen before is the "Date()" in the index bounds:

    "indexBounds" : {
        "last_updated" : [
            [
                ISODate("2012-05-31T22:00:00.000Z"),
                Date(9223372036854775807)
            ]
        ]
    },

I'll look into this, but so far I haven't had any luck reproducing. hendrik@22seven.com, does a repeated run of the query with explain(true) show the same problem after the data has been paged into memory?
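On the Date(9223372036854775807) upper bound: 9223372036854775807 is 2^63 - 1, so this looks like the server's internal "maximum date" sentinel for the open-ended $gt range leaking into the explain output (my reading, not confirmed in this ticket). An ECMAScript Date can only represent ±8,640,000,000,000,000 ms from the Unix epoch, so a Date built from that sentinel is invalid, which would also explain the ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ") rendering in the reproduction attempt later in this thread:

```javascript
// 2**63 - 1 ms is far outside the range a JavaScript Date can hold
// (note the literal also exceeds Number.MAX_SAFE_INTEGER, but either
// way it is well beyond the Date range limit of 8.64e15 ms).
const sentinel = new Date(9223372036854775807);
console.log(Number.isNaN(sentinel.getTime())); // true -> "Invalid Date"

// The largest representable Date, for comparison:
console.log(new Date(8640000000000000).toISOString()); // +275760-09-13T00:00:00.000Z
```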

Comment by Hendrik Swanepoel [ 13/Jan/15 ]

I attached a file containing the output after letting it run for 50 minutes. It seems like it's hitting the index, but then scanning all the documents? (query_plan.json)

Comment by Mor G [ 11/Jan/15 ]

Happens to me too.

db.collection_name.find() - returns normally
db.collection_name.find().explain() - never returns
db.collection_name.find().limit(1).explain() - returns

Attached the output of the explain result with limit(1)

Comment by Mor G [ 11/Jan/15 ]

Explained output

Comment by Ramon Fernandez Marina [ 11/Dec/14 ]

hendrikswan, can you clarify whether this issue appears on a standalone instance? I'm unable to reproduce this behavior on my end:

  • started a 2.6.6 standalone instance
  • inserted 5M documents that look like this:

    > db.foo.findOne()
    {
            "_id" : ObjectId("5489e6abe7f8cb1244ec9179"),
            "x" : 512,
            "d" : ISODate("1996-11-22T21:01:59Z")
    }

  • added an index to the d field:

    > db.foo.ensureIndex({d:1})

  • ran a query using $gt and explain():

    > db.foo.find({d:{$gt: new Date(2000,01,01)}}).explain()
    {
            "cursor" : "BtreeCursor d_1",
            "isMultiKey" : false,
            "n" : 1652339,
            "nscannedObjects" : 1652339,
            "nscanned" : 1652339,
            "nscannedObjectsAllPlans" : 1652339,
            "nscannedAllPlans" : 1652339,
            "scanAndOrder" : false,
            "indexOnly" : false,
            "nYields" : 12908,
            "nChunkSkips" : 0,
            "millis" : 1297,
            "indexBounds" : {
                    "d" : [
                            [
                                    ISODate("2000-02-01T05:00:00Z"),
                                    ISODate("0NaN-NaN-NaNTNaN:NaN:NaNZ")
                            ]
                    ]
            },
            "server" : "tab:4444",
            "filterSet" : false
    }

  • I also reduced the slowms threshold to zero:

    db.setProfilingLevel(0,0)

    and this is what I see in the logs:

    2014-12-11T13:58:01.817-0500 [conn18] query test.foo query: { query: { d: { $gt: new Date(949381200000) } }, $explain: true } planSummary: IXSCAN { d: 1.0 } ntoreturn:0 ntoskip:0 nscanned:1652339 nscannedObjects:1652339 keyUpdates:0 numYields:1 locks(micros) r:2268912 nreturned:1 reslen:1121 1297ms

What do your documents look like, and what other indexes do you have? My guess is that the query planner is choosing an index that appears to be a better choice but ends up being sub-optimal and causes a table scan somewhere. Can you try adding limit(1) to the query with explain() and post the output? Here's the equivalent for the example above:

db.foo.find({d:{$gt: new Date(2000,01,01)}}).limit(1).explain(true)

Generated at Thu Feb 08 03:41:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.