|
Hi royrez@microsoft.com,
We have reviewed the change and decided not to backport it, in order to avoid any risk of destabilizing the count path in 3.2. If this fix is critical for your application, please upgrade to version 3.4, which contains the fix.
I don't think there is an easy workaround, but if we find one we will describe it on this ticket.
Best,
Dave
|
|
Dave,
Is there any workaround for the 3.2 branch?
I'm seeing very slow queries that, according to "explain", should run with the COUNT_SCAN plan, but in practice run with the IXSCAN plan, causing them to run VERY slowly.
It also looks like a minor change; maybe it is worth backporting?
Thanks.
|
|
Author:
{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}
Message: SERVER-22133 generate COUNT_SCAN plans properly from the plan cache
Branch: master
https://github.com/mongodb/mongo/commit/dac2f722f8aaec2917dce98862d58abde2605f7a
|
|
Dave, you are absolutely correct. SERVER-22093 is in fact what I am seeing with the simple aggregation query examples (with explain plans) that I provided.
I reviewed the comments in SERVER-22093, and it seems that, as written, that ticket addresses a sub-case of a bigger issue alluded to by @redbeard0531 in https://jira.mongodb.org/browse/SERVER-22093?focusedCommentId=1129757&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1129757.
The real solution is for aggregation/find to realize that all the fields needed by later stages/the projection are in fact already in an index, and either avoid the FETCH stage entirely or perform a more specific FETCH_INDEX stage (which would only retrieve data already loaded with the index). This would cover the more limited case of count as well as provide an optimization for all the aggregations I am doing where the index contains all the fields.
So, fixing COUNT is actually a limited subset of a "skip-fetch-if-all-fields-are-already-covered-by-index" issue in the plan generation pipeline. I think that should be the essence of a new ticket (perhaps in place of SERVER-22093).
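For example (a hypothetical collection c and index, not our schema), this is the kind of plan difference I mean; the covered form should avoid FETCH entirely:
db.c.ensureIndex({a: 1, b: 1});
// Covered: the filter and projection use only indexed fields and _id is excluded,
// so the plan should be an IXSCAN with no FETCH stage.
db.c.find({a: 1}, {_id: 0, a: 1, b: 1}).explain();
// Not covered: without the projection, every matching document is fetched (FETCH over IXSCAN).
db.c.find({a: 1}).explain();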
Thanks.
|
|
Hi apara,
Great question. The .find() operation takes over 80 seconds because it cannot be covered without a user-supplied projection excluding all non-indexed fields. However, you are absolutely right that there is no reason your count operation using the .aggregate() command cannot be covered. The aggregation system does try to be smart about analyzing which fields it needs to compute its results, projecting out all other fields. In this particular case it identifies that it does not need any fields from the user documents and generates a special "no fields needed" projection. However, the underlying query machinery does not understand this special projection, resulting in a non-covered query plan. This is a known bug described by SERVER-22093. I suspect that, due to this known issue, your aggregation is fetching documents from the collection, causing it to be much slower than the covered COUNT_SCAN plan. If you would like to definitively confirm that this is the case, please post the output from explaining your aggregation pipeline.
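For reference, a sketch of how to get that explain output on 3.2, using the pipeline from your earlier logs (the explain option to the aggregate command should show whether a FETCH stage is present in the plan):
db.runCommand({
    aggregate: "events-qos-loadstart",
    pipeline: [
        {$match: {"vd.ple.vd.acc": "EY",
                  "vd.ple.vd.pid": "313c0296-5469-59f7-7cbe-5b818a2e657c",
                  "vd.pec": {$gte: 2}}},
        {$group: {_id: {}, count: {$sum: 1}}}
    ],
    explain: true
});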
So, to summarize, there are basically three issues at play here:
- COUNT_SCAN plans are generated incorrectly from the plan cache. This is the originally reported issue, and it is the one that will be addressed by a code fix under this ticket.
- Aggregation cannot generate covered plans for pipelines which are pure count operations. This issue is described in SERVER-22093.
- Count operations expressed as an aggregation pipeline cannot use COUNT_SCAN (see the sketch after this list). As I mentioned before, this is more of a feature request than a bug. To my knowledge, there is no ticket describing this work. I can file one on your behalf if you like.
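To make the second and third points concrete, here are the two ways of expressing the same count, using the collection and predicate from your logs (a sketch; only the count command form can use the COUNT_SCAN fast path today):
// Count command: eligible for the covered COUNT_SCAN fast path (the first issue above is
// about that fast path being lost when the plan comes from the plan cache).
db.getCollection('events-qos-loadstart').count({
    "vd.ple.vd.acc": "EY",
    "vd.ple.vd.pid": "313c0296-5469-59f7-7cbe-5b818a2e657c",
    "vd.pec": {$gte: 2}
});
// Equivalent count expressed as an aggregation pipeline: per SERVER-22093 it is not covered
// today, and it cannot use COUNT_SCAN (the third issue above).
db.getCollection('events-qos-loadstart').aggregate([
    {$match: {"vd.ple.vd.acc": "EY",
              "vd.ple.vd.pid": "313c0296-5469-59f7-7cbe-5b818a2e657c",
              "vd.pec": {$gte: 2}}},
    {$group: {_id: {}, count: {$sum: 1}}}
]);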
Best,
Dave
|
|
Thanks David. The real question is: if all my data is in an index and COUNT_SCAN can match all that data in under 200 ms, why does it take over 80 seconds for IXSCAN to do the same thing? Again, we are not using any data which is not covered by the index, so the aggregation is running on fields which are ALL covered by the index.
We are only talking about a subset of about 232,691 documents.
The 200 milliseconds vs 80 seconds is a significant performance difference.
|
|
Hi apara,
Good news. I think I have been able to develop a simple repro script which demonstrates what is happening:
// Set up data and indexes.
db.c.drop();
for (var i = 1; i <= 6; i++) { db.c.insert({a: 1, b: 1, c: i}); }
db.c.ensureIndex({a: 1, b: 1, c: 1, d: 1});
db.c.ensureIndex({a: 1, z: 1, c: 1});

// Enable logLevel 1 so that "planSummary" strings appear in the logs.
db.setLogLevel(1);

// Logs should show a COUNT_SCAN plan.
db.c.count({a: 1, b: 1, c: {$gt: 4}});

// Create a plan cache entry using the same predicate.
db.c.find({a: 1, b: 1, c: {$gt: 4}}).itcount();

// This time IXSCAN will show up as the planSummary, but it should be COUNT_SCAN.
db.c.count({a: 1, b: 1, c: {$gt: 4}});
The problem has to do with how we use the plan cache to construct the execution plans for count queries. The fast count operations which you observe are using a special COUNT_SCAN plan, which is an optimized fast-path for answering count operations. The fast path is covered, which means that it can execute by reading indexes only, without ever fetching data from the collection. The fix will allow count operations coming from the plan cache to properly make use of a COUNT_SCAN plan. However, note that aggregations cannot use COUNT_SCAN, and that a fix for this ticket will not allow use of COUNT_SCAN for aggregation operations. While this would be a good way to optimize aggregation pipelines like yours, the work to make that happen is complex (closer to a new feature than a bug fix).
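For reference, one way to watch this on the repro above is with the shell's plan cache helpers (a sketch; the exact output format varies by version):
// List the cached query shapes and inspect the cached plans for the repro predicate.
db.c.getPlanCache().listQueryShapes();
db.c.getPlanCache().getPlansByQuery({a: 1, b: 1, c: {$gt: 4}}, {}, {});
// Clearing the cache lets the next count use the COUNT_SCAN fast path again,
// until a find with the same shape repopulates the cache.
db.c.getPlanCache().clear();
db.c.count({a: 1, b: 1, c: {$gt: 4}});   // planSummary should be COUNT_SCAN again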
I think we have enough information from you, so I am going to move this ticket out of the "debugging with submitter" state. Please continue to watch for updates as work on this progresses.
Best,
Dave
|
|
Hi apara,
Apologies for the delayed response; things have been busy here. I'm planning on looking into this more today. It sounds like we have some leads from the information you've already provided, but if I need any more information from you I'll definitely ping you on this ticket. Thanks for your willingness to help us track this down!
Best,
Dave
|
|
David, let me know how I can help with debugging this issue. We are currently setting up another SECONDARY replica so that we can stop/restart the instance in question without affecting our users. We can set up a time, or I can do some more tests.
|
|
Here is another example:
Jan 12 16:20:33 ec2-52-23-180-212 mongod.27000[29049]: [conn14780] command cdc.$cmd command: aggregate { aggregate: "events-qos-loadstart", pipeline: [ { $match: { vd.ple.vd.acc: "EY", vd.ple.vd.pid: "313c0296-5469-59f7-7cbe-5b818a2e657c", vd.pec: { $gte: 2.0 } } }, { $group: { _id: {}, count: { $sum: 1.0 } } } ] } keyUpdates:0 writeConflicts:0 numYields:5174 reslen:83 locks:{ Global: { acquireCount: { r: 10354 }, acquireWaitCount: { r: 1563 }, timeAcquiringMicros: { r: 11693027 } }, Database: { acquireCount: { r: 5177 } }, Collection: { acquireCount: { r: 5177 } } } 98170ms
|
|
Jan 12 16:22:59 ec2-52-23-180-212 mongod.27000[29049]: [conn14780] command cdc.$cmd command: count { count: "events-qos-loadstart", query: { vd.ple.vd.pid: "313c0296-5469-59f7-7cbe-5b818a2e657c", vd.ple.vd.acc: "EY", vd.pec: { $gte: 2.0 } }, fields: {} } planSummary: IXSCAN { vd.ple.vd.acc: 1, vd.ple.vd.pid: 1, vd.pec: 1, vd.ts: 1 } keyUpdates:0 writeConflicts:0 numYields:4493 reslen:44 locks:{ Global: { acquireCount: { r: 8988 }, acquireWaitCount: { r: 1195 }, timeAcquiringMicros: { r: 10605847 } }, Database: { acquireCount: { r: 4494 } }, Collection: { acquireCount: { r: 4494 } } } 81287ms
|
|
CLEAR PLAN
|
|
Jan 12 16:23:24 ec2-52-23-180-212 mongod.27000[29049]: [conn14780] command cdc.$cmd command: count { count: "events-qos-loadstart", query: { vd.ple.vd.pid: "313c0296-5469-59f7-7cbe-5b818a2e657c", vd.ple.vd.acc: "EY", vd.pec: { $gte: 2.0 } }, fields: {} } planSummary: COUNT_SCAN { vd.ple.vd.acc: 1, vd.ple.vd.pid: 1, vd.pec: 1, vd.ts: 1 } keyUpdates:0 writeConflicts:0 numYields:3058 reslen:44 locks:{ Global: { acquireCount: { r: 6118 } }, Database: { acquireCount: { r: 3059 } }, Collection: { acquireCount: { r: 3059 } } } 136ms
|
|
|
|
At this point I can re-create the slow find issue on demand simply by running an aggregation. After the aggregation runs, the FIND command goes back to running slow.
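Condensed, the on-demand reproduction looks like this (same predicate as in the logs above; a sketch using the 3.2 shell plan cache helpers):
var coll = db.getCollection('events-qos-loadstart');
var pred = {"vd.ple.vd.acc": "EY",
            "vd.ple.vd.pid": "313c0296-5469-59f7-7cbe-5b818a2e657c",
            "vd.pec": {$gte: 2}};

coll.getPlanCache().clear();
coll.count(pred);                           // fast: planSummary COUNT_SCAN
coll.aggregate([{$match: pred},             // slow; leaves an IXSCAN plan in the cache
                {$group: {_id: {}, count: {$sum: 1}}}]);
coll.count(pred);                           // slow again: planSummary IXSCAN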
|
|
Here is the REAL problem that I am trying to solve.
When I execute this command:
db.runCommand(
    {
        "aggregate": "events-qos-loadstart",
        "pipeline": [{
            "$match": {
                "vd.ple.vd.acc": "EY",
                "vd.ple.vd.pid": "313c0296-5469-59f7-7cbe-5b818a2e657c",
                "vd.pec": {"$gte": 2}
            }
        },
        {
            "$group": {
                "_id": {
                    "acc": "$vd.ple.vd.acc"
                },
                "rR8Rr": {
                    "$sum": 1
                }
            }
        }]
    }
);
|
It takes quite a while to run (over a minute!):
Jan 12 15:58:35 ec2-52-23-180-212 mongod.27000[29049]: [conn14780] command cdc.$cmd command: aggregate { aggregate: "events-qos-loadstart", pipeline: [ { $match: { vd.ple.vd.acc: "EY", vd.ple.vd.pid: "313c0296-5469-59f7-7cbe-5b818a2e657c", vd.pec: { $gte: 2.0 } } }, { $group: { _id: { acc: "$vd.ple.vd.acc" }, rR8Rr: { $sum: 1.0 } } } ] } keyUpdates:0 writeConflicts:0 numYields:4490 reslen:95 locks:{ Global: { acquireCount: { r: 9098 }, acquireWaitCount: { r: 1043 }, timeAcquiringMicros: { r: 7052888 } }, Database: { acquireCount: { r: 4549 } }, Collection: { acquireCount: { r: 4549 } } } 78725ms
|
If I execute the find command, right after this, it takes about the same time to run:
db.getCollection('events-qos-loadstart').find({"vd.ple.vd.pid": "313c0296-5469-59f7-7cbe-5b818a2e657c","vd.ple.vd.acc": "EY", "vd.pec": {"$gte": 2}}).size()
|
|
Jan 12 16:01:57 ec2-52-23-180-212 mongod.27000[29049]: [conn14780] command cdc.$cmd command: count { count: "events-qos-loadstart", query: { vd.ple.vd.pid: "313c0296-5469-59f7-7cbe-5b818a2e657c", vd.ple.vd.acc: "EY", vd.pec: { $gte: 2.0 } }, fields: {} } planSummary: IXSCAN { vd.ple.vd.acc: 1, vd.ple.vd.pid: 1, vd.pec: 1, vd.ts: 1 } keyUpdates:0 writeConflicts:0 numYields:4544 reslen:44 locks:{ Global: { acquireCount: { r: 9090 }, acquireWaitCount: { r: 1304 }, timeAcquiringMicros: { r: 10582063 } }, Database: { acquireCount: { r: 4545 } }, Collection: { acquireCount: { r: 4545 } } } 80866ms
|
So, you could say that it's simply taking a long time to get through the index. However, if I clear the plans, the find runs fast:
Jan 12 16:02:49 ec2-52-23-180-212 mongod.27000[29049]: [conn14780] command cdc.$cmd command: count { count: "events-qos-loadstart", query: { vd.ple.vd.pid: "313c0296-5469-59f7-7cbe-5b818a2e657c", vd.ple.vd.acc: "EY", vd.pec: { $gte: 2.0 } }, fields: {} } planSummary: COUNT_SCAN { vd.ple.vd.acc: 1, vd.ple.vd.pid: 1, vd.pec: 1, vd.ts: 1 } keyUpdates:0 writeConflicts:0 numYields:3058 reslen:44 locks:{ Global: { acquireCount: { r: 6118 }, acquireWaitCount: { r: 42 }, timeAcquiringMicros: { r: 92598 } }, Database: { acquireCount: { r: 3059 } }, Collection: { acquireCount: { r: 3059 } } } 228ms
|
The aggregate command, however, still runs very slowly. After the aggregate command executes, the find goes back to running slowly until the plans are cleared.
It may sound like we have a lot of data and the find simply takes a long time to scan the index, but the count does the "same" scan after the plans are cleared and executes in under half a second.
Also, it seems that aggregation creates the plans, and once the plans are created, things run slowly.
Why?
|
|
It seems that after the plans are created, the query that used to run fast now runs slow...
|
|
I re-executed the following code after clearing out the plans:
db.getCollection('events-qos-loadstart').getPlanCache().getPlansByQuery({"vd.ple.vd.pid": "57de9139-cc7e-4b5c-51fd-aaa8517028f0","vd.ple.vd.acc": "EY"},{},{})
|
db.getCollection('events-qos-loadstart').getPlanCache().getPlansByQuery({"vd.ple.vd.pid": "57de9139-cc7e-4b5c-51fd-aaa8517028f0","vd.ple.vd.acc": "EY", "vd.pec": {"$gte": 2}},{},{})
|
Neither produced any results.
|
|
So, I ran this query and it was FAST:
db.getCollection('events-qos-loadstart').count({"vd.ple.vd.pid": "57de9139-cc7e-4b5c-51fd-aaa8517028f0","vd.ple.vd.acc": "EY"})
|
Then, I ran this query and it was SLOW:
db.getCollection('events-qos-loadstart').count({"vd.ple.vd.pid": "57de9139-cc7e-4b5c-51fd-aaa8517028f0","vd.ple.vd.acc": "EY", "vd.pec": {"$gte": 2}})
|
I looked at the server log to see what the slow query logger returned:
Jan 12 14:50:26 ec2-52-23-180-212 mongod.27000[29049]: [conn14780] command cdc.$cmd command: count { count: "events-qos-loadstart", query: { vd.ple.vd.pid: "57de9139-cc7e-4b5c-51fd-aaa8517028f0", vd.ple.vd.acc: "EY", vd.pec: { $gte: 2.0 } }, fields: {} } planSummary: IXSCAN { vd.ple.vd.acc: 1, vd.ple.vd.pid: 1, vd.pec: 1, vd.ts: 1 } keyUpdates:0 writeConflicts:0 numYields:2737 reslen:44 locks:{ Global: { acquireCount: { r: 5476 }, acquireWaitCount: { r: 651 }, timeAcquiringMicros: { r: 6638046 } }, Database: { acquireCount: { r: 2738 } }, Collection: { acquireCount: { r: 2738 } } } 45219ms
|
As you have suggested, I executed the following for the first query that ran fast:
db.getCollection('events-qos-loadstart').getPlanCache().getPlansByQuery({"vd.ple.vd.pid": "57de9139-cc7e-4b5c-51fd-aaa8517028f0","vd.ple.vd.acc": "EY"},{},{})
|
This did not produce anything.
Then I executed the same for the query that ran slow:
db.getCollection('events-qos-loadstart').getPlanCache().getPlansByQuery({"vd.ple.vd.pid": "57de9139-cc7e-4b5c-51fd-aaa8517028f0","vd.ple.vd.acc": "EY", "vd.pec": {"$gte": 2}},{},{})
|
This resulted in the following output:
[
    {
        "details" : {
            "solution" : "(index-tagged expression tree: tree=Node\n---Leaf { vd.ple.vd.acc: 1, vd.ple.vd.pid: 1, vd.pec: 1, vd.ts: 1 }, pos: 0\n---Leaf { vd.ple.vd.acc: 1, vd.ple.vd.pid: 1, vd.pec: 1, vd.ts: 1 }, pos: 1\n---Leaf { vd.ple.vd.acc: 1, vd.ple.vd.pid: 1, vd.pec: 1, vd.ts: 1 }, pos: 2\n)"
        },
        "reason" : {
            "score" : 2.0003,
            "stats" : {
                "stage" : "FETCH",
                "nReturned" : 101,
                "executionTimeMillisEstimate" : 0,
                "works" : 101,
                "advanced" : 101,
                "needTime" : 0,
                "needFetch" : 0,
                "saveState" : 1,
                "restoreState" : 1,
                "isEOF" : 0,
                "invalidates" : 0,
                "docsExamined" : 101,
                "alreadyHasObj" : 0,
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "nReturned" : 101,
                    "executionTimeMillisEstimate" : 0,
                    "works" : 101,
                    "advanced" : 101,
                    "needTime" : 0,
                    "needFetch" : 0,
                    "saveState" : 1,
                    "restoreState" : 1,
                    "isEOF" : 0,
                    "invalidates" : 0,
                    "keyPattern" : { "vd.ple.vd.acc" : 1, "vd.ple.vd.pid" : 1, "vd.pec" : 1, "vd.ts" : 1 },
                    "indexName" : "vd.ple.vd.acc_1_vd.ple.vd.pid_1_vd.pec_1_vd.ts_1",
                    "isMultiKey" : false,
                    "direction" : "forward",
                    "indexBounds" : {
                        "vd.ple.vd.acc" : [ "[\"EY\", \"EY\"]" ],
                        "vd.ple.vd.pid" : [ "[\"57de9139-cc7e-4b5c-51fd-aaa8517028f0\", \"57de9139-cc7e-4b5c-51fd-aaa8517028f0\"]" ],
                        "vd.pec" : [ "[2, inf.0]" ],
                        "vd.ts" : [ "[MinKey, MaxKey]" ]
                    },
                    "keysExamined" : 101,
                    "dupsTested" : 0,
                    "dupsDropped" : 0,
                    "seenInvalidated" : 0,
                    "matchTested" : 0
                }
            }
        },
        "feedback" : {
            "nfeedback" : 20,
            "averageScore" : NaN,
            "stdDevScore" : NaN,
            "scores" : [
                { "score" : NaN },
                { "score" : 1.9996768236380424 },
                { "score" : NaN },
                { "score" : 1.9996768236380424 },
                { "score" : NaN },
                { "score" : NaN },
                { "score" : 1.9996768236380424 },
                { "score" : 1.9996768236380424 },
                { "score" : NaN },
                { "score" : 1.9996768236380424 },
                { "score" : NaN },
                { "score" : 1.9996768236380424 },
                { "score" : NaN },
                { "score" : NaN },
                { "score" : 1.9996768236380424 },
                { "score" : 1.9996768236380424 },
                { "score" : NaN },
                { "score" : 1.9996768236380424 },
                { "score" : NaN },
                { "score" : NaN }
            ]
        },
        "filterSet" : false
    },
    {
        "details" : {
            "solution" : "(index-tagged expression tree: tree=Node\n---Leaf { vd.ple.vd.acc: 1, vd.ts: 1, vd.pec: 1 }, pos: 0\n---Leaf \n---Leaf { vd.ple.vd.acc: 1, vd.ts: 1, vd.pec: 1 }, pos: 2\n)"
        },
        "reason" : {
            "score" : 1.0003,
            "stats" : {
                "stage" : "FETCH",
                "filter" : { "vd.ple.vd.pid" : { "$eq" : "57de9139-cc7e-4b5c-51fd-aaa8517028f0" } },
                "nReturned" : 0,
                "executionTimeMillisEstimate" : 0,
                "works" : 101,
                "advanced" : 0,
                "needTime" : 101,
                "needFetch" : 0,
                "saveState" : 1,
                "restoreState" : 1,
                "isEOF" : 0,
                "invalidates" : 0,
                "docsExamined" : 91,
                "alreadyHasObj" : 0,
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "nReturned" : 91,
                    "executionTimeMillisEstimate" : 0,
                    "works" : 101,
                    "advanced" : 91,
                    "needTime" : 10,
                    "needFetch" : 0,
                    "saveState" : 1,
                    "restoreState" : 1,
                    "isEOF" : 0,
                    "invalidates" : 0,
                    "keyPattern" : { "vd.ple.vd.acc" : 1, "vd.ts" : 1, "vd.pec" : 1 },
                    "indexName" : "vd.ple.vd.acc_1_vd.ts_1_vd.pec_1",
                    "isMultiKey" : false,
                    "direction" : "forward",
                    "indexBounds" : {
                        "vd.ple.vd.acc" : [ "[\"EY\", \"EY\"]" ],
                        "vd.ts" : [ "[MinKey, MaxKey]" ],
                        "vd.pec" : [ "[2, inf.0]" ]
                    },
                    "keysExamined" : 101,
                    "dupsTested" : 0,
                    "dupsDropped" : 0,
                    "seenInvalidated" : 0,
                    "matchTested" : 0
                }
            }
        },
        "feedback" : { },
        "filterSet" : false
    }
]
|
Then, as you suggested, I cleared the plans:
db.getCollection('events-qos-loadstart').getPlanCache().clear();
|
Suddenly, both of the queries are executing FAST.
So, clearing out the plans seemed to have worked.
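For anyone else hitting this on 3.2, the clearing step can also be expressed as a database command so it can be scripted; note this is only a temporary mitigation, not a fix, since the next aggregation with the same shape re-populates the cache:
// Clear all cached plans for the collection (equivalent to getPlanCache().clear()).
db.runCommand({planCacheClear: "events-qos-loadstart"});

// Or clear only the offending query shape.
db.runCommand({
    planCacheClear: "events-qos-loadstart",
    query: {"vd.ple.vd.pid": "57de9139-cc7e-4b5c-51fd-aaa8517028f0",
            "vd.ple.vd.acc": "EY",
            "vd.pec": {$gte: 2}}
});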
|
|
Thomas, note that the two queries are different by:
|
|
Hi apara,
Thank you for the report. We would like to determine whether this issue is related to plan caching.
When you are experiencing these slow queries, please take the following steps; I have provided example commands that match the queries you posted previously.
- Reconfirm that the query's explain selects COUNT_SCAN and that the query executes quickly:
db.getCollection('events-qos-loadstart').explain().count({"vd.ple.vd.pid": "57de9139-cc7e-4b5c-51fd-aaa8517028f0","vd.ple.vd.acc": "EY"})
|
- Execute PlanCache.getPlansByQuery with the same query and an empty projection and sort. Please post the output in the ticket:
db.getCollection('events-qos-loadstart').getPlanCache().getPlansByQuery({"vd.ple.vd.pid": "57de9139-cc7e-4b5c-51fd-aaa8517028f0","vd.ple.vd.acc": "EY"}, {}, {})
|
- If the plan cache output shows an IXSCAN rather than a COUNT_SCAN, execute PlanCache.clear() and confirm whether the query speed improves:
db.getCollection('events-qos-loadstart').getPlanCache().clear()
|
db.getCollection('events-qos-loadstart').count({"vd.ple.vd.pid": "57de9139-cc7e-4b5c-51fd-aaa8517028f0","vd.ple.vd.acc": "EY"})
|
Regards,
Thomas
|
|
Here is the CloudManager group that this server is in:
https://cloud.mongodb.com/v2/5582d749e4b0b632f89d47c9#deployment/topology
|
|
More details can be found here: https://groups.google.com/forum/#!topic/mongodb-user/ojgOflygt5g
|