[SERVER-4961] $group is taking 2x as long as collection.group()

Created: 13/Feb/12  Updated: 11/Jul/16  Resolved: 27/Aug/12

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | 2.1.0 |
| Fix Version/s: | 2.2.0-rc2 |
| Type: | Bug |
| Priority: | Major - P3 |
| Reporter: | Daniel Pasette (Inactive) |
| Assignee: | Mathias Stearn |
| Resolution: | Done |
| Votes: | 2 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Description |
This came from a DISQUS comment on the "Aggregation Framework" page:

Here is the code I used before version 2.1.0 (using Python and pymongo):

```python
db.customers.group(
    {'segment': True},                          # key to group on
    None,                                       # no condition (match all docs)
    {'count': 0},                               # initial accumulator
    "function (obj, prev) { prev.count++; }"    # JS reduce function
)
```

Here is the same computation using the new aggregation framework:

```python
db.command('aggregate', 'customers', pipeline=[
    {'$group': {'_id': '$segment', 'count': {'$sum': 1}}}
])
```

On my computer with my dataset, the first version runs in ~1 s, the second in ~2.5 s. Is it expected or am I doing something wrong?
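For reference, a minimal sketch of the same query through pymongo's Collection.aggregate() helper, which later driver versions provide as a wrapper around the raw aggregate command (the connection URI and 'test' database name are assumptions for illustration):

```python
from pymongo import MongoClient

# Assumed local server and database name, for illustration only.
client = MongoClient('mongodb://localhost:27017')
db = client['test']

# Same $group pipeline as the db.command('aggregate', ...) call above;
# in modern pymongo, aggregate() returns an iterable cursor of results.
for doc in db.customers.aggregate([
    {'$group': {'_id': '$segment', 'count': {'$sum': 1}}}
]):
    print(doc)
```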
| Comments |
| Comment by Mathias Stearn [ 27/Aug/12 ] |

There were several optimizations to aggregation before 2.2.0. I just did a test on my machine with 5 million docs, each with a random segment from 0 to 999, and it took 4 seconds with aggregate and 24 seconds with the JS-based group. Please file a new ticket if you can find a case of performance regression using the 2.2 rc or final release.
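A minimal sketch of a comparable benchmark under the same assumptions (5 million documents, random segment from 0 to 999; the connection URI, database name, and use of the modern pymongo API are assumptions, not part of the original test):

```python
import random
import time
from pymongo import MongoClient

coll = MongoClient('mongodb://localhost:27017')['test']['customers']

# Build a dataset like the one described above: 5 million docs,
# each with a random 'segment' between 0 and 999.
coll.drop()
coll.insert_many(
    ({'segment': random.randint(0, 999)} for _ in range(5_000_000)),
    ordered=False,
)

# Time the aggregation-framework version of the count-per-segment query.
start = time.monotonic()
list(coll.aggregate([{'$group': {'_id': '$segment', 'count': {'$sum': 1}}}]))
print('aggregate took', time.monotonic() - start, 'seconds')
```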
| Comment by Samuel García Martínez [ 02/Apr/12 ] |

@Chris Westin: sorry, you are totally right. My comments are off-topic, but I was trying to make a point about grouping optimization. Thanks for creating the related issue.
| Comment by Chris Westin [ 02/Apr/12 ] |

@Samuel García Martínez: that's a different optimization than the one this ticket is about. For your suggestion, I've opened SERVER-5477. This ticket is about being able to scan an index if it contains a prefix of the $group _id; that could be any index, not necessarily one related to the shard key.
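To make the idea concrete, here is a sketch of the shape such a plan could exploit; the index and the explicit $sort are illustrative assumptions, not something the 2.2 server is documented to apply automatically:

```python
from pymongo import MongoClient

coll = MongoClient('mongodb://localhost:27017')['test']['customers']

# An index whose prefix matches the $group _id, as discussed above.
coll.create_index([('segment', 1)])

# Sorting on the indexed field delivers documents with equal 'segment'
# values contiguously; an optimizer scanning the index could then emit
# each group as soon as the key changes, instead of hashing all groups.
pipeline = [
    {'$sort': {'segment': 1}},
    {'$group': {'_id': '$segment', 'count': {'$sum': 1}}},
]
for doc in coll.aggregate(pipeline):
    print(doc)
```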
| Comment by Samuel García Martínez [ 02/Apr/12 ] |

Adding more info to my previous comment:

Adding log info from shards:

I think it is important to note that the slower shard takes ~1.5 s to answer; however, mongos takes an additional second to build the result, even without a merge-sort operation.
| Comment by Samuel García Martínez [ 02/Apr/12 ] |

In a sharded environment with early (per-shard) grouping, besides the use of an index it would be nice to be able to avoid the mongos regrouping step. I'll try to explain:

The real results (after the mongos regroup) should look like: …, { id: "value2", totalcount: 100 }]

But in some cases the mongos regrouping step is unnecessary, because the grouping key is the same as the shard key, so the same group key can never come from two different shards. The prior example then looks like:

The real results should look like: …, { id: "value2", totalcount: 100 }]

So the point is that the mongos regrouping step is a waste of time when you group on the same key the collection is sharded on.
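A sketch of the setup being described, with hypothetical database and collection names: once the collection is sharded on the field used as the $group _id, each shard's partial groups are disjoint, so the merge at mongos cannot combine anything across shards:

```python
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')  # assumed mongos address

# Shard 'test.customers' on 'segment' (names are illustrative).
client.admin.command('enableSharding', 'test')
client.admin.command('shardCollection', 'test.customers', key={'segment': 1})

# Group on the shard key: every 'segment' value lives on exactly one
# shard, so each shard's partial group is already final and the
# regrouping step at mongos is pure overhead.
pipeline = [{'$group': {'_id': '$segment', 'totalcount': {'$sum': 1}}}]
for doc in client['test']['customers'].aggregate(pipeline):
    print(doc)
```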
| Comment by Chris Westin [ 13/Feb/12 ] |

To the original poster: is there an index on "segment"?
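For anyone following along, a minimal sketch of how to check for (and create) that index from pymongo, reusing the collection from the description (connection details assumed):

```python
from pymongo import MongoClient

coll = MongoClient('mongodb://localhost:27017')['test']['customers']

# List existing indexes to see whether 'segment' is already covered.
print(coll.index_information())

# Create the ascending index if it is missing (creation is idempotent).
coll.create_index([('segment', 1)])
```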