[SERVER-15291] slow '$group' performance Created: 17/Sep/14  Updated: 26/Sep/14  Resolved: 26/Sep/14

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 2.6.1
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Baruch Oxman Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-11447 aggregation can sort using index to s... Closed
Related
related to SERVER-9507 Optimize $sort+$group+$first pipeline... Closed
Participants:

 Description   

Seems that some $group queries that can be made using index-only scans, are performing unnecessary full table scans.

Here are more details, As I originally described in http://stackoverflow.com/questions/24980525/mongodb-slow-group-performance.

I have a MongoDB collection of over 1,000,000 records. Each record size is around 20K (so the total collection size is around 20GB).

I have a 'type' field in the collection (that can have around 10 different values). I would like to get the per-type counters for the collection. Also, there is an index on the 'type' field.

I've tested two different approaches (assume python syntax):

  • A naive method - using 'count' calls for each of the values:

    for type_val in my_db.my_colc.distinct('type'):
        counters[type_val] = my_db.my_colc.find({'type' : type_val}).count()

  • Using aggregation framework with a '$group' syntax:

    counters = my_db.my_colc.aggregate([{'$group' :  {'_id': '$type', 'agg_val': { '$sum': 1 } }}])

The performance I'm receiving for the first approach is about 2 orders of magnitude faster than for the 2nd approach. Seems to be related to the fact that count runs on the indices only, without accessing the documents, while $group has to go over the documents one-by-one. (It's about 1min vs. 45mins).

Is there any way to run an efficient grouping query on the 'type' index, that would use only the index, thus achieving the performance results from #1, but using the aggregation framework ?



 Comments   
Comment by Ramon Fernandez Marina [ 26/Sep/14 ]

Sorry, that was a typo, thanks for catching that baruchoxman. The right ticket is SERVER-11447, as listed in the "Issue Links" section. I've edited my previous comment as well to avoid further confusion.

Comment by Baruch Oxman [ 26/Sep/14 ]

Raul, are you use this is the correct duplicate ticket number ?
The title for SERVER-11477 is "RHEL 5.7 durability - closeall.js failure"...

Comment by Ramon Fernandez Marina [ 26/Sep/14 ]

baruchoxman, this ticket is a duplicate of SERVER-11447. Feel free to tune in there for updates.

Generated at Thu Feb 08 03:37:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.