[SERVER-4961] $group is taking 2x as long as collection.group() Created: 13/Feb/12  Updated: 11/Jul/16  Resolved: 27/Aug/12

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 2.1.0
Fix Version/s: 2.2.0-rc2

Type: Bug Priority: Major - P3
Reporter: Daniel Pasette (Inactive) Assignee: Mathias Stearn
Resolution: Done Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-5795 Very Poor Performances Closed
Duplicate
is duplicated by SERVER-5361 early $group should provide a hint to... Closed
Related
is related to SERVER-447 new aggregation framework Closed
is related to SERVER-4507 aggregation: optimize $group to take... Backlog
Operating System: ALL
Participants:

 Description   

This came from a DISQUS comment on the "Aggregation Framework" page:
I'm testing MongoDB 2.1.0 in order to evaluate the performance of the new aggregation framework. I'm wondering why it's 2x slower in my use case.

Here is the code I used before version 2.1.0 (using Python and pymongo):

db.customers.group(
    {'segment': True},
    None,
    {'count': 0},
    "function (obj, prev) { prev.count++; }"
)

Here is the same computation using the new aggregation framework:

db.command('aggregate', 'customers', pipeline=[
    {'$group': {'_id': '$segment', 'count': {'$sum': 1}}}
])

On my computer with my dataset, the first version runs in ~1 s, the second version in ~2.5 s. Is it expected or am I doing something wrong?
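Both commands compute the same per-segment counts; only the execution path differs (a JS reduce function per document vs. the native $sum accumulator). A pure-Python sketch of the shared semantics, using hypothetical sample data since the ticket's dataset isn't given:

```python
from collections import Counter

# Hypothetical stand-in for the "customers" collection.
customers = [
    {"segment": "a"}, {"segment": "b"}, {"segment": "a"},
    {"segment": "c"}, {"segment": "a"}, {"segment": "b"},
]

# group(): the JS reduce function runs once per document, doing prev.count++.
group_result = {}
for doc in customers:
    key = doc["segment"]
    group_result.setdefault(key, {"segment": key, "count": 0})
    group_result[key]["count"] += 1

# $group with {'$sum': 1}: increment an accumulator per distinct _id.
agg_result = Counter(doc["segment"] for doc in customers)

assert {k: v["count"] for k, v in group_result.items()} == dict(agg_result)
```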



 Comments   
Comment by Mathias Stearn [ 27/Aug/12 ]

There were several optimizations to aggregation before 2.2.0. I just did a test on my machine with 5 million docs, each with a random segment from 0 to 999 and it took 4 seconds with aggregate and 24 seconds with the js-based group. Please file a new ticket if you can find a case of performance regression using the 2.2 rc or final release.
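The test setup Mathias describes can be sketched as follows. This is a pure-Python, scaled-down stand-in (the actual benchmark ran against a live 2.2 server); the doc count and seed here are illustrative choices, not from the ticket:

```python
import random
from collections import Counter

random.seed(42)        # deterministic for illustration
NUM_DOCS = 50_000      # scaled down from the 5 million used in the test
NUM_SEGMENTS = 1_000   # "a random segment from 0 to 999"

docs = [{"segment": random.randrange(NUM_SEGMENTS)} for _ in range(NUM_DOCS)]

# The equivalent of {$group: {_id: "$segment", count: {$sum: 1}}}.
counts = Counter(doc["segment"] for doc in docs)

# Every document lands in exactly one group.
assert sum(counts.values()) == NUM_DOCS
assert len(counts) <= NUM_SEGMENTS
```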

Comment by Samuel García Martínez [ 02/Apr/12 ]

@Chris Westin: sorry, you are totally right. My comments are off-topic, but I was trying to make a point about grouping optimization.

Thanks for creating the related issue.

Comment by Chris Westin [ 02/Apr/12 ]

@Samuel Garcia Martinez: that's a different optimization than the one this ticket is about. For your suggestion, I've opened SERVER-5477 . This ticket is about being able to scan an index if it contains a prefix of the $group _id; that could be any index, not necessarily related to the shard key.
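The index-scan optimization this ticket is about can be sketched in Python: if documents arrive in index order on the group key, each group is contiguous, so $group can emit a result as soon as the key changes instead of maintaining a hash table over all distinct keys. This is a hypothetical illustration of the idea, not the server's implementation:

```python
from itertools import groupby

# Documents as an index scan on {segment: 1} would yield them:
# already sorted by the group key, so each group is contiguous.
index_ordered = [
    {"segment": "a"}, {"segment": "a"},
    {"segment": "b"},
    {"segment": "c"}, {"segment": "c"}, {"segment": "c"},
]

# Streaming $group: one pass, one accumulator live at a time,
# results emitted as each key ends.
results = [
    {"_id": key, "count": sum(1 for _ in docs)}
    for key, docs in groupby(index_ordered, key=lambda d: d["segment"])
]
```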

Comment by Samuel García Martínez [ 02/Apr/12 ]

Adding more info to my previous comment:

mongos> function testGroup() {
... var date = new Date();
... db.mycollection.aggregate({$match: { date: { $gte: ISODate('2010-09-01T00:00:00Z'), $lt: ISODate('2010-10-01T00:00:00Z')}}}, {$group: {_id:{query_hash:1},totalCount:{$sum:"$queryCount"}} },{$skip:1000},{$limit:10});
... print('Time ' + (new Date() - date));
... }
mongos> testGroup();
Time 2399

Adding log info from shards:

  • shard1:

    Mon Apr  2 11:48:03 [conn3] command mydb.$cmd command: { aggregate: "mycollection", pipeline: [ { $match: { date: { $gte: new Date(1283299200000), $lt: new Date(1285891200000) } } }, { $group: { _id: { query_hash: true }, totalCount: { $sum: "$queryCount" } } } ], fromRouter: true } ntoreturn:1 keyUpdates:0 reslen:3150640 1107ms

  • shard2:

    Mon Apr  2 11:48:45 [conn63] command mydb.$cmd command: { aggregate: "mycollection", pipeline: [ { $match: { date: { $gte: new Date(1283299200000), $lt: new Date(1285891200000) } } }, { $group: { _id: { query_hash: true }, totalCount: { $sum: "$queryCount" } } } ], fromRouter: true } ntoreturn:1 keyUpdates:0 reslen:3478690 1506ms

I think it is important to note that the slower shard takes ~1.5 s to answer; however, mongos takes an additional second to build the result, even though no merge sort operation is involved.

Comment by Samuel García Martínez [ 02/Apr/12 ]

In a sharded environment with early grouping, besides using an index, it would be nice to be able to avoid the mongos regrouping process.

I'll try to explain that:

  • result_node1: [ { id: "value1", totalcount: 50 }, { id: "value2", totalcount: 100 } ]

  • result_node2: [ { id: "value1", totalcount: 60 } ]

The real results (after the mongos regroup) must look like:

[ { id: "value1", totalcount: 110 }, { id: "value2", totalcount: 100 } ]

But in some cases the mongos regrouping process is pointless, since the grouping key is the same as the sharding key, so the same group key can never come from different shards.

So, in that case, the prior example looks like:

  • result_node1: [ { id: "value1", totalcount: 110 } ]

  • result_node2: [ { id: "value2", totalcount: 100 } ]

The real results must look like:

[ { id: "value1", totalcount: 110 }, { id: "value2", totalcount: 100 } ]

So the point is that the mongos regrouping process is a waste of time when you group by the same key as the sharding key.
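The regrouping step described above can be sketched as a merge of per-shard partial results. When the group key differs from the shard key, the same id can appear on several shards and the merge really has to re-sum; when they coincide, the per-shard id sets are disjoint and the merge changes nothing. A hypothetical helper, not mongos code:

```python
from collections import defaultdict

def mongos_regroup(shard_results):
    """Merge per-shard partial groups by summing totalcount per id."""
    merged = defaultdict(int)
    for partial in shard_results:
        for row in partial:
            merged[row["id"]] += row["totalcount"]
    return [{"id": k, "totalcount": v} for k, v in sorted(merged.items())]

# Group key != shard key: "value1" appears on both shards, so re-summing
# is genuinely needed.
overlapping = [
    [{"id": "value1", "totalcount": 50}, {"id": "value2", "totalcount": 100}],
    [{"id": "value1", "totalcount": 60}],
]
assert mongos_regroup(overlapping) == [
    {"id": "value1", "totalcount": 110},
    {"id": "value2", "totalcount": 100},
]

# Group key == shard key: per-shard ids are disjoint, so the merge is a
# no-op -- concatenating the shard results would have sufficed.
disjoint = [
    [{"id": "value1", "totalcount": 110}],
    [{"id": "value2", "totalcount": 100}],
]
assert mongos_regroup(disjoint) == [row for shard in disjoint for row in shard]
```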

Comment by Chris Westin [ 13/Feb/12 ]

To the original poster: is there an index on "segment"?

Generated at Thu Feb 08 03:07:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.