[SERVER-4566] In new aggregation framework, $sort, $limit in the pipeline seems loading all the matched data into memory. When we tried to improve the performance by leveraging multi thread aggregation, this makes it much slower than single thread Created: 27/Dec/11  Updated: 11/Jul/16  Resolved: 30/Jan/12

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 2.1.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Xiaofeng Wu Assignee: Chris Westin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OSX 10.6, Java, mongodb is built from Dec. 15 master branch.


Issue Links:
Depends
depends on SERVER-3832 aggregation: early $sort should be o... Closed
is depended on by SERVER-4507 aggregation: optimize $group to take... Backlog
Operating System: OS X
Participants:

 Description   

Overall new aggregation framework is faster than map-reduce in our application. But when I tried to further improve the performance by sending multiple aggregation commands in multiple threads, then get the final aggregation done in app space, I noticed each thread will load lot of data to memory which makes each command take minutes to finish. (I do have index on sort key)



 Comments   
Comment by Chris Westin [ 30/Jan/12 ]

Short-term fix was SERVER-3832; other improvements listed are further off in the future.

Comment by Chris Westin [ 29/Dec/11 ]

No, you want to specify the sort, but the optimization I'm referring to will convert that into an index scan, so the data won't be loaded into memory to be sorted.

Given the two indexes you have, they will compete over the predicate on sId vs the sort on intId. The predicate will want to use the composite index, while the sort will want to use the plain intId index. At present, it seems like the optimizer will choose the composite index, which means you'll still have to sort all the data in order to do it this way. If you have multiple $sort/$skip/$limit threads running in parallel, they will all do that. You may have to wait for $out to collect the output of the sort after it has been done once, and then use that as the input for the next stage.

For this particular case, there's a different optimization that seems like it would be a better idea. If there were a composite index on <sId, aid>, then we could take advantage of SERVER-4507 (but note that's further off in the future).

Comment by Xiaofeng Wu [ 29/Dec/11 ]

Thanks for your explanation. I have a composite index for sId and intId, and another index for intId. Can I just skip $sort?

Comment by Chris Westin [ 28/Dec/11 ]

This is not surprising, a sort requires all the data in order to be carried out. However, if you have the required indexes, that step can be skipped, and the optimization described by SERVER-3832 (linked-to above) will come into play. That's not very difficult, but I haven't had a chance to finish it yet. It should be done in the next week or two, barring interruptions.

Just to be sure that optimization will apply in this case, can you tell me what indexes you've got on the surveyResponse collection?

Comment by Xiaofeng Wu [ 28/Dec/11 ]

I have tested with 3 threads, 4 threads.

Comment by Eliot Horowitz (Inactive) [ 28/Dec/11 ]

How many were you running in parallel?

Comment by Xiaofeng Wu [ 28/Dec/11 ]

full command for multi thread aggregation:

db.runCommand( { aggregate : "surveyResponse", pipeline : [
{ $match : { sId: "4ed52f6601bf2abf47697b8c", excluded: 0, $nor: [

{ status: -99 }

,

{ status: -50 }

,

{ status: -60 }

] } },
{ $sort: {intId : 1}},

{ $skip: 2000}

,

{ $limit:1000}

,
{ $project: {
responseId : "$_id",
intId:1,
status:1,
"questions":1}},

{ $unwind : "$questions"}

,

{ $unwind : "$questions.answers"}

,
{ $group : { _id:"$aid",
count : {$sum : 1},
}}
]});

full command without using multi thread:

db.runCommand( { aggregate : "surveyResponse", pipeline : [
{ $match : { sId: "4ed52f6601bf2abf47697b8c", excluded: 0, $nor: [

{ status: -99 }

,

{ status: -50 }

,

{ status: -60 }

] } },
{ $project: {
responseId : "$_id",
intId:1,
status:1,
"questions":1}},

{ $unwind : "$questions"}

,

{ $unwind : "$questions.answers"}

,
{ $group : { _id:"$aid",
count : {$sum : 1},
}}
]});

Comment by Eliot Horowitz (Inactive) [ 28/Dec/11 ]

Can you send the full command?
What you have before doesn't need map/reduce or pipeline.

Comment by Xiaofeng Wu [ 27/Dec/11 ]

Here're the commands we generated in different threads:
db.runCommand( { aggregate : "myCollection", pipeline : [
{ $match : { sId: "4ed52f6601bf2abf47697b8c", excluded: 0, $nor: [

{ status: -99 }

,

{ status: -50 }

,

{ status: -60 }

] } },
{ $sort: {intId : 1}},

{ $limit:1000}

,
{ $project: {...}},
..]});

db.runCommand( { aggregate : "myCollection", pipeline : [
{ $match : { sId: "4ed52f6601bf2abf47697b8c", excluded: 0, $nor: [

{ status: -99 }

,

{ status: -50 }

,

{ status: -60 }

] } },
{ $sort: {intId : 1}},

{ $skip: 1000}

,

{ $limit:1000}

,
{ $project: {...}},
..]});

db.runCommand( { aggregate : "myCollection", pipeline : [
{ $match : { sId: "4ed52f6601bf2abf47697b8c", excluded: 0, $nor: [

{ status: -99 }

,

{ status: -50 }

,

{ status: -60 }

] } },
{ $sort: {intId : 1}},

{ $skip: 2000}

,

{ $limit:1000}

,
{ $project: {...}},
..]});

Generated at Thu Feb 08 03:06:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.