[SERVER-15818] mapReduce sort on different field than query is too slow
Created: 27/Oct/14  Updated: 03/Dec/14  Resolved: 03/Dec/14

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 2.4.12
Fix Version/s: 2.6.0

Type: Bug
Priority: Major - P3
Reporter: Will Shaver
Assignee: Unassigned
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

1. Create Test Data with 2 different value sets.

for (var i = 0; i < 1000000; ++i){ db.uniques.insert({ dim0: Math.floor(Math.random()*1000000), dim1: Math.floor(Math.random()*1000000) });}
db.uniques.ensureIndex({dim0: 1})
db.uniques.ensureIndex({dim1: 1})
db.uniques.count({dim1: {$gte:  500000, $lte: 501000} }); //should be approx 1000 records

2. Do a map reduce that should be rather quick:

db.runCommand({ mapreduce: "uniques", map: function () { emit(this.dim0, 1); }, reduce: function (key, values) { return Array.sum(values); }, out: "mrout", query: {dim1: {$gte: 500000, $lte: 501000} } } )
// (nearly instant)

3. Add a sort on the same field you are filtering on:

db.runCommand({ mapreduce: "uniques", map: function () { emit(this.dim0, 1); }, reduce: function (key, values) { return Array.sum(values); }, out: "mrout", sort: {dim1: 1}, query: {dim1: {$gte: 500000, $lte: 501000} } } )
// (Also runs nearly instantly, even though this sort order does not match the key emitted by our map function!)

4. Add a sort on a different field

db.runCommand({ mapreduce: "uniques", map: function () { emit(this.dim0, 1); }, reduce: function (key, values) { return Array.sum(values); }, out: "mrout", sort: {dim0: 1}, query: {dim1: {$gte: 500000, $lte: 501000} } } )
// (Orders of magnitude slower!)
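
The three commands in steps 2-4 differ only in their sort option, so the comparison can be scripted. The following is a minimal sketch, not part of the original report: buildMapReduceCommand and the cases list are hypothetical names introduced here, and the timing loop only runs inside the mongo shell, where db and print are defined.

```javascript
// Hypothetical helper (not from the ticket): builds the mapReduce command
// used in steps 2-4, varying only the optional sort specification.
function buildMapReduceCommand(sortSpec) {
  var cmd = {
    mapreduce: "uniques", // the command name must remain the first key
    map: function () { emit(this.dim0, 1); },
    reduce: function (key, values) { return Array.sum(values); },
    out: "mrout",
    query: { dim1: { $gte: 500000, $lte: 501000 } }
  };
  if (sortSpec) {
    cmd.sort = sortSpec;
  }
  return cmd;
}

// The three variants from the steps to reproduce.
var cases = [
  { label: "no sort", sort: null },
  { label: "sort on dim1 (query field)", sort: { dim1: 1 } },
  { label: "sort on dim0 (other field)", sort: { dim0: 1 } }
];

// Only executes inside the mongo shell; elsewhere `db` is undefined.
if (typeof db !== "undefined") {
  cases.forEach(function (c) {
    var res = db.runCommand(buildMapReduceCommand(c.sort));
    print(c.label + ": " + res.timeMillis + " ms");
  });
}
```

Each run overwrites the mrout collection, so the timings compare the same query under the three sort settings.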

 Description   

When performing a mapReduce operation over a large dataset with a query that filters on one field and a sort on a different field, the sort makes the operation orders of magnitude slower.

(This occurred in production with a date-bound filter on a large collection that was sorted by a text field.)



 Comments   
Comment by Ramon Fernandez Marina [ 03/Dec/14 ]

Thanks for your report, wshaver. I do see the behavior you describe in MongoDB 2.4.12, but this issue has been addressed in the 2.6/2.8 series. Here's the output of the three mapReduce operations with 2.6.5:

{
        "result" : "mrout",
        "timeMillis" : 53,
        "counts" : {
                "input" : 965,
                "emit" : 965,
                "reduce" : 0,
                "output" : 965
        },
        "ok" : 1
}
{
        "result" : "mrout",
        "timeMillis" : 28,
        "counts" : {
                "input" : 965,
                "emit" : 965,
                "reduce" : 0,
                "output" : 965
        },
        "ok" : 1
}
{
        "result" : "mrout",
        "timeMillis" : 27,
        "counts" : {
                "input" : 965,
                "emit" : 965,
                "reduce" : 0,
                "output" : 965
        },
        "ok" : 1
}
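
To make the comparison explicit, the timeMillis values reported for 2.6.5 can be expressed relative to the first run. This is a small sketch rather than anything from the original comment: slowdownVersusBaseline is a hypothetical helper, and the labels assume the three results are listed in the same order as steps 2-4.

```javascript
// timeMillis values copied from the 2.6.5 results (counts omitted).
// Assumption: the results appear in the same order as steps 2-4.
var runs = [
  { label: "no sort", timeMillis: 53 },
  { label: "sort on dim1", timeMillis: 28 },
  { label: "sort on dim0", timeMillis: 27 }
];

// Hypothetical helper: ratio of each run's time to the first (baseline) run.
function slowdownVersusBaseline(results) {
  var base = results[0].timeMillis;
  return results.map(function (r) {
    return { label: r.label, ratio: r.timeMillis / base };
  });
}

// Use the mongo shell's print when available, console.log otherwise.
var log = (typeof print === "function") ? print : console.log;

slowdownVersusBaseline(runs).forEach(function (r) {
  log(r.label + ": " + r.ratio.toFixed(2) + "x baseline");
});
```

With these numbers neither sorted run is slower than the unsorted one, which matches the statement that the slowdown is gone in 2.6.5.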

Comment by Will Shaver [ 27/Oct/14 ]

Note: I tried to clean up the steps to reproduce with better markup but gave up when everything I tried made it worse. Maybe have your help system lead with the most obvious tip first: mark up your code like THIS.
