Core Server / SERVER-15818

mapReduce sort on different field than query is too slow

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 2.6.0
    • Affects Version/s: 2.4.12
    • Component/s: MapReduce
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

      1. Create test data with two different value sets.

      for (var i = 0; i < 1000000; ++i) {
          db.uniques.insert({
              dim0: Math.floor(Math.random() * 1000000),
              dim1: Math.floor(Math.random() * 1000000)
          });
      }
      db.uniques.ensureIndex({ dim0: 1 });
      db.uniques.ensureIndex({ dim1: 1 });
      db.uniques.count({ dim1: { $gte: 500000, $lte: 501000 } }); // should be approx. 1,000 records
      

      2. Run a mapReduce that should be quite quick:

      db.runCommand({
          mapreduce: "uniques",
          map: function () { emit(this.dim0, 1); },
          reduce: function (key, values) { return Array.sum(values); },
          out: "mrout",
          query: { dim1: { $gte: 500000, $lte: 501000 } }
      })
      // (nearly instant)
      

      3. Add a sort on the same field you are filtering on:

      db.runCommand({
          mapreduce: "uniques",
          map: function () { emit(this.dim0, 1); },
          reduce: function (key, values) { return Array.sum(values); },
          out: "mrout",
          sort: { dim1: 1 },
          query: { dim1: { $gte: 500000, $lte: 501000 } }
      })
      // (also nearly instant, even though dim1 is the wrong sort key for our map function, which emits dim0)
      

      4. Add a sort on a different field than the query field:

      db.runCommand({
          mapreduce: "uniques",
          map: function () { emit(this.dim0, 1); },
          reduce: function (key, values) { return Array.sum(values); },
          out: "mrout",
          sort: { dim0: 1 },
          query: { dim1: { $gte: 500000, $lte: 501000 } }
      })
      // (orders of magnitude slower!)
      
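      One way to see what the query planner is doing in step 4 is to explain() the equivalent find. This is a diagnostic sketch added for illustration, not part of the original report; it assumes a live mongod with the uniques collection and the two indexes created in step 1.

      ```javascript
      // Diagnostic sketch (assumes the "uniques" collection and the {dim0: 1}
      // and {dim1: 1} indexes from step 1; not from the original report).
      db.uniques.find({ dim1: { $gte: 500000, $lte: 501000 } })
                .sort({ dim0: 1 })
                .explain()
      // If the plan walks the {dim0: 1} index to satisfy the sort, it must scan
      // on the order of 1,000,000 index entries and filter them down to the
      // ~1,000 matches, instead of fetching the ~1,000 matches via {dim1: 1}
      // and sorting that small set in memory.
      ```

      Comparing the chosen cursor and the nscanned count between the sorted and unsorted forms of the query makes the extra work visible.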

      When performing a mapReduce over a large collection, where the query constrains on a different field than the sort field, the sort adds orders of magnitude more time to the operation.

      (This occurred in production with a date-bound filter on a large collection that was sorted by a text field.)
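      One possible workaround (my assumption, not something proposed in this ticket) is to materialize the filtered subset first, so the sorted mapReduce only touches the ~1,000 matching documents rather than all 1,000,000; the temporary collection name uniques_tmp is hypothetical.

      ```javascript
      // Workaround sketch (hypothetical; "uniques_tmp" is an illustrative name).
      // 1. Copy only the matching documents into a small temporary collection.
      db.uniques.find({ dim1: { $gte: 500000, $lte: 501000 } }).forEach(function (doc) {
          db.uniques_tmp.insert(doc);
      });
      db.uniques_tmp.ensureIndex({ dim0: 1 });
      // 2. Run the sorted mapReduce over the small subset.
      db.runCommand({
          mapreduce: "uniques_tmp",
          map: function () { emit(this.dim0, 1); },
          reduce: function (key, values) { return Array.sum(values); },
          out: "mrout",
          sort: { dim0: 1 }
      });
      db.uniques_tmp.drop();
      ```

      This trades an extra copy of the ~1,000 matching documents for a sort that can use a small index, which may be acceptable when the filter is far more selective than the collection size.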

            Assignee: Unassigned
            Reporter: Will Shaver (wshaver)
            Votes: 0
            Watchers: 4

              Created:
              Updated:
              Resolved: