Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-16544

mapReduce very slow with sort on emit key

    XMLWordPrintableJSON

Details

    • Icon: Bug Bug
    • Resolution: Cannot Reproduce
    • Icon: Critical - P2 Critical - P2
    • None
    • 2.6.5
    • MapReduce, Performance
    • Debian, MongoDB version: 2.6.5
    • Linux
    • Hide

      1) on mongo shell paste following to define a js function to output test documents with n fields

      function TestDoc (n) {
      	var doc={}; 
      	doc['lang']=['da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'nl', 'pt', 'ro', 'ru', 'sv', 'tr'][Math.floor(Math.random() * 13)]
      	for (var i = 0;
      	i < n; i++) { doc['fld_'+i]=Math.random().toString(34).slice(2)} 
      	return doc;
      }

      2) define a function to insert nDocs documents with nFields to a collection

      function InsertTestDocs (colName, nDocs, nFields) { 
      	for (var i = 0;
      	i < nNocs; i++) { doc=TestDoc(nFields); doc['_id']=i;
      	db[colName].insert(doc)
      	}
      }

      3) insert 1000000 test documents

      insertTestDocs("tmp_col", 1000000, 100)

      4) create index on 'lang' field

      db.tmp_col.ensureIndex({lang: 1})

      5) run a mapReduce job that simply sums distincts on lang field

      db.runCommand({ mapreduce: "tmp_col", map: function () { emit(this.lang, 1); }, reduce: function (key, values) { return Array.sum(values); }, out: { inline: 1 }})

      6) you get results of the following form

      "timeMillis" : 116705,
      	"counts" : {
      		"input" : 1000000,
      		"emit" : 1000000,
      		"reduce" : 65000,
      		"output" : 13
      	},
      	"ok" : 1

      7) run same map Reduce except specify a sort at this time

      db.runCommand({ mapreduce: "tmp_col", map: function () { emit(this.lang, 1); }, reduce: function (key, values) { return Array.sum(values); }, sort:{lang:1}, out: { inline: 1 }})

      7) you get following results

      "timeMillis" : 1478708,
      	"counts" : {
      		"input" : 1000000,
      		"emit" : 1000000,
      		"reduce" : 8474,
      		"output" : 13
      	},
      	"ok" : 1

      Notice that now it takes 1478708 instead of 116705 (that is ~10X slower) when run without sort option

      Show
      1) on mongo shell paste following to define a js function to output test documents with n fields function TestDoc (n) { var doc={}; doc['lang']=['da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'nl', 'pt', 'ro', 'ru', 'sv', 'tr'][Math.floor(Math.random() * 13)] for (var i = 0; i < n; i++) { doc['fld_'+i]=Math.random().toString(34).slice(2)} return doc; } 2) define a function to insert nDocs documents with nFields to a collection function InsertTestDocs (colName, nDocs, nFields) { for (var i = 0; i < nNocs; i++) { doc=TestDoc(nFields); doc['_id']=i; db[colName].insert(doc) } } 3) insert 1000000 test documents insertTestDocs("tmp_col", 1000000, 100) 4) create index on 'lang' field db.tmp_col.ensureIndex({lang: 1}) 5) run a mapReduce job that simply sums distincts on lang field db.runCommand({ mapreduce: "tmp_col", map: function () { emit(this.lang, 1); }, reduce: function (key, values) { return Array.sum(values); }, out: { inline: 1 }}) 6) you get results of the following form "timeMillis" : 116705, "counts" : { "input" : 1000000, "emit" : 1000000, "reduce" : 65000, "output" : 13 }, "ok" : 1 7) run same map Reduce except specify a sort at this time db.runCommand({ mapreduce: "tmp_col", map: function () { emit(this.lang, 1); }, reduce: function (key, values) { return Array.sum(values); }, sort:{lang:1}, out: { inline: 1 }}) 7) you get following results "timeMillis" : 1478708, "counts" : { "input" : 1000000, "emit" : 1000000, "reduce" : 8474, "output" : 13 }, "ok" : 1 Notice that now it takes 1478708 instead of 116705 (that is ~10X slower) when run without sort option

    Description

      Map Reduce operations become very slow (> 1 order of magnitude slower) when run with sort option on emit field.
      This is contrary to documentation. that states quite the oposite.
      Kindly note:
      1. that the delay is somehow proportional to number of fields on document and/or document complexity.
      2. Also does not seem to be affected if map reduce outputs inline or in a collection.
      3. I get about same results on a stand alone mongoDB and on a replicated one.

      Attachments

        Activity

          People

            ramon.fernandez@mongodb.com Ramon Fernandez Marina
            nickmilon Nick Milonakis
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: