[SERVER-4876] Map reduce with option "replace" is reducing instead Created: 06/Feb/12  Updated: 15/Aug/12  Resolved: 16/Mar/12

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 2.0.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Grégoire Seux Assignee: Antoine Girbal
Resolution: Cannot Reproduce Votes: 0
Labels: mapreduce, options
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: linux centos

Operating System: ALL
Participants:

 Description   

When running map reduce over a large collection (several million documents) with the output option set to "replace", the replacement does not appear to be atomic: the job seems to "reduce" into the existing output collection instead.

I use a map reduce operation to find duplicates (based on one field) in a sharded environment.
The input collection has several million documents, and so does the output (they should have the same number of elements, because in theory there should not be any duplicates).

However, if I rerun the map reduce (using the replace output option from the mongodb shell), a lot of false positives are found (~800 of 17 million documents are counted twice).
If I drop the output collection before re-running the map reduce, no duplicates are found.

function mapDoublonsSqlId() {
    emit({p : this.partnerId, id : this.sqlId}, 1);
}

function reduceDoublonsSqlId(key, values) {
    var total = 0;
    values.forEach(function(o) { total += o; });
    return total;
}

db.runCommand({mapreduce : "products", map : mapDoublonsSqlId, reduce : reduceDoublonsSqlId, out : {replace : "tmp"}})
db.tmp.count({value : {$gt : 1}}) //ok no duplicates

db.runCommand({mapreduce : "products", map : mapDoublonsSqlId, reduce : reduceDoublonsSqlId, out : {replace : "tmp"}})
db.tmp.count({value : {$gt : 1}}) //oho here is the issue, a lot of false duplicates are displayed

db.tmp.drop()
db.runCommand({mapreduce : "products", map : mapDoublonsSqlId, reduce : reduceDoublonsSqlId, out : {replace : "tmp"}})
db.tmp.count({value : {$gt : 1}}) //ok no duplicates any more

It seems that the replace does not work as expected.
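
To make the suspected behaviour concrete, here is a minimal in-memory sketch of the difference between a "replace" and a "reduce" output mode. It is a toy model written for this ticket, not MongoDB's implementation: `runMapReduce`, `countFlagged` and the three sample documents are hypothetical names and data, and the model flags every key on a second "reduce" run, whereas the real report only saw about 800 keys affected.

```javascript
// Toy model of map/reduce output modes (assumption: illustrative only,
// this is not how the server implements them).

// Same key shape and emitted value as mapDoublonsSqlId in the report.
function mapDoc(doc) {
  return { key: JSON.stringify({ p: doc.partnerId, id: doc.sqlId }), value: 1 };
}

// Same summing logic as reduceDoublonsSqlId in the report.
function reduceValues(key, values) {
  var total = 0;
  values.forEach(function (v) { total += v; });
  return total;
}

// mode "replace": the result of this run replaces the previous output.
// mode "reduce": the result is reduced together with the previous output.
function runMapReduce(docs, existingOutput, mode) {
  var groups = {};
  docs.forEach(function (doc) {
    var emitted = mapDoc(doc);
    if (!Object.prototype.hasOwnProperty.call(groups, emitted.key)) {
      groups[emitted.key] = [];
    }
    groups[emitted.key].push(emitted.value);
  });

  var fresh = {};
  Object.keys(groups).forEach(function (k) {
    // A key with a single emitted value is stored without calling reduce.
    fresh[k] = groups[k].length === 1 ? groups[k][0] : reduceValues(k, groups[k]);
  });
  if (mode === "replace") {
    return fresh;
  }

  var merged = {};
  Object.keys(existingOutput).forEach(function (k) { merged[k] = existingOutput[k]; });
  Object.keys(fresh).forEach(function (k) {
    merged[k] = Object.prototype.hasOwnProperty.call(merged, k)
      ? reduceValues(k, [merged[k], fresh[k]])
      : fresh[k];
  });
  return merged;
}

// Equivalent of db.tmp.count({value : {$gt : 1}}) on the toy output.
function countFlagged(output) {
  return Object.keys(output).filter(function (k) { return output[k] > 1; }).length;
}

// Hypothetical sample data: three products, no duplicate (partnerId, sqlId).
var products = [
  { partnerId: 1, sqlId: 10 },
  { partnerId: 1, sqlId: 11 },
  { partnerId: 2, sqlId: 10 }
];

var firstRun = runMapReduce(products, {}, "replace");
var secondRunReplace = runMapReduce(products, firstRun, "replace");
var secondRunReduce = runMapReduce(products, firstRun, "reduce");
```

In this model a second run in "replace" mode still flags nothing, while a second run in "reduce" mode doubles every count. The false duplicates described above look like the second case applied to a small subset of keys.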



 Comments   
Comment by Grégoire Seux [ 16/Mar/12 ]

No, it does not happen anymore. You can close this ticket.

Comment by Antoine Girbal [ 15/Mar/12 ]

Are you still seeing this issue?
Was it always reproducible?

Comment by Antoine Girbal [ 06/Feb/12 ]

I tried but could not reproduce this issue with v2.0.2 and a sharded collection of 200k docs.
Could you provide:

  • exact version of all components you use (mongod, mongos, etc)
  • output of each MR job (should have stats)
  • db.tmp.stats() after each MR run
  • output of db.printShardingInfo()
  • whether the issue goes away if you just set the field like 'out: "tmp"'
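
Part of the information requested above (the output of each MR job, the output collection stats, and the count of flagged keys) could be captured after every run with a small shell helper along these lines. This is only a sketch: `runAndRecord` and its parameters are hypothetical names. It assumes the legacy mongo shell, where `db.runCommand`, `db.getCollection`, `stats()` and `count(query)` are available, and it takes the `db` handle as an argument so it can be exercised outside the shell.

```javascript
// Hypothetical helper: run one map reduce job and capture what was asked for.
// `db` is the shell's database handle, passed in explicitly.
// `outSpec` is the "out" option under test, e.g. {replace : "tmp"} or "tmp".
// `outName` is the name of the output collection to inspect afterwards.
function runAndRecord(db, mapFn, reduceFn, outSpec, outName) {
  var mrResult = db.runCommand({
    mapreduce: "products",
    map: mapFn,
    reduce: reduceFn,
    out: outSpec
  });
  var coll = db.getCollection(outName);
  return {
    mrResult: mrResult,                       // output of the MR job, including its stats
    stats: coll.stats(),                      // same as db.tmp.stats() when outName is "tmp"
    flagged: coll.count({ value: { $gt: 1 } }) // keys reported as duplicates
  };
}
```

Calling it twice in a row with `{replace : "tmp"}`, and then again with `"tmp"` as the `out` value, would give directly comparable records for each run.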