[SERVER-5818] reduce in map reduce doesn't run with only one input document Created: 10/May/12  Updated: 11/Sep/13  Resolved: 10/May/12

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 2.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Brian Johnson Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

centos 6


Issue Links:
Related
is related to SERVER-10736 Modify MapReduce to "map, shuffle, re... Closed
Operating System: ALL
Participants:

 Description   

If you run a map reduce and only emit a single document for a key, the reduce doesn't get run. At first this seems to make sense, since there's nothing to reduce, but the reduce often changes the document format (computing counts, etc.), and therefore the output you get will be wrong for those operations.
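A minimal plain-JavaScript sketch of the reported behavior (a toy stand-in for the server, with made-up field names, not the actual implementation): when a key has exactly one emitted value, the value is written through without calling reduce, so a format-changing reduce produces mixed output.

```javascript
// Toy model of the reported behavior: skip reduce for single-value keys.
function toyMapReduce(docs, map, reduce) {
    var groups = {};
    docs.forEach(function (doc) {
        map(doc, function (key, value) {   // second argument plays the role of emit()
            (groups[key] = groups[key] || []).push(value);
        });
    });
    return Object.keys(groups).map(function (key) {
        var values = groups[key];
        // The disputed optimization: a single value bypasses reduce entirely.
        var value = values.length === 1 ? values[0] : reduce(key, values);
        return { _id: key, value: value };
    });
}

// A reduce that changes the document format (tallies a count).
function reduce(key, values) {
    var total = 0;
    values.forEach(function (v) { total += v.n; });
    return { count: total };               // note: no "n" field any more
}

var out = toyMapReduce(
    [{ k: 'a', n: 1 }, { k: 'a', n: 2 }, { k: 'b', n: 5 }],
    function (doc, emit) { emit(doc.k, { n: doc.n }); },
    reduce
);
// out[0] is { _id: 'a', value: { count: 3 } }  -- reduced
// out[1] is { _id: 'b', value: { n: 5 } }      -- reduce skipped, wrong format
```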



 Comments   
Comment by adam nisenbaum [ 11/May/12 ]

This seems like a bug: writing m/r results to an output collection skips the reduce for documents where map emits one and only one key with one and only one value passed to reduce.

Comment by Brian Johnson [ 10/May/12 ]

So, I've been thinking about it a lot, and I think you're right. If we make a change to our map function, we can make our reduce re-reducible, and then we just need a finalize for the (1-ab)/(1-cd) calculation.

Comment by Eliot Horowitz (Inactive) [ 10/May/12 ]

For all of those cases you can use finalize.
For most cases, re-reducing is greatly superior.
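As a sketch of what "re-reducing" means here (plain JavaScript, hypothetical `count` values): the reduce's output has the same shape as each input value, so calling it again on partial results, or not at all for a single value, leaves the answer unchanged.

```javascript
// A re-reducible reduce: output shape matches each input value's shape.
function reduce(key, values) {
    var total = 0;
    values.forEach(function (v) { total += v.count; });
    return { count: total };
}

// Reducing in one pass and re-reducing partial results agree:
var onePass = reduce('k', [{ count: 1 }, { count: 2 }, { count: 3 }]);
var rereduced = reduce('k', [reduce('k', [{ count: 1 }, { count: 2 }]),
                             { count: 3 }]);
// both are { count: 6 }
```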

Comment by Brian Johnson [ 10/May/12 ]

Semantics? I'm talking about a class of problems that can't be solved with a re-reducible function. Now, I could see your point in a sharded environment, where you might have to re-reduce the sets returned by each shard, but again, there's no reason to require it. Even if it is sharded, you could potentially still avoid the issue depending on the map and your choice of shard key. If you want to make the optimization, fine, but don't require it. Add a flag to turn it off. Why would you want to limit what your customers can do with your product? That just doesn't make sense.

Comment by Eliot Horowitz (Inactive) [ 10/May/12 ]

re-reduce has massive performance and scalability benefits, as it allows for much tighter memory usage.
Given we want re-reduction for that, the optimization in the other case is irrelevant for semantics.

Comment by Brian Johnson [ 10/May/12 ]

Reduces can be re-reducible, but there is no reason they have to be. Placing this restriction eliminates all kinds of useful things you can do with map-reduce.

in the above case, the _id really should have been key, so the document would be

{key:'val', etc...}

and the map would produce

{_id: {key: 'val'}, value: {count_a: etc...

Even if you have some philosophical disagreement about what a reduce should be, that shouldn't limit what we can do with it. You can have the best of both worlds by adding a flag to the mapreduce function that allows you to turn off the "optimization" that was created in the other ticket.

Comment by Eliot Horowitz (Inactive) [ 10/May/12 ]

This doesn't work anyway, if I understand correctly, as reduces can themselves be reduced.

What is the map in the above example? I don't see a map.

Let's say you have to build an array of all values for a key; you have to do it this way:

emit( key , { values : [ value ] } )

function reduce( k , values ) {
   var all = [];
   for ( var i = 0; i < values.length; i++ ) {
      all = all.concat( values[i].values );
   }
   return { values : all };
}
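Run outside the shell with hypothetical sample values, the array-building reduce above does re-reduce cleanly, which is the property being argued for:

```javascript
// The array-building reduce from the comment above, as plain JavaScript.
function reduce(k, values) {
    var all = [];
    for (var i = 0; i < values.length; i++) {
        all = all.concat(values[i].values);
    }
    return { values: all };
}

// One-pass reduction and re-reduction of a partial result agree:
var direct = reduce('k', [{ values: [1] }, { values: [2] }, { values: [3] }]);
var partial = reduce('k', [reduce('k', [{ values: [1] }, { values: [2] }]),
                           { values: [3] }]);
// both are { values: [1, 2, 3] }
```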

Comment by Brian Johnson [ 10/May/12 ]

here is a better example:

input documents

{_id: 'val', value: {count_a: 10, count_b: 20, count_c: 30, count_d: 40}},
{_id: 'val2', value: {count_a: 15, count_b: 20, count_c: 25, count_d: 50}}

same map as above

then reduce is

function(key, values) {
    var ab = 1, cd = 1;
    values.forEach(function(value) {
        ab *= 1 - (value.count_a/10 * value.count_b);
        cd *= 1 - (value.count_c/10 * value.count_d);
    });
    return {count: (1 - ab)/(1 - cd)};
}

We need to calculate this new value using all the existing values, and that is the output we need. If we had a finalize, it would have to first check what format the document is in and then do a partial calculation if the reduce wasn't run. Like I said, we can do it, but it's not very DRY.
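A sketch of the workaround described here (plain JavaScript, reusing the reduce above): because the reduce may be skipped for single-value keys, the finalize has to detect which format it received and repeat the calculation for the un-reduced case.

```javascript
// The ratio-calculating reduce from the example above.
function reduce(key, values) {
    var ab = 1, cd = 1;
    values.forEach(function (value) {
        ab *= 1 - (value.count_a / 10 * value.count_b);
        cd *= 1 - (value.count_c / 10 * value.count_d);
    });
    return { count: (1 - ab) / (1 - cd) };
}

// The not-very-DRY finalize: format check, plus a repeat of the calculation.
function finalize(key, value) {
    if (value.count !== undefined) return value;   // reduce ran; already converted
    return reduce(key, [value]);                   // reduce was skipped; convert here
}
```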

Comment by Brian Johnson [ 10/May/12 ]

I misunderstood the problem because as it turns out, we were outputting two different id values with one document each instead of one id value with two documents. So, we can make it work, but it means we need to do some checking in the finalize to see if the reduce has been run or not because we do calculations in the reduce that have to change the output format.

Comment by Brian Johnson [ 10/May/12 ]

I think I have this switched around, but the principle is the same if we look at what we are actually doing instead of this simplified case.

Comment by Brian Johnson [ 10/May/12 ]

We have a document collection with a set of values per month. We need to add up all the counts for each unique set of values per month. For instance:

{key: 'val', month: 'feb', users: 10},
{key: 'val', month: 'jan', users: 20},
{key: 'val2', month: 'feb', users: 30}

the map will group everything by key so it will emit

{_id: {key: 'val'}, value: {users: 10}},
{_id: {key: 'val'}, value: {users: 20}},
{_id: {key: 'val2'}, value: {users: 30}}

and reduce to

{_id: {key: 'val'}, value: {users: 30}},
{_id: {key: 'val2'}, value: {users: 30}}

If you take away val2, we have no way to get aggregate counts for key = 'val', because each document is sent to finalize individually and the reduce is never run.
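As a plain-JavaScript check of the numbers in this example (hypothetical stand-ins for the shell functions): the user-summing reduce gives {users: 30} for 'val', while a single-emit key passes its value through unreduced.

```javascript
// The user-summing reduce from the example above.
function reduce(key, values) {
    var users = 0;
    values.forEach(function (v) { users += v.users; });
    return { users: users };
}

// 'val' has two emitted values, so reduce runs:
var forVal = reduce({ key: 'val' }, [{ users: 10 }, { users: 20 }]);
// 'val2' has a single emit, so reduce is skipped and the value passes through:
var forVal2 = { users: 30 };
// forVal is { users: 30 }; here the two shapes happen to match, but a reduce
// that changed the format would make the reduced and skipped cases diverge.
```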

Comment by Eliot Horowitz (Inactive) [ 10/May/12 ]

What's an example of something you can't do because of that?

Comment by Eliot Horowitz (Inactive) [ 10/May/12 ]

What do you mean exactly by aggregate count?
If you just want to count occurrences, you just sum the values in the reduce and don't need finalize.
It doesn't matter if it's called 0 times.

Comment by Brian Johnson [ 10/May/12 ]

I think the point again is that you are limiting the set of use cases for reduce. The "fix" applied in the other ticket severely limits what you can do for very limited gain in a small set of cases.

Comment by Brian Johnson [ 10/May/12 ]

So explain to me how you would get an aggregate count for a single key. You can't do that in a finalize.

Comment by Brian Johnson [ 10/May/12 ]

I noticed this ticket https://jira.mongodb.org/browse/SERVER-2333
I think the thinking on that ticket is not correct. If you want to aggregate values for a given key, it won't work and this is exactly what we are trying to do.

Comment by Eliot Horowitz (Inactive) [ 10/May/12 ]

This is by design.
Reduce is not supposed to change the format, as it can be re-reduced.
You should use finalize to change the structure.
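A sketch of that division of labour (plain JavaScript, hypothetical field names): the reduce keeps the value shape stable, so it can be re-reduced or skipped, and the one-time format change happens only in finalize.

```javascript
// Shape-preserving reduce: safe to re-reduce or to skip for a single value.
function reduce(key, values) {
    var users = 0;
    values.forEach(function (v) { users += v.users; });
    return { users: users };           // same shape as each input value
}

// All format changes live here; finalize runs exactly once per output key.
function finalize(key, value) {
    return { totalUsers: value.users };
}

var out = finalize('val', reduce('val', [{ users: 10 }, { users: 20 }]));
// out is { totalUsers: 30 }
// A single-emit key skips reduce but still goes through finalize:
var single = finalize('val2', { users: 30 });
// single is { totalUsers: 30 }
```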

Generated at Thu Feb 08 03:09:58 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.