[SERVER-5818] reduce in map reduce doesn't run with only one input document Created: 10/May/12 Updated: 11/Sep/13 Resolved: 10/May/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | MapReduce |
| Affects Version/s: | 2.0.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Brian Johnson | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
centos 6 |
||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Participants: | |||||||||
| Description |
|
If you run a map reduce and only emit a single document, the reduce doesn't get run. At first it seems like this would make sense since there's nothing to reduce, but the reduce often changes the document format (gets counts, etc...) and therefore the output you get will be wrong for those operations. |
| Comments |
| Comment by adam nisenbaum [ 11/May/12 ] | ||||||||
|
this seems like a bug. writing m/r results to an output collection skips writing documents where map emits one and only one key with one and only one value passed to reduce | ||||||||
| Comment by Brian Johnson [ 10/May/12 ] | ||||||||
|
so, I've been thinking about it a lot, and I think you're right. If we make a change to our map function, we can make our reduce "re-reducable" and then we just need a finalize for the 1-(ab)/1-(cd) calculation | ||||||||
| Comment by Eliot Horowitz (Inactive) [ 10/May/12 ] | ||||||||
|
For all of those cases you can use finalize. | ||||||||
| Comment by Brian Johnson [ 10/May/12 ] | ||||||||
|
semantics? I'm talking about a class of problems that can't be solved using a re-reducable function. Now, I could see your point in a sharded environment where you might have to do a re-reduction on the sets returned by each shard, but again, there's not reason to require it. Even if it is sharded, you could potentially still avoid the issue depending on the map and your shard key choice. If you want to make the optimization, fine, but don't require it. Add a flag to turn it off. Why would you want to limit what your customers can do with your product? That just doesn't make any sense. | ||||||||
| Comment by Eliot Horowitz (Inactive) [ 10/May/12 ] | ||||||||
|
re-reduce has massive performance and scalability benefits, as it allows for much tighter memory usage. | ||||||||
| Comment by Brian Johnson [ 10/May/12 ] | ||||||||
|
reduces can be re-reducable, but there is no reason they have to be. Placing this restriction eliminates all kinds of useful things you can do with map-reduce. in the above case, the _id really should have been key, so the document would be {key:'val', etc...}and the map would produce {_id: {key:'val'}, value: {count_a: etc... Even if you have some philosophical disagreement about what a reduce should be, that shouldn't limit what we can do with it. You can have the best of both worlds by adding a flag to the mapreduce function that allows you to turn off the "optimization" that was created in the other ticket. | ||||||||
| Comment by Eliot Horowitz (Inactive) [ 10/May/12 ] | ||||||||
|
This doesn't work anyway if I understand as reduces can be reduced themselves. What is the map in the above example? Don't see any map. lets say you have to build an array of all values for a key, you have to do it this way:
| ||||||||
| Comment by Brian Johnson [ 10/May/12 ] | ||||||||
|
here is a better example: input documents {_id: 'val', value: {count_a: 10, count_b: 20, count_c: 30, count_d: 40}}, {_id, 'val2', value: {count_a: 15, count_b: 20, count_c: 25, count_d: 50}} same map as above then reduce is function(key, values) { values.forEach(function(value) { ab *= 1 - (value.count_a/10*value.count_b) cd *= 1 - (value.count_c/10*value.count_d) }); ; We need to calculate this new value using all the existing value and that is the output we need. If we had a finalize, it would have to first check to see what format the document is and then do a partial calculation if the reduce wasn't run. Like I said, we can do it, but it's not very DRY. | ||||||||
| Comment by Brian Johnson [ 10/May/12 ] | ||||||||
|
I misunderstood the problem because as it turns out, we were outputting two different id values with one document each instead of one id value with two documents. So, we can make it work, but it means we need to do some checking in the finalize to see if the reduce has been run or not because we do calculations in the reduce that have to change the output format. | ||||||||
| Comment by Brian Johnson [ 10/May/12 ] | ||||||||
|
I think I have this switched around, but the principle is the same if we look at what we are actually doing instead of this simplified case. | ||||||||
| Comment by Brian Johnson [ 10/May/12 ] | ||||||||
|
We have a document collection with a set of values per month. We need to add up all the counts for each unique set of values per month. For instance: {key: 'val', month: 'feb', users: 10}, {key: 'val', month: 'jan', users: 20}, {key: 'val2', month: 'feb', users: 30}the map will group everything by key so it will emit {_id: {key: 'val'}, value: {users: 10}}, {_id: {key: 'val'}, value: {users: 20}}, {_id: {key: 'val2'}, value: {users: 30}} and reduce to {_id: {key: 'val'}, value: {users: 30}}, {_id: {key: 'val2'}, value: {users: 30}} if you take away val2, we have no way to get aggregate counts for key=val because each document is sent to finalize individually and the reduce is never run. | ||||||||
| Comment by Eliot Horowitz (Inactive) [ 10/May/12 ] | ||||||||
|
What's an example of something you can't do because of that? | ||||||||
| Comment by Eliot Horowitz (Inactive) [ 10/May/12 ] | ||||||||
|
What do you mean exactly by aggregate count? | ||||||||
| Comment by Brian Johnson [ 10/May/12 ] | ||||||||
|
I think the point again is that you are limiting the set of use cases for reduce. The "fix" applied in the other ticket severely limits what you can do for very limited gain in a small set of cases. | ||||||||
| Comment by Brian Johnson [ 10/May/12 ] | ||||||||
|
So explain to me how you would get an aggregate count for a single key. You can't do that in a finalize. | ||||||||
| Comment by Brian Johnson [ 10/May/12 ] | ||||||||
|
I noticed this ticket https://jira.mongodb.org/browse/SERVER-2333 | ||||||||
| Comment by Eliot Horowitz (Inactive) [ 10/May/12 ] | ||||||||
|
This is as design. |