[SERVER-2340] MapReduce finalize should be able to throw result row away Created: 10/Jan/11 Updated: 06/Dec/22 Resolved: 04/Feb/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | MapReduce |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Trivial - P5 |
| Reporter: | Juho Mäkinen | Assignee: | Backlog - Query Optimization |
| Resolution: | Done | Votes: | 13 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Query Optimization
|
| Backwards Compatibility: | Minor Change |
| Participants: |
| Description |
|
I have a use case where I would like to throw away a result in the finalize phase within map reduce. AFAIK, finalize can currently only modify the result object, not remove it completely. My use case consists of a map reduce where I first emit { count : 1 } in the map phase and then sum the counts together in the reduce phase. I would then like to discard all results whose count is less than some value and return only those whose count is greater than my requirement. In practice, finalize will discard 99.99% of my results, so it would be much more efficient to do it there instead of manually iterating over or querying the result temp collection. I propose that returning null in the finalize phase would discard the result. Currently all examples of the finalize function return the result object, so implementing this would not change the current behavior. |
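The proposed behavior can be sketched in plain JavaScript. This is only a simulation of the map, reduce, and finalize phases, not the server implementation; the `runMapReduce` helper, the `MIN_COUNT` threshold, the `tag` field, and the sample documents are all assumptions made for illustration.

```javascript
// Plain-JavaScript simulation of the proposal; not MongoDB server code.
// MIN_COUNT, the "tag" field and the sample documents are hypothetical.
const MIN_COUNT = 3;

// Map phase: emit { count: 1 } for each document, keyed by its tag.
function map(doc, emit) {
  emit(doc.tag, { count: 1 });
}

// Reduce phase: sum the counts emitted for one key.
function reduce(key, values) {
  return { count: values.reduce((sum, v) => sum + v.count, 0) };
}

// Finalize phase under the proposal: returning null discards the row.
function finalize(key, value) {
  return value.count >= MIN_COUNT ? value : null;
}

// Hypothetical driver that groups emitted values by key and then
// applies reduce and finalize, skipping rows finalized to null.
function runMapReduce(docs) {
  const groups = new Map();
  for (const doc of docs) {
    map(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  const results = [];
  for (const [key, values] of groups) {
    const finalized = finalize(key, reduce(key, values));
    if (finalized !== null) results.push({ _id: key, value: finalized });
  }
  return results;
}

const docs = [
  { tag: "a" }, { tag: "b" }, { tag: "a" },
  { tag: "a" }, { tag: "c" }, { tag: "b" },
];
const results = runMapReduce(docs);
console.log(results); // only key "a" (count 3) reaches the threshold
```

With this sketch, keys "b" and "c" never reach the result set, which is the saving the description is after.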
| Comments |
| Comment by Esha Bhargava [ 04/Feb/22 ] |
|
Closing these tickets as part of the deprecation of mapReduce. |
| Comment by Asya Kamsky [ 30/Aug/19 ] |
|
kostano@yahoo.com I recommend you reconsider using map-reduce for your use case and see if it's possible to do it in the aggregation pipeline. We are moving away from supporting map-reduce and don't plan to add new functionality to it, in favor of enhancing the aggregation pipeline so that it can do everything map-reduce can currently do (and much more).
Is there a reason you cannot use aggregation to perform the workflow you described?
|
| Comment by Kostyantyn Oliynyk [ 29/Aug/19 ] |
|
Hi everybody, I need this functionality too. This is my business case. I have a small collection A (~0.5 million documents) which I need to populate with information stored in a "lookup" collection B of 80 million documents. The challenge is the complex mapping logic needed to match a document in A with a document in B. There is no way I can create an index in B. My plan was to use map reduce: in the map function for collection B, create a key (following complex logic; more than one document can be emitted, analogous to flatMap for Java streams) with the required information in the value, and output to collection A so that matched keys would be passed to the reduce function. In the reduce function, add a specific tag to the value so I could identify documents that were processed by the reduce function. In the finalize function I would return null so that documents that are not required would be discarded. Another possibility would be to add a filter condition for the map reduce result, the same kind of approach used for filtering the documents to be mapped with the query option.
|
| Comment by Asya Kamsky [ 16/Jun/16 ] |
|
gargsatish0 the aggregation pipeline can handle very complex conditionals (though not all, obviously). If there are missing functions, please be sure to add a new server ticket requesting new functionality for aggregation so that it can be the solution for your use case. |
| Comment by Satish Garg [ 11/Jun/16 ] |
|
Hi, I also cannot use the aggregation pipeline, as I need to make complex conditional projections. And I can't do this on the application side, as both the source and the resultant datasets are huge. Thanks, |
| Comment by Maziyar Panahi [ 26/May/15 ] |
|
I would also appreciate it if it were possible to remove the key from the results at the finalize stage. One of the best things about finalize is being able to check whether the final results meet our conditions. Millions of documents would be reduced to a couple of thousand, which is a huge improvement for inserts and for further operations on the result_collection. If it were possible to use aggregation I would definitely have used it by now, as I do for so many other things, since it's faster, easier and more convenient, but it is not as flexible as MR! Thanks, |
| Comment by Asya Kamsky [ 03/Nov/14 ] |
|
This is easily done if your MR can be done as an aggregation pipeline (as the original example can be)... |
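As a rough sketch of how the original count-and-filter example might look as an aggregation pipeline: the `$group` stage sums one per document and the `$match` stage drops groups below the threshold. The `tag` field, the `events` collection name, and the `MIN_COUNT` value are assumptions, not details from this ticket.

```javascript
// Hypothetical threshold and field name; adjust to the real schema.
const MIN_COUNT = 3;

const pipeline = [
  // Equivalent of map + reduce: count documents per key.
  { $group: { _id: "$tag", count: { $sum: 1 } } },
  // Equivalent of a discarding finalize: keep only groups at or above the threshold.
  { $match: { count: { $gte: MIN_COUNT } } },
];

// In the mongo shell this would be run against an assumed collection, e.g.:
//   db.events.aggregate(pipeline)
console.log(JSON.stringify(pipeline));
```

Because the filter runs on the server as part of the pipeline, the discarded groups are never written to an output collection or returned to the client.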
| Comment by Jalmari Raippalinna [ 03/Nov/14 ] |
|
I need this feature in the following use case: due to performance requirements we are forced to use { inline: 1 } as output. We are reducing a huge data set to a data set that is over 16 megabytes (which is the BSON size limit), but 90% of the data could be discarded in finalize by just returning null. Add an option { discardNullOnFinalize: 1 } if you are concerned that users might want to return a null object with the key still in the result set. |
| Comment by Lalit Agarwal [ 11/Apr/13 ] |
|
I am also facing the same issue. Any updates on this improvement? |
| Comment by Antoine Girbal [ 17/Oct/11 ] |
|
we need to agree on feature 1st |
| Comment by auto [ 17/Oct/11 ] |
|
Author: {u'login': u'agirbal', u'name': u'agirbal', u'email': u'antoine@10gen.com'}Message: |
| Comment by Antoine Girbal [ 17/Oct/11 ] |
|
to use this feature, return null in finalize. |
| Comment by Juho Mäkinen [ 10/Jan/11 ] |
|
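Example of a finalize function which would fit the use case: f = function (key, value) { /* the original condition was lost; minCount is a placeholder threshold */ if (value.count >= minCount) { return value; } else { return null; } } |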