  1. Core Server
  2. SERVER-2340

MapReduce finalize should be able to throw result row away

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Trivial - P5
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: planned but not scheduled
    • Component/s: MapReduce
    • Labels:
      None
    • Backwards Compatibility:
      Minor Change

      Description

      I have a use case where I would like to throw away a result in the finalize phase of a map-reduce. AFAIK, finalize can currently only modify the result object, not remove it completely.

      My use case is a map-reduce where I first emit

      { count : 1 }

      in the map phase and then sum the counts in the reduce phase. I would then like to discard all results whose count is less than some value and return only those whose count meets my threshold. In practice, finalize would discard 99.99% of my results, so it would be much more efficient to do it there than to manually iterate over or query the temporary result collection.

      I propose that returning null from finalize would discard the result. All current examples of the finalize function return the result object, so implementing this would not change existing behavior.
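
The proposal above can be sketched in plain JavaScript. This is only an in-memory simulation of the requested semantics, not MongoDB's implementation: `runMapReduce`, the `tag` field, and `MIN_COUNT` are hypothetical names, and `emit` is passed to `map` as an argument here rather than being a global as in the server.

```javascript
// Hypothetical in-memory stand-in for a map-reduce run (not MongoDB code).
function runMapReduce(docs, map, reduce, finalize) {
  const groups = new Map();
  // Map phase: every emit(key, value) appends the value to that key's bucket.
  for (const doc of docs) {
    map.call(doc, (key, value) => {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    });
  }
  const results = [];
  for (const [key, values] of groups) {
    // Reduce phase: keys with a single value skip reduce.
    const reduced = values.length > 1 ? reduce(key, values) : values[0];
    const finalized = finalize(key, reduced);
    // Proposed behavior: a null from finalize discards the whole result row.
    if (finalized === null) continue;
    results.push({ _id: key, value: finalized });
  }
  return results;
}

const MIN_COUNT = 3; // hypothetical threshold

function map(emit) {
  emit(this.tag, { count: 1 });
}

function reduce(key, values) {
  let total = 0;
  for (const v of values) total += v.count;
  return { count: total };
}

function finalize(key, value) {
  // Keep rows that meet the threshold, discard the rest.
  return value.count >= MIN_COUNT ? value : null;
}

const docs = [{ tag: "a" }, { tag: "a" }, { tag: "a" }, { tag: "b" }];
const kept = runMapReduce(docs, map, reduce, finalize);
```

With these inputs only the key "a" survives, because "b" is counted once and falls below the threshold.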

        Activity

        jalava Jalmari Raippalinna added a comment -

        I need this feature for the following use case:

        Due to performance requirements we are forced to use

        { inline: 1 }

        as output.

        We are reducing a huge data set to a data set that is over 16 megabytes (the BSON size limit), but 90% of the data could be discarded in finalize by just returning null.

        Add an option such as

        { discardNullOnFinalize: 1 }

        if you are concerned that users might want to return a null object with the key still in the result set.
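
A minimal sketch of how such an opt-in flag could behave, assuming it only affects the finalize step. `discardNullOnFinalize` is the option name suggested in this comment and does not exist in MongoDB; `applyFinalize` and the sample rows are hypothetical.

```javascript
// Hypothetical helper: models only the finalize step, not MongoDB's implementation.
function applyFinalize(reducedRows, finalizeFn, options = {}) {
  const out = [];
  for (const row of reducedRows) {
    const value = finalizeFn(row._id, row.value);
    // Opt-in behavior: drop the key entirely when finalize returns null.
    if (value === null && options.discardNullOnFinalize) continue;
    // Default behavior: keep the key, even with a null value.
    out.push({ _id: row._id, value });
  }
  return out;
}

const reducedRows = [
  { _id: "a", value: { count: 5 } },
  { _id: "b", value: { count: 1 } },
];
const dropSmall = (key, value) => (value.count >= 3 ? value : null);

// Without the option, "b" stays in the output with a null value.
const withKeys = applyFinalize(reducedRows, dropSmall);
// With the option, "b" is removed from the output.
const withoutNulls = applyFinalize(reducedRows, dropSmall, { discardNullOnFinalize: 1 });
```

Keeping the flag off by default would preserve results for anyone who already returns null from finalize on purpose.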

        asya Asya Kamsky added a comment -

        This is easily done if your MR can be written as an aggregation pipeline (as the original example can be)...
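
For the count-and-threshold case in the description, an equivalent pipeline might look like the sketch below, using the `$group`, `$sum`, `$match`, and `$gte` operators. The helper `buildCountPipeline`, the field name `tag`, the threshold 1000, and the collection name `events` are all assumptions for illustration.

```javascript
// Builds a pipeline that counts documents per key and keeps only keys
// whose count reaches a threshold. Parameter names are illustrative.
function buildCountPipeline(keyField, minCount) {
  return [
    // Equivalent of emitting { count: 1 } and summing in reduce.
    { $group: { _id: "$" + keyField, count: { $sum: 1 } } },
    // Equivalent of discarding low counts in finalize.
    { $match: { count: { $gte: minCount } } },
  ];
}

const pipeline = buildCountPipeline("tag", 1000);

// In the mongo shell, assuming a collection named `events`:
// db.events.aggregate(pipeline)
```

Because the filter runs on the server as a pipeline stage, the discarded groups never reach the client or an output collection.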

        maziyar Maziyar Panahi added a comment -

        I would also appreciate being able to remove the key from the results at the finalize stage. One of the best things about finalize is being able to check whether the final results meet our conditions. Millions of documents would be reduced to a couple of thousand, which is a huge improvement for inserts and for further operations on the result_collection.

        If it were possible to use aggregation, I would definitely have used it by now, as I do for so many other things, since it's faster, easier and more convenient, but it is not as flexible as MR!

        Thanks,
        Maziyar

        gargsatish0 Satish Garg added a comment - - edited

        Hi,
        Any update on this?
        Is it now possible to remove a key in the map-reduce finalize function, or is there any workaround that I could try?

        Also, I cannot use the aggregation pipeline, as I need to make complex conditional projections, which are currently not supported.

        And I can't do this on the application side, as both the source and the resulting data sets are huge.

        Thanks,
        Satish

        asya Asya Kamsky added a comment -

        Satish Garg, the aggregation pipeline can handle very complex conditionals (though not all, obviously). If there are missing functions, please be sure to add a new server ticket requesting the new functionality for aggregation so that it can be the solution for your use case.


          People

          • Votes: 12
          • Watchers: 13
