Core Server / SERVER-9907

Allow to skip initial count() in mapreduce

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4.7, 2.5.2
    • Component/s: MapReduce
    • Labels: None

      Description

      MongoDB Status as of October 9th, 2013

      ISSUE SUMMARY
      To report the progress of an ongoing mapReduce job, the query filter that selects the job's input documents is first run on its own to get the total count of matching documents. For long-running queries, this extra query, which exists only to support progress logging, adds significantly to the overall mapReduce run time.

      USER IMPACT
      This fix is a performance improvement only. The log messages written during a mapReduce change when a query filter is used: instead of reporting "percentage complete," they report a running count of documents processed.

      SOLUTION
      The issue has been resolved by having the ProgressMeter use the total count of documents only when no query filter is used.
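
      The behavior described above can be sketched roughly as follows. This is a minimal illustration in Python, not the server's C++ code: the function name run_with_progress, its parameters, and the exact wording of the progress lines are all hypothetical. It shows a percentage only when no filter is given and a total is already known, and falls back to a running count otherwise.

```python
from typing import Callable, Iterable, List, Optional


def run_with_progress(
    docs: Iterable[dict],
    query_filter: Optional[Callable[[dict], bool]] = None,
    total: Optional[int] = None,
    report_every: int = 2,
) -> List[str]:
    """Process docs and return the progress lines that would be logged.

    Hypothetical sketch: names and message formats are assumptions.
    """
    # Only use a total when no filter is in play. With a filter, obtaining
    # the total would mean running the query a second time just for logging.
    use_percentage = query_filter is None and total is not None and total > 0
    lines: List[str] = []
    processed = 0
    for doc in docs:
        if query_filter is not None and not query_filter(doc):
            continue  # document does not match the filter
        processed += 1  # the map step would run here
        if processed % report_every == 0:
            if use_percentage:
                lines.append(f"{processed}/{total} {100 * processed // total}%")
            else:
                lines.append(f"{processed} documents processed")
    return lines
```

      With a filter, the caller never needs to know how many documents will match, so no extra pass over the input is required.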

      WORKAROUNDS
      There is no workaround.

      PATCHES
      Production release v2.4.7 contains the fix for this issue, and production release v2.6.0 will contain the fix as well.

      Original Description

      A significant portion of the map reduce job may be spent actually matching the input documents.
      Right now we do an initial count() (line 594 mr.cpp) in order to display the progress meter.

      In my production example, about 90% of the time is spent matching the input documents (there is no ideal way to index further), so the initial count() takes roughly half of the entire job time, all of it wasted.

      Either:

      • remove the initial count() and have the progress meter display how many documents have been processed instead of the percentage complete
      • add an option like "in.showProgress: false" to disable the count().

      This map reduce application will have to ingest a large volume of data, and the matching rules are pretty complex, so having that option may save up to 50% of MR execution time.
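
      The "up to 50%" figure can be checked with a small calculation. The sketch below assumes that the initial count() costs about as much as the matching phase of the job itself, and that the 90% figure refers to the job's run time without the count(); both are assumptions, and count_share is a hypothetical helper name.

```python
def count_share(match_fraction: float) -> float:
    """Fraction of the total run time spent in the initial count().

    Assumes the count() costs as much as the job's own matching phase,
    i.e. match_fraction of the job's run time (normalised to 1.0).
    """
    if not 0.0 <= match_fraction <= 1.0:
        raise ValueError("match_fraction must be between 0 and 1")
    job_time = 1.0                           # map/reduce job without the count()
    count_time = match_fraction * job_time   # extra pass that only matches documents
    return count_time / (job_time + count_time)


if __name__ == "__main__":
    # Matching at 90% of the job gives a share of 0.9 / 1.9, a little under half.
    print(f"{count_share(0.9):.1%}")
```

      Under these assumptions the share approaches, but never exceeds, one half as matching comes to dominate the job, which is consistent with the "up to 50%" estimate.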
