To report the progress of a running mapReduce job, the server first runs the job's input filter query as a count to get the total number of matching documents. For long-running queries, this extra pass, which exists only for logging, adds significantly to the overall mapReduce run time.
This fix is a performance improvement only. It does change the log messages emitted during a mapReduce that uses a filter: instead of "percentage complete," a running count of documents processed is reported.
The issue has been resolved by using the total document count in the ProgressMeter only when no query filter is used.
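The resolved behavior can be sketched roughly as follows. This is an illustrative Python model, not the server's C++ code: the function names, parameters, and message wording are all hypothetical and chosen only to show the branch on whether a filter is present.

```python
from typing import Optional


def total_for_job(has_filter: bool, collection_size: int) -> Optional[int]:
    """Return the total to hand to the progress meter, or None.

    Assumption: the collection size is cheap to obtain when there is no
    filter, whereas counting filtered matches means running the query.
    """
    return None if has_filter else collection_size


def progress_message(done: int, total: Optional[int]) -> str:
    """Format one progress line (hypothetical wording)."""
    if total is None or total <= 0:
        # Filtered job: no total is known, so report a running count.
        return f"{done} documents processed"
    # Unfiltered job: the total is known, so a percentage is possible.
    percent = 100 * done // total
    return f"{done}/{total} ({percent}%)"
```

With a filter, `progress_message(250, total_for_job(True, 1000))` yields a plain running count; without one, the same call with `False` yields a percentage.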
There is no workaround.
Production release v2.4.7 contains the fix for this issue, and production release v2.6.0 will contain the fix as well.
A significant portion of the map reduce job may be spent actually matching the input documents.
Right now we do an initial count() (line 594 mr.cpp) in order to display the progress meter.
In my production example, about 90% of the time is spent matching the input documents (there is no good way to index further), so the wasted initial count() takes about half of the entire job time.
- remove the initial count(); progress meters should just display how many documents have been processed instead of the percentage of completion
- add an option like "in.showProgress: false" to disable the count().
This map reduce application will have to ingest a large volume of data, and the matching rules are pretty complex, so having that option may save up to 50% of MR execution time.
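A back-of-the-envelope estimate of that saving, as a minimal sketch. It assumes (this is not stated in the ticket) that the up-front count() costs as much as the matching part of the main pass; `count_share_of_job` is a hypothetical helper name.

```python
def count_share_of_job(match_fraction: float) -> float:
    """Fraction of total job time spent in the up-front count().

    match_fraction: share of the main pass spent matching input
    documents, between 0 and 1.
    Assumption: count() costs as much as that matching work.
    """
    if not 0.0 <= match_fraction <= 1.0:
        raise ValueError("match_fraction must be between 0 and 1")
    main_pass = 1.0              # normalise the main pass to one time unit
    count_pass = match_fraction  # extra pass spent only on matching
    return count_pass / (main_pass + count_pass)
```

Under this assumption, a match fraction of 0.9 gives 0.9 / 1.9, a little under half of the job, and the share reaches exactly 0.5 only when matching accounts for the entire main pass.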
Related to: SERVER-12710 Map-Reduce reports incorrect stats in db.currentOp