A significant portion of a map reduce job's run time may be spent just matching the input documents.
Right now we do an initial count() (line 594 mr.cpp) in order to display the progress meter.
In my production example, about 90% of the time is spent matching the input documents (there is no ideal way to index further), and consequently the initial count() is wasted work that takes about half of the entire job time.
- remove the initial count(); the progress meter should just display how many documents have been done instead of % of completion
- add an option like "in.showProgress: false" to disable the count().
This map reduce application will have to ingest a large volume of data, and the matching rules are pretty complex, so having that option may save up to 50% of MR execution time.
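
To illustrate the two suggestions, here is a minimal sketch of a progress meter that reports a percentage only when a total is supplied and otherwise falls back to a plain running count. The class `ProgressSketch` and its methods are hypothetical names made up for this example; this is not the progress meter used in mr.cpp, and it assumes a C++17 compiler for `std::optional`.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical illustration only; not the class used in mr.cpp.
class ProgressSketch {
public:
    // Leave `total` empty when the up-front count() was skipped.
    explicit ProgressSketch(std::optional<std::uint64_t> total = std::nullopt)
        : total_(total) {}

    // Record one processed input document.
    void hit() { ++done_; }

    // Percentage when the total is known, plain running count otherwise.
    std::string status() const {
        if (total_ && *total_ > 0) {
            const std::uint64_t pct = done_ * 100 / *total_;
            return std::to_string(done_) + "/" + std::to_string(*total_) +
                   " " + std::to_string(pct) + "%";
        }
        return std::to_string(done_) + " documents processed";
    }

private:
    std::optional<std::uint64_t> total_;
    std::uint64_t done_ = 0;
};
```

With this shape, a job started with the proposed option turned off would construct the meter without a total and never pay for the extra pass over the matching documents.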