[SERVER-10736] Modify MapReduce to "map, shuffle, reduce", and always take lists on the reducer input Created: 11/Sep/13  Updated: 06/Dec/22  Resolved: 04/Feb/22

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: Nicolau Leal Werneck
Assignee: Backlog - Query Execution
Resolution: Done
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-5818 reduce in map reduce doesn't run with... Closed
related to SERVER-2333 mapreduce optimization: do not execut... Closed
Assigned Teams:
Query Execution
Participants:

 Description   

MongoDB's mapReduce command takes two required functions, "map" and "reduce", and an optional "finalize" function. "reduce" is supposed to output data in the same format as the values emitted by "map".

In some other frameworks, the functions are "map", "shuffle" and "reduce". There, "shuffle" is the function that must output the same data format as "map", just like "reduce" in MongoDB, but "shuffle" is the optional function, and the required "reduce" is more like MongoDB's "finalize". "shuffle" is also known as "local reduce".
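
As a rough illustration of the three-step model described above, the following Python sketch runs "map", an optional "shuffle" (local reduce) on each chunk of input, and a final "reduce" over all values for each key. The names (`map_shuffle_reduce`, `word_map`, `sum_shuffle`, `total_reduce`) and the chunking scheme are hypothetical; they are not part of MongoDB or of any particular framework.

```python
from collections import defaultdict


def map_shuffle_reduce(chunks, map_fn, reduce_fn, shuffle_fn=None):
    """Hypothetical sketch: map, optional shuffle (local reduce), then reduce."""
    merged = defaultdict(list)
    for chunk in chunks:
        # Map step: each document emits zero or more (key, value) pairs.
        local = defaultdict(list)
        for doc in chunk:
            for key, value in map_fn(doc):
                local[key].append(value)
        for key, values in local.items():
            if shuffle_fn is not None:
                # Shuffle step: collapse the local values into a single value
                # that has the same format as a value emitted by map.
                values = [shuffle_fn(key, values)]
            merged[key].extend(values)
    # Reduce step: always receives a list and may return any format.
    return {key: reduce_fn(key, values) for key, values in merged.items()}


def word_map(doc):
    return [(word, 1) for word in doc.split()]


def sum_shuffle(key, values):
    return sum(values)


def total_reduce(key, values):
    return {"word": key, "total": sum(values)}


chunks = [["a b a"], ["b a"]]
with_shuffle = map_shuffle_reduce(chunks, word_map, total_reduce, sum_shuffle)
without_shuffle = map_shuffle_reduce(chunks, word_map, total_reduce)
```

In this sketch the result is the same with or without the shuffle step, which is why it can be optional: it only shrinks the list that the final reduce has to process.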

It would be great if MongoDB could work like this instead, with the different nomenclature and optional parameters. Maybe changing the mapReduce method, or maybe creating a new method...

Another interesting modification is to always deliver the data to the final step ("finalize"/"reduce") inside a list, even if there is just one item. This way we can always assume there is a list to process, and the method becomes simpler to write.

It should also be easy to have an "identity reducer"; it could be the default when no reducer is specified.
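
A minimal sketch of these last two points, assuming the values have already been grouped by key: the final step always receives a list, and an identity reducer is used when no reducer is given. The names `final_step` and `identity_reducer` are illustrative only and do not exist in MongoDB.

```python
def identity_reducer(key, values):
    # Return the values for the key unchanged, as a list.
    return list(values)


def final_step(grouped, reducer=None):
    """Apply the reducer to every key; default to the identity reducer."""
    if reducer is None:
        reducer = identity_reducer
    # Every key goes through the reducer with a list, even when that list
    # holds a single item.
    return {key: reducer(key, list(values)) for key, values in grouped.items()}


grouped = {"x": [5], "y": [1, 2, 3]}
default_output = final_step(grouped)
max_output = final_step(grouped, lambda key, values: max(values))
```

Because the reducer is called for "x" with `[5]` rather than being skipped or handed a bare `5`, the same `max(values)` works for both keys without a special case.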

Related tickets:



 Comments   
Comment by Esha Bhargava [ 04/Feb/22 ]

Closing these tickets as part of the deprecation of mapReduce.

Comment by Rafael [ 24/Feb/14 ]

A design that always delivers the data to the final step ("finalize"/"reduce") inside a list, even if there is just one item, is a more robust long-term solution.

This way we can always assume there is a list to process, and the method becomes simpler to write.

Comment by Nicolau Leal Werneck [ 11/Sep/13 ]

The title is obviously incorrect; it should be "...and always take lists on the reducer input".

Also, I should note that we could maintain compatibility with current MongoDB by continuing to call it "reduce" instead of "shuffle", and using it as "finalize" ("reduce" in the Bizarro World) if no "finalize" is specified.

This is pretty much what happens today, but there would still be some changes: finalize (reduce) and reduce (shuffle) are not exactly optional or non-optional. It is only necessary to have at least one of them. The change is to allow us to have only "finalize" if we so desire. And if neither of them is available, the output should be that of the identity reducer: a list of all values for each key.
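
The selection rule described above could be sketched as follows. This is only one reading of the proposal, with a hypothetical helper name (`choose_final_step`); it does not describe how the server actually behaves.

```python
def choose_final_step(reduce=None, finalize=None):
    """Pick the function that produces the final output for each key."""
    if finalize is not None:
        # "finalize" plays the role of "reduce" in other frameworks.
        return finalize
    if reduce is not None:
        # Backward-compatible case: "reduce" doubles as the final step.
        return reduce
    # Neither given: identity reducer, a list of all values for the key.
    return lambda key, values: list(values)


values = [3, 4]
only_finalize = choose_final_step(finalize=lambda k, v: sum(v) / len(v))
only_reduce = choose_final_step(reduce=lambda k, v: sum(v))
neither = choose_final_step()

results = (
    only_finalize("k", values),
    only_reduce("k", values),
    neither("k", values),
)
```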
