[SERVER-2517] Allow mapReduce to create full documents Created: 10/Feb/11 Updated: 06/Dec/22 Resolved: 28/Jun/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | MapReduce |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Chris Eppstein | Assignee: | Backlog - Query Team (Inactive) |
| Resolution: | Won't Do | Votes: | 149 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Description |
|
This is a follow up to a support thread: http://groups.google.com/group/mongodb-user/browse_thread/thread/7e1f16c81940c24b Please allow mapReduce jobs to optionally return full documents instead of placing the result into the value field. The _id of the documents would be the reduction key, but the returned value from reduce (or finalize) would be required to be a hash. This makes working with a permanent collection generated by a map/reduce job much nicer. I suggest that a new out option be added: out: {document: 1} |
| Comments |
| Comment by Asya Kamsky [ 28/Jun/19 ] | ||||||
|
This project was superseded by the aggregation pipeline gaining the ability to merge its output into an existing collection, including sharded ones, via the $merge stage. Closing this ticket. | ||||||
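As a rough sketch of what that replacement can look like, the pipeline below groups documents and merges the flat results into an existing collection with `$merge` (available from MongoDB 4.2). The collection names `events` and `session_stat` and the fields `sessionId` and `duration` are hypothetical, chosen only for illustration; the `$group` and `$merge` options shown are the documented ones.

```javascript
// Hypothetical pipeline: collection and field names are illustrative assumptions.
const pipeline = [
  // The group key becomes _id; the aggregated results are top-level fields,
  // not nested under a "value" field as in mapReduce output.
  {
    $group: {
      _id: "$sessionId",
      visits: { $sum: 1 },
      totalDuration: { $sum: "$duration" }
    }
  },
  // Merge into an existing collection instead of replacing it.
  {
    $merge: {
      into: "session_stat",
      on: "_id",
      whenMatched: "merge",
      whenNotMatched: "insert"
    }
  }
];

// In the mongo shell this would be run as: db.events.aggregate(pipeline)
```

Note that `whenMatched: "merge"` overwrites matching fields with the new values rather than re-reducing them; `whenMatched` also accepts an update pipeline when old and new values need to be combined.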
| Comment by Asya Kamsky [ 28/Jan/19 ] | ||||||
|
mark.monroe@aruplab.com thanks for the response. My goal is to determine exactly this: what gaps does aggregation have that still prevent someone from switching to it from map-reduce? We added the ability to add to an existing collection in aggregation for 4.2, but we are trying to prioritize/schedule other work for it. The idea of that feature is to allow writing to the collection via another aggregation, regular inserts, or any other way, so we wouldn't recreate the existing problems of map-reduce. | ||||||
| Comment by Mark Monroe [ 28/Jan/19 ] | ||||||
|
The aggregation framework currently does not allow you to append to an existing collection, rather, you can only replace an existing collection with the entire output of the aggregation. Map-reduce is useful when you want to append to an existing collection. In version 4.2 of MongoDB, it looks like the ability to append to an existing collection via an aggregation will be added. That makes this feature request moot for me, as I will just use aggregation instead of map-reduce. When you do append to an existing collection with map-reduce, we want the collection to stay around for a while, and having it create full documents in the value field is ugly from a data structure point of view. For example, what if documents in that collection are added via a normal insert as well as through map-reduce? It looks weird to have normal inserts put the document in a nested value field. | ||||||
| Comment by Asya Kamsky [ 28/Jan/19 ] | ||||||
|
Those of you recently voting on this ticket, can you give some additional insight about what exactly you are doing in map-reduce (and why you have to continue using it rather than moving to aggregation framework which runs natively on the server, allows arbitrary transformation of documents, etc)? | ||||||
| Comment by Anton [ 23/Jan/19 ] | ||||||
|
+ 1 | ||||||
| Comment by Naidu S [ 04/Oct/17 ] | ||||||
|
+1 | ||||||
| Comment by Rahul Shukla [ 09/Aug/16 ] | ||||||
|
+1 | ||||||
| Comment by Matteo Moci [ 01/Jul/16 ] | ||||||
|
+1 | ||||||
| Comment by Gonzalo Diaz [ 09/Jun/16 ] | ||||||
|
+1 ! | ||||||
| Comment by Vaidas Laauskas [ 13/Oct/15 ] | ||||||
|
+1 | ||||||
| Comment by Dhruv Gairola [ 10/Oct/15 ] | ||||||
|
+1 | ||||||
| Comment by Kyle Estes [ 28/Jul/15 ] | ||||||
|
For what it is worth, this is the very first time I used the map-reduce functionality of MongoDB, and I found myself here. | ||||||
| Comment by Gabor Mezo [ 24/Jul/15 ] | ||||||
|
+100 | ||||||
| Comment by Ian Beaver [ 24/Jun/15 ] | ||||||
|
+1 making this configurable would allow MR jobs to merge documents back into collections that were not originally created by a MR job. | ||||||
| Comment by Vinicius Seixas [ 20/Mar/15 ] | ||||||
|
+1 | ||||||
| Comment by Kay Fleischmann [ 15/Jan/15 ] | ||||||
|
+1 Definitely a must have. | ||||||
| Comment by Jeff Whelpley [ 30/Sep/14 ] | ||||||
|
+1 There are so many uses for this functionality if it existed. At a high level, it makes it much easier to synchronize de-normalized data. | ||||||
| Comment by Vincent [ 09/Aug/14 ] | ||||||
|
Can't believe it can't be done already... through the finalize function, it would be SO easy... | ||||||
| Comment by Mohammad Rafi [ 21/Jul/14 ] | ||||||
|
+1 | ||||||
| Comment by Ricardo Oliveira [ 10/Jun/14 ] | ||||||
|
+1, and more generally a callback function that allows transforming both the key and the value | ||||||
| Comment by Rafael [ 24/Feb/14 ] | ||||||
|
+1 | ||||||
| Comment by Kelvin Mackay [ 31/Jan/14 ] | ||||||
|
This would be a game changer. +1! | ||||||
| Comment by yassine chekkoury [ 04/Dec/13 ] | ||||||
|
+1 | ||||||
| Comment by bertrand [ 21/Nov/13 ] | ||||||
|
+1 | ||||||
| Comment by Martin Peranic [ 25/Oct/13 ] | ||||||
|
+1 | ||||||
| Comment by Tamas Foldenyi [ 18/Sep/13 ] | ||||||
|
+1 | ||||||
| Comment by David Castro [ 05/Sep/13 ] | ||||||
|
+1 This would allow the results of a mapReduce to be returned such that they simply appear to be a subset of the original collection. Seems like a pretty natural expectation when you are using mapReduce in this way, which appears to be fairly common from the looks of this thread. In any case, requiring further mangling/unwrapping to remove the added structure from output records is a bit tedious and requires some additional overhead in processing. Not adding it in the first place would be ideal. So, +1 for non-wrapped ("flat") output. | ||||||
| Comment by Vinícius Borriello [ 31/Jul/13 ] | ||||||
|
+1 indeed | ||||||
| Comment by Michael Ahlers [ 24/Jul/13 ] | ||||||
|
I would like this, too. Keying on _id, and returning the updated document seems a natural way to implement document migration when schemas change. (Unless, of course, someone can offer another strategy.) | ||||||
| Comment by Paul Hadfield [ 15/Jul/13 ] | ||||||
|
+1 This would be very helpful for my application too. | ||||||
| Comment by Daniel Gafitescu [ 08/Jul/13 ] | ||||||
|
I need it as well +1 | ||||||
| Comment by Stefan Fochler [ 01/Jul/13 ] | ||||||
|
Definitely a +1. | ||||||
| Comment by Scott Jappinen [ 25/Jun/13 ] | ||||||
|
+1 map means map, i.e. output should look like input plus desired transformation. I don't get why anyone would desire the default behavior of wrapping the desired output in a value block--how is this useful? Why reducers are required I don't get either. With map reduce one should be able to write simple identity mappers and identity reducers. | ||||||
| Comment by Will Shaver [ 20/May/13 ] | ||||||
|
+1 for some implementation allowing for more shaping of the output document. | ||||||
| Comment by Tianon Gravi [ 28/Feb/13 ] | ||||||
|
I think if we're having some kind of incremental merging behavior, it would be useful to be able to specify things like $addToSet, $unset, and $inc instead of just doing an implicit $set or even a naive straight-up replacement. This way, using mapReduce to calculate things like statistics becomes trivial, and will eventually get the automatic added benefit of parallelism (now that we have v8) without then adding another naive loop over the output data structure to push the data back into the original. Just my $0.02, for what it's worth. I know at my company, we'd be plenty happy with just the option to have flat output, as that would make mapReduce useful to us again for at least some cases. As it stands now, almost everywhere we have used it in the past or might use it in the future, we write all our tasks in some other language like Perl or Go, even for tasks that really would be much simpler directly in JavaScript if mapReduce could have flat output. | ||||||
| Comment by John Crenshaw [ 27/Feb/13 ] | ||||||
|
@Reuben Garrett, Actually, trying to accomplish incremental MR with an additional initialize would be devastating to performance, because it would mean that every record in the target collection would have to be processed to determine what reduces against what. IMO, all that is really needed is an option for "flat" output. Implementation is very simple:
As a practical matter you can accomplish basically any output format you like this way without significant impact to incremental runtime. The one exception is that the _id value in the incremental collection must exactly match the _id you emit in the map phase, which I think is fair, expected (at least once you think about it), and a small price to pay for not ruining performance. Syntax might be something like: | ||||||
| Comment by Reuben Garrett [ 27/Feb/13 ] | ||||||
|
+1 for "finalize" as a place to define document layout (martynovs, supra). I also see bugslayer's concern as legitimate - so consider adding an option for an "initialize" function to transform the documents from their persisted format into the manipulation format preferred by map/reduce. Transform functors are a powerful means to achieve customization without requiring exhaustive implementation effort on the part of the committers. To be sure, their work is far from trivial - what I mean is that we can reap significant value by exposing an API hook and delegating implementation of specific use-cases to the user. | ||||||
| Comment by John Crenshaw [ 24/Feb/13 ] | ||||||
|
The big use case I see for this is incremental map reduce (out: { reduce: "session_stat" }). With an incremental map reduce you can't restructure the data after running, because if you do, then it won't properly reduce new results against the data. This severely limits the usefulness of the reduce output type. | ||||||
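To illustrate why the stored shape cannot be changed between runs, here is a small in-memory simulation of the `out: { reduce: ... }` merge step. It is only a sketch of the semantics, not MongoDB code: `mergeWithReduce`, `sumStats`, and the `visits` field are hypothetical names introduced for the example.

```javascript
// In-memory simulation of out: { reduce: <collection> } semantics (illustration only).
// target:  Map from key to the previously stored reduced value.
// results: array of { _id, value } produced by the new run.
function mergeWithReduce(target, results, reduce) {
  for (const r of results) {
    if (target.has(r._id)) {
      // The stored value and the new value are reduced together,
      // so both must have the shape the reduce function expects.
      target.set(r._id, reduce(r._id, [target.get(r._id), r.value]));
    } else {
      target.set(r._id, r.value);
    }
  }
  return target;
}

// Hypothetical reduce function that sums a per-session counter.
function sumStats(key, values) {
  return values.reduce((acc, v) => ({ visits: acc.visits + v.visits }), { visits: 0 });
}

const stored = new Map([["s1", { visits: 2 }]]);
mergeWithReduce(
  stored,
  [{ _id: "s1", value: { visits: 3 } }, { _id: "s2", value: { visits: 1 } }],
  sumStats
);
console.log(stored.get("s1").visits, stored.get("s2").visits); // 5 1
```

If the stored documents were restructured after a run, the next run's reduce call would receive a value it no longer understands, which is the limitation described above.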
| Comment by Brandon Berry [ 21/Nov/12 ] | ||||||
|
+1 I'd fund a few kegs for this feature. My 'work around' is to implement custom deserializer(s) at the driver level to map the value to the target object. Intermediate collections and eval() statements give me a twitch, but I guess for tiny data sets it isn't a big deal. | ||||||
| Comment by Michael Saffitz [ 09/Sep/12 ] | ||||||
|
Another +1 for this feature request. Our use case is similar to Scotty Allen's. | ||||||
| Comment by Lars Niemann [ 24/Aug/12 ] | ||||||
|
+1 for this feature request! | ||||||
| Comment by Arian Ryan [ 20/Aug/12 ] | ||||||
|
+1 for this feature request. We work with a 3rd party analytics firm. They offer limited support for data stored in MongoDB. One of the limitations is that they can't (won't) handle data nested more than one level deep. Now I have to make a solution to move the m/r output data "up" out of the arbitrary 'value' level. Not the end of the world, but it would be cool if I didn't have to. | ||||||
| Comment by Scotty Allen [ 05/Aug/12 ] | ||||||
|
We just got bitten by this as well. We were hoping to use map reduce to set up permanent data warehouse fact tables. This design choice makes this significantly clunkier. | ||||||
| Comment by Tianon Gravi [ 24/Jul/12 ] | ||||||
|
If you're going to use eval() to flatten, you definitely want to make sure you're well familiar with the following documentation regarding locking: http://www.mongodb.org/display/DOCS/Server-side+Code+Execution#Server-sideCodeExecution-Writelocks | ||||||
| Comment by Isaac Cambron [ 24/Jul/12 ] | ||||||
|
Thanks Doug. That's the solution I settled on too. It feels pretty silly though, so I was hoping someone had something better. I will try moving it into an eval() call, though. | ||||||
| Comment by Doug Hudson [ 24/Jul/12 ] | ||||||
|
Isaac, you have to forEach the MR target collection, then flatten and save to another collection (or update back to the same collection). For a large number of map reduce output documents, I recommend doing this as an eval() rather than through a client driver, for performance reasons. | ||||||
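A minimal sketch of the flattening step described here, assuming every output document has an object (not a scalar) under `value`. The function name `flattenMapReduceDoc` and the collection names `mr_out` and `flat_out` are hypothetical.

```javascript
// Lift the fields nested under "value" to the top level, keeping _id.
// Assumes doc.value is an object; a field named _id inside value would
// overwrite the key, so that case is not handled here.
function flattenMapReduceDoc(doc) {
  return Object.assign({ _id: doc._id }, doc.value);
}

// Sketch of the loop in the legacy mongo shell (collection names are assumptions):
// db.mr_out.find().forEach(function (doc) {
//   db.flat_out.save(flattenMapReduceDoc(doc));
// });

const example = flattenMapReduceDoc({ _id: "sensor-7", value: { reading: 21.5, count: 3 } });
console.log(JSON.stringify(example)); // {"_id":"sensor-7","reading":21.5,"count":3}
```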
| Comment by Isaac Cambron [ 24/Jul/12 ] | ||||||
|
Given that explicitly saving to another collection in finalize no longer works (as noted above), what's the best way to work around this? How are people flattening the values? | ||||||
| Comment by Chris Vincent [ 20/Jul/12 ] | ||||||
|
+1, would be extremely useful! | ||||||
| Comment by Sergey Martynov [ 28/Jun/12 ] | ||||||
|
+1 for "finalize" as a place to define document layout | ||||||
| Comment by Ian Greenhoe [ 15/Jun/12 ] | ||||||
|
+1 as well, and I particularly like the idea of using "finalize" for this. I'm interested in using this in much the same way as the other reporters – to maintain a permanent M/R collection. | ||||||
| Comment by MajiD [ 16/May/12 ] | ||||||
|
+1 for the feature. | ||||||
| Comment by Penn Taylor [ 24/Apr/12 ] | ||||||
|
+1 for this feature addition. We use mongo for object persistence in a C++ context, and the current output of MapReduce requires us to manually flatten the objects before we can access them in a meaningful way. I shouldn't have to know whether the document I'm using to populate an object is a first-gen document or the output of a MapReduce job. As things stand, we push the output of MapReduce into a special "MROUT" collection, then iterate through the documents in that collection, flatten them, and move them into the main collection. That seems like unnecessary work. | ||||||
| Comment by Michael Doberenz [ 28/Oct/11 ] | ||||||
|
Perhaps instead of adding another option, this behavior could be built into the finalize function directly. As I understand finalize, it's currently used to transform the final value into another value, and as such the "default" implementation of finalize would be the following:
What if finalize instead returned the document that should be stored in the permanent collection? In other words, the "default" implementation would be the following:
The documentation seems to suggest that finalize is useful in only a handful of situations. Perhaps this solution would both enhance the usefulness of an existing feature and be relatively low impact, especially given that the required change to replicate the current behavior is straightforward. | ||||||
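The two code snippets this comment refers to did not survive in this export. Based only on the surrounding description, they presumably resembled the sketch below. This is a reconstruction under that assumption, not the commenter's original code; all three function names are hypothetical, and the "proposed" behavior was never implemented in MongoDB.

```javascript
// Current behavior: finalize maps the reduced value to another value,
// so an identity finalize simply returns the value unchanged.
function currentDefaultFinalize(key, value) {
  return value;
}

// Proposed behavior (never implemented): finalize returns the whole document
// to store, so the default would reproduce today's { _id, value } wrapping.
function proposedDefaultFinalize(key, value) {
  return { _id: key, value: value };
}

// Under the proposal, flat output would only require a different finalize.
// Assumes value is an object.
function flatFinalize(key, value) {
  return Object.assign({ _id: key }, value);
}
```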
| Comment by Flori [ 05/Oct/11 ] | ||||||
|
I want to use mapreduce to get some aggregated data from one collection into an existing old one (with { reduce : "oldExistingCollection" }). Without this feature it isn't possible in a clean way, because the structure of the objects in the old existing collection is of course already in use. As a first step, it would already help to be able to change the "value" name to something else. | ||||||
| Comment by Doug Hudson [ 08/Sep/11 ] | ||||||
|
+1 for this. I have quite a few MR jobs that now forEach the target collection to flatten objects. For hundreds of millions of documents it's just a lot of extra work, and db size grows as the output collection is effectively duplicated. Indexes are also simplified when at the top level rather than nested in 'value'. (approach to save in finalize() is no longer valid as latest mongo doesn't expose db object to finalize, at least the last time I tried it) | ||||||
| Comment by Mathias Stearn [ 31/Aug/11 ] | ||||||
|
Related stackoverflow question: http://stackoverflow.com/questions/7257989/in-mongodb-mapreduce-how-can-i-flatten-the-values-object/ | ||||||
| Comment by Mark Wouters [ 09/Apr/11 ] | ||||||
|
I think this option is the only way I can start using my mapreduce collections as "regular" collections. In my case I want to prepare an exported/imported collection to be used in my application. I transform/complete the imported collection using mapreduce (e.g. to rename fields or add default values). The problem is that the result ends up in "value" field and there is no way I can work around this. Unless I structure my regular collections to be available under value. Data transformations through mr are nice, but not usable as new sources this way... | ||||||
| Comment by Mark Embling [ 07/Mar/11 ] | ||||||
|
I would also like to throw in my +1 for this, and for the same reason Micah describes: working with permanent MR-generated collections. In my case, this is particularly the case for MR runs which roll up or otherwise aggregate data according to various criteria - I currently end up with weird documents a bit like this (simplified for example purposes): { "_id": { "date": /* a date */, "sensor": /* a number */ }, "value": { "reading": /* a number */ }} As you can see, the relevant data is split across the "_id" and the "value" hashes, and that's just weird. I'd prefer to be able to create a document where the important three fields are at the top level. I couldn't really care less about what "_id" then contained (just an auto-generated ObjectID would be fine, as in other collections...). At present, I'm putting up with working with the documents as they come above, and currently looking into using a save-in-finalize() approach like Micah has talked about. Neither solution is as smooth as I'd like if I'm honest. I think this feature would smooth off this slightly confusing area perfectly. | ||||||
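The document shape this comment asks for can be sketched as a small client-side transformation. This assumes both `_id` and `value` are objects, as in the example above; the function name `liftKeyAndValue` and the sample values are hypothetical.

```javascript
// Lift the fields of a compound _id and of value to the top level.
// _id itself is omitted, so inserting the result elsewhere would
// get an auto-generated ObjectId.
function liftKeyAndValue(doc) {
  return Object.assign({}, doc._id, doc.value);
}

const rolledUp = { _id: { date: "2011-03-07", sensor: 4 }, value: { reading: 18.2 } };
console.log(JSON.stringify(liftKeyAndValue(rolledUp)));
// {"date":"2011-03-07","sensor":4,"reading":18.2}
```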
| Comment by Micah Wedemeyer [ 15/Feb/11 ] | ||||||
|
+1 I agree this would be nice for exactly the reason specified: working with permanent MR-generated collections. Currently, the way I handle this is by calling db.my_other_collection.save(value) in the finalize() function. It works, but feels very clunky, and it prevents me from using the 'out' parameter as intended. | ||||||
| Comment by Chris Eppstein [ 10/Feb/11 ] | ||||||
|
It's a polish feature, IMO. I've been very satisfied with the very smooth and polished interactions I've had with mongo, and this aspect stands out as very awkward. It's not like I can't live with it, it's just that given all the recent developments regarding mapreduce I feel this feature rounds out that feature set. Functionally, having documents formatted this way makes it much easier for me to integrate it with my language bindings (Mongoid in Ruby) and smooths out the data access in the application code. | ||||||
| Comment by Eliot Horowitz (Inactive) [ 10/Feb/11 ] | ||||||
|
Why do you need them in a different format? This feature by itself isn't the problem. | ||||||
| Comment by Chris Eppstein [ 10/Feb/11 ] | ||||||
|
It would be great if you could explain why you're hesitant to add this feature. Without it, I have to generate to a temp collection (that is no longer temporary and I have to now manually clean up) and then manually iterate over my results and insert them into another collection as a proper document. It's a bunch of unnecessary busy work. | ||||||
| Comment by Eliot Horowitz (Inactive) [ 10/Feb/11 ] | ||||||
|
Don't want to add more options unless there is a general consensus that this is needed. |