Type: New Feature
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 1.8.0, 2.0.0
Component/s: Aggregation Framework, MapReduce
Labels:
- query-44-grooming

Assigned Teams:

Query Execution
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

It would be useful to have the ability to store results from your computations in the server for future inspection in a way that would go away after some period of time (perhaps when the session ends?) without having to choose unique names.

Original Description

In MongoDB 1.8, the ability to have a MapReduce job write its results to a temporary collection was removed, according to the documentation here: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Outputoptions.

This means that clients that want 'one shot' MR operations are forced either to use inline output, or to manage the naming, creation, and reaping of the output collection. If, as is very often the case, the total size of the results exceeds the maximum BSON object size (16MB these days), then you cannot use inline output, and the only option is for the client to manually manage the lifetime of the output collection. However, this is not as easy as it sounds:

The client must have some mechanism for generating unique names for the output collection. While not necessarily hard, this is annoying. It also creates a risk that less-than-diligent authors will pick a 'unique' name like 'my_secret_output_collection', which will inevitably collide months later when some other genius picks the same 'unique' name. The database server, however, can easily construct non-colliding temporary collection names and communicate the name back to the client in the MR response. Even if you don't restore the server-side reaping feature, the server could still synthesize a safe unique name for you.
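For illustration, a client-side stopgap could build the name from a random UUID. This is only a sketch: the function name and the `tmp.mr` prefix are hypothetical, and it does nothing about reaping the collection afterwards.

```python
import uuid


def temp_collection_name(prefix="tmp.mr"):
    """Return a collection name that is very unlikely to collide.

    The prefix is an arbitrary, made-up convention; uuid4 supplies
    122 random bits, rendered as 32 hex characters.
    """
    return f"{prefix}.{uuid.uuid4().hex}"
```

Every client still has to remember to call something like this, which is exactly the burden the server could take on instead.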

The client needs to carry the name of the output collection around somewhere so that it can (try its best to) drop the collection later. This frustrates certain programming idioms like iterating the cursor from a coroutine/generator, since there now needs to be a post-iteration action that drops the collection.
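A rough sketch of what that post-iteration action looks like in a Python generator, assuming a PyMongo-style database object (`db[name]`, `find()`, `drop_collection(name)`). The names `iterate_then_drop` and `run_job` are hypothetical; `run_job` stands in for whatever call runs the MR job with the given output collection name.

```python
import uuid


def iterate_then_drop(db, run_job):
    """Yield the job's output documents, then drop the output collection.

    db      -- PyMongo-style database object (assumption)
    run_job -- hypothetical callable that runs the job and writes its
               results into the collection whose name it is given
    """
    name = f"tmp.mr.{uuid.uuid4().hex}"
    try:
        run_job(name)
        yield from db[name].find()
    finally:
        # Runs when iteration finishes, when the generator is closed,
        # or when an exception propagates through it.
        db.drop_collection(name)
```

The `finally` block covers normal exhaustion and early `close()`, but nothing runs if the process is killed or the machine loses power, so the collection is left behind in those cases.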

The client needs to drop the collection when it is done with it, but is unequipped to do so reliably. No matter how many exception handlers, signal handlers, finalizers, or redundant power supplies and switches you pollute your code and infrastructure with, there is always a way for the client to fail to honor its obligation to drop the collection. The MongoDB server, on the other hand, is uniquely well positioned to reap stale temporary collections, since it can easily detect that the client responsible for the collection has gone away and that the now-useless collection should be reaped. In addition, the server can much more easily work around the sort of failure modes noted above, since in the event of its own crash, it should easily be able to identify temporary collections on restart/shutdown and destroy them.

Overall, I'm perplexed why this very useful feature was removed. It takes away something that the server should be easily able to do with very high reliability, and forces clients to make complex (but ultimately hopeless) efforts to re-implement the feature themselves.

Assignee: [DO NOT USE] Backlog - Query Execution
Reporter: Andrew Morrow (Inactive)
Participants: [DO NOT USE] Backlog - Query Execution, Andrew Morrow, Asya Kamsky, Charlie Swanson, Eliot Horowitz, Esha Bhargava, Pawel Terlecki
Votes: 1
Watchers: 8

Created: Feb 29 2012 10:48:52 PM UTC
Updated: Dec 06 2022 05:35:36 AM UTC
Resolved: Feb 04 2022 03:09:18 PM UTC
