[SERVER-5149] Auto-generated short-lived collections for output from $merge or $out
Created: 29/Feb/12  Updated: 06/Dec/22  Resolved: 04/Feb/22

Status: Closed
Project: Core Server
Component/s: Aggregation Framework, MapReduce
Affects Version/s: 1.8.0, 2.0.0
Fix Version/s: None

Type: New Feature
Priority: Major - P3
Reporter: Andrew Morrow (Inactive)
Assignee: Backlog - Query Execution
Resolution: Done
Votes: 1
Labels: query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Execution
Participants:

 Description   

It would be useful to have the ability to store results from your computations in the server for future inspection in a way that would go away after some period of time (perhaps when the session ends?) without having to choose unique names.

Original Description

With mongo 1.8 the ability to have a mapreduce write its results to a temporary table was removed, according to the documentation here: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Outputoptions.

This means that clients that want 'one shot' MR operations are forced either to use inline output or to manage the naming, creation, and reaping of the output collection themselves. If, as is very often the case, the total size of the result exceeds the maximum BSON document size (16MB these days), then you cannot use inline output, and the only option is for the client to manually manage the lifetime of the output collection. However, this is not as easy as it sounds:

  • The client must have some mechanism for generating unique names for the output collection. While not necessarily hard, this is annoying. It also creates a risk that less-than-diligent authors will pick a 'unique' name like 'my_secret_output_collection', which will inevitably collide months later when some other genius picks the same 'unique' name. The database server, however, can easily construct non-colliding temporary collection names and communicate the name back to the client in the MR response. Even if you don't restore the server-side reaping feature, the server could still synthesize a safe unique name for you.
  • The client needs to carry the name of the output collection around somewhere so that it can (try its best to) drop the collection later. This frustrates certain programming idioms like iterating the cursor from a coroutine/generator, since there now needs to be a post-iteration action that drops the collection.
  • The client needs to drop the collection when it is done with it, but is unequipped to reliably do so. No matter how many exception handlers, signal handlers, finalizers, or redundant power supplies and switches you pollute your code and infrastructure with, there is always a way that the client fails to honor its obligation to drop the collection. On the other hand, the mongodb server is in a uniquely qualified position to reap stale temporary collections, since it can easily detect that the client responsible for the collection has gone away and that it should now reap the useless collection. In addition, the server can much more easily work around the sort of failure modes noted above, since in the event of its own crash, it should easily be able to identify temporary collections on restart/shutdown and destroy them.
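As a rough illustration of the client-side bookkeeping the three points above describe, here is a minimal Python sketch. It is an assumption-laden example, not anything from the server or a driver: the helper names unique_collection_name and scratch_collection and the "tmp_out" prefix are hypothetical, and db is assumed to be any object exposing drop_collection(name), as pymongo's Database does.

```python
import contextlib
import uuid


def unique_collection_name(prefix="tmp_out"):
    """Return a collection name that is very unlikely to collide with another.

    Hypothetical helper: the prefix and naming scheme are assumptions.
    """
    return f"{prefix}_{uuid.uuid4().hex}"


@contextlib.contextmanager
def scratch_collection(db, prefix="tmp_out"):
    """Yield a fresh output collection name and drop that collection on exit.

    `db` is assumed to expose drop_collection(name), as pymongo's Database does.
    """
    name = unique_collection_name(prefix)
    try:
        yield name
    finally:
        # Best effort only: this line never runs if the client process is
        # killed or the host loses power, so the collection can still leak.
        db.drop_collection(name)
```

With a pymongo database handle, usage might look like `with scratch_collection(db) as name:` followed by an aggregation on some source collection whose pipeline ends in `{"$out": name}`. The finally clause covers ordinary exceptions but not a crashed client, which is exactly the gap that server-side reaping would close.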

Overall, I'm perplexed why this very useful feature was removed. It takes away something that the server should be easily able to do with very high reliability, and forces clients to make complex (but ultimately hopeless) efforts to re-implement the feature themselves.



 Comments   
Comment by Esha Bhargava [ 04/Feb/22 ]

Closing these tickets as part of the deprecation of mapReduce.

Comment by Pawel Terlecki [ 18/Jan/20 ]

Support for temp collections is fundamental if one wants Mongo to act as a node in any cross-database federated processing. For example, in ETL with Alteryx, Tableau, and basically any other cross-db processing engine, it is the only way to move pieces of data across databases, e.g. for joins. Without this feature, data from Mongo will always need to be extracted fully for processing, even if most of the data in the scenario is in Mongo. This will happen in live federated models involving Mongo and other databases in Tableau. Some data blending scenarios will be completely unavailable.

In addition, large filters are often externalized by temp collections for fast filtering. E.g. Tableau is extremely slow against data sources that do not support temp tables. In our case, we first need to fix our lookups with SBE to actually be faster in this scenario.

 

Comment by Charlie Swanson [ 22/Aug/19 ]

Converting this to a feature request since it's been long enough since that release that I wouldn't consider this a bug anymore.

Comment by Asya Kamsky [ 22/Mar/19 ]

acm, is this an issue? The way mr (and agg) output works now, it's written to a temp collection that's then renamed, so is that sufficient? I'm not sure what the pre-1.8 behavior was.

Comment by Eliot Horowitz (Inactive) [ 01/Mar/12 ]

The major problem was that the previous incarnation didn't make a lot of sense as designed.

Agree it is a good feature, just not sure on all the design issues at this point.
