  1. Core Server
  2. SERVER-5149

Auto-generated short-lived collections for output from $merge or $out

    • Query Execution

      It would be useful to have the ability to store results from your computations in the server for future inspection in a way that would go away after some period of time (perhaps when the session ends?) without having to choose unique names.

      Original Description

      With MongoDB 1.8, the ability to have a map-reduce operation write its results to a temporary collection was removed, according to the documentation here: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-Outputoptions.

      This means that clients that want 'one shot' MR operations are forced either to use inline output or to manage the naming, creation, and reaping of the output collection themselves. If, as is very often the case, the total size of the output exceeds the maximum BSON object size (16MB these days), then you cannot use inline output, and the only option is for the client to manually manage the lifetime of the output collection. However, this is not as easy as it sounds:

      • The client must have some mechanism for generating unique names for the output collection. While not necessarily hard, this is annoying. It also creates a risk that less-than-diligent authors will pick a 'unique' name like 'my_secret_output_collection', which will inevitably collide months later when some other genius picks the same 'unique' name. The database server, however, can easily construct non-colliding temporary collection names and communicate the name back to the client in the MR response. Even if you don't restore the server-side reaping feature, the server could still synthesize a safe unique name for you.
      • The client needs to carry around the name of the output collection somewhere so that it can (try its best to) drop the collection later. This frustrates certain programming idioms like iterating the cursor from a coroutine/generator, since there now needs to be a post-iteration action that drops the collection.
      • The client needs to drop the collection when it is done with it, but is unequipped to do so reliably. No matter how many exception handlers, signal handlers, finalizers, or redundant power supplies and switches you pollute your code and infrastructure with, there is always a way for the client to fail to honor its obligation to drop the collection. The MongoDB server, on the other hand, is uniquely positioned to reap stale temporary collections, since it can easily detect that the client responsible for a collection has gone away and that the now-useless collection should be reaped. In addition, the server can much more easily work around the sort of failure modes noted above, since in the event of its own crash, it should easily be able to identify temporary collections on restart/shutdown and destroy them.
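The client-side workaround described in the three points above might look roughly like the following Python sketch. It is only an illustration under stated assumptions: `unique_output_name` and `results_then_drop` are hypothetical helper names, and the three callables stand in for whatever driver calls actually run the job, read the output collection, and drop it. The `finally` clause covers normal exhaustion, exceptions, and an explicit `close()` on the generator, but it cannot run if the process is killed or the machine loses power, which is exactly the gap described above.

```python
import uuid
from typing import Any, Callable, Iterable, Iterator


def unique_output_name(prefix: str = "tmp.out") -> str:
    """Build a collection name that is very unlikely to collide (hypothetical helper)."""
    return f"{prefix}.{uuid.uuid4().hex}"


def results_then_drop(
    run_job: Callable[[str], None],       # assumed: runs the job, writing output to the named collection
    read: Callable[[str], Iterable[Any]],  # assumed: returns an iterable over the named collection
    drop: Callable[[str], None],           # assumed: drops the named collection, tolerating absence
    prefix: str = "tmp.out",
) -> Iterator[Any]:
    """Yield the job's results, then make a best-effort attempt to drop the output."""
    name = unique_output_name(prefix)
    try:
        run_job(name)
        yield from read(name)
    finally:
        # Runs on exhaustion, on an exception, or when the generator is closed.
        # It does NOT run if the client process dies, so the collection can leak.
        drop(name)
```

With a real driver, the callables would wrap the driver's own operations for running the job, querying the output collection, and dropping it; those bindings are left out here because they depend on the client library in use.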

      Overall, I'm perplexed why this very useful feature was removed. It takes away something that the server should be easily able to do with very high reliability, and forces clients to make complex (but ultimately hopeless) efforts to re-implement the feature themselves.

            Assignee:
            backlog-query-execution [DO NOT USE] Backlog - Query Execution
            Reporter:
            andrew.morrow@mongodb.com Andrew Morrow (Inactive)
            Votes:
            1
            Watchers:
            8