[SERVER-13201] Allow new Aggregation $merge stage to explicitly name a DB to write to
Created: 14/Mar/14  Updated: 30/Oct/20  Resolved: 28/Aug/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 2.6.0-rc1
Fix Version/s: 4.1.3

Type: Improvement
Priority: Major - P3
Reporter: Paul Done
Assignee: Kyle Suarez
Resolution: Done
Votes: 13
Labels: usability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-35895 Add ability for $out to write to remo... Closed
is depended on by SERVER-36081 Write auth tests for $out and bypassD... Closed
is depended on by SERVER-36832 Allow $out to different database Closed
Documented
is documented by DOCS-12025 Docs for SERVER-13201: Allow new Aggr... Closed
Duplicate
is duplicated by SERVER-13547 Aggregation framework should support ... Closed
is duplicated by SERVER-35898 Support writes to other databases usi... Closed
Related
related to SERVER-51886 $lookup + $merge pipeline may fail to... Closed
Backwards Compatibility: Major Change
Sprint: Query 2018-07-30, Query 2018-08-13, Query 2018-08-27, Query 2018-09-10

 Description   

Using 2.6.0rc1 (Linux x86-64) I've been doing some research into speeding up some Aggregation use cases via Parallelisation. For the full investigation see here: http://pauldone.blogspot.co.uk/2014/03/mongoparallelaggregation.html

One of the main findings was that, although a good speed-up can be achieved with multiple threads each running aggregate() on a subset of the collection's data, further performance improvement was held back by the threads queueing for the DB write-lock when writing to result collections in the same database.

In the tests, the $out operator http://docs.mongodb.org/master/reference/operator/aggregation/out/ is used to specify a different output collection for each thread's aggregate() invocation. However, the $out operator does not allow one to specify a named database in addition to a named collection. As a result, the output always goes to the same database as the aggregation's source collection, and it is not possible to use different databases to remove the write-lock bottleneck for such use cases.

Please consider enhancing the $out operator to support declaring a target database in addition to a target collection, in a similar manner to how this can already be achieved today in MongoDB's MapReduce function (specifically the mapReduce() function's 'out' option - http://docs.mongodb.org/manual/reference/method/db.collection.mapReduce/#mapreduce-out-mtd )
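
For illustration, here is a minimal sketch of the shape of the mapReduce() 'out' option when it names a target database. The helper mapReduceOut and the collection and database names are hypothetical and exist only for this example; the sketch assumes the documented { <action>: <collection>, db: <database> } form of the option.

```javascript
// Hypothetical helper (not part of MongoDB): builds the 'out' option for
// mapReduce(), optionally naming a target database.
function mapReduceOut(action, collection, db) {
  const allowed = ["replace", "merge", "reduce"];
  if (!allowed.includes(action)) {
    throw new Error("unsupported out action: " + action);
  }
  const out = { [action]: collection };
  if (db !== undefined) {
    out.db = db; // write to this database instead of the source database
  }
  return out;
}

// Hypothetical names: one result collection in its own database.
const outSpec = mapReduceOut("replace", "results_0", "results_db_0");
console.log(JSON.stringify(outSpec));
// prints {"replace":"results_0","db":"results_db_0"}
```

In the mongo shell, such a spec could be passed as db.coll.mapReduce(mapFn, reduceFn, { out: outSpec }), where mapFn and reduceFn stand for the user's own functions.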

Thanks Paul



 Comments   
Comment by Kyle Suarez [ 28/Aug/18 ]

*edited to reflect renaming of output stage to existing collections to $merge*

As part of the new $merge features slated for MongoDB 4.2, users can use $merge to write the contents of an aggregation to a collection in a foreign database (that is, a database separate from the aggregation database). -This works for modes "insertDocuments" and "replaceDocuments", and is also supported in a sharded cluster.-

The output database is specified in the "db" field of the $merge specification:

> use test
switched to db test
> db.coll.find()
{ "_id" : "hello world" }
> db.coll.aggregate({$merge:{into: {db: "foreign", coll: "output"}}})
> use foreign
switched to db foreign
> db.output.find()
{ "_id" : "hello world" }

Output to a foreign database is not yet supported for $out; that work will be tracked in SERVER-36832.
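
As a rough sketch of how the parallel use case from the description might use this, the snippet below builds one pipeline per partition of the source data, each ending in a $merge stage that targets its own database. The helper buildPartitionPipelines, the field name "key", the range boundaries, and the database and collection names are all hypothetical. It assumes the into: {db: ..., coll: ...} form documented for the released $merge stage.

```javascript
// Hypothetical helper (not part of MongoDB): builds one aggregation pipeline
// per [lower, upper) range of `field`, each writing to a separate database.
function buildPartitionPipelines(boundaries, field, dbPrefix, coll) {
  const pipelines = [];
  for (let i = 0; i + 1 < boundaries.length; i++) {
    pipelines.push([
      // Restrict this pipeline to its own slice of the source collection.
      { $match: { [field]: { $gte: boundaries[i], $lt: boundaries[i + 1] } } },
      // Write the slice's results to a database dedicated to this partition.
      { $merge: { into: { db: dbPrefix + i, coll: coll } } },
    ]);
  }
  return pipelines;
}

// Two partitions: key in [0, 100) goes to results_0, key in [100, 200) to results_1.
const pipelines = buildPartitionPipelines([0, 100, 200], "key", "results_", "output");
console.log(JSON.stringify(pipelines[1]));
```

Each pipeline could then be run from its own thread or shell with db.coll.aggregate(pipeline). Whether this removes the write bottleneck depends on the server version and storage engine in use.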

Comment by Githook User [ 28/Aug/18 ]

Author:

{'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com', 'username': 'ksuarz'}

Message: SERVER-13201 support $out to foreign database

Allows $out to write to a database foreign to the aggregation namespace
when the mode is "insertDocuments" or "replaceDocuments"
Branch: master
https://github.com/mongodb/mongo/commit/15d627c3b7b9b1b2ca4d2f729102f730a0568c1c

Comment by Jon Rangel (Inactive) [ 14/Apr/15 ]

This is also useful in a sharded cluster. If the output of an aggregation can go to a different database, then the outputs from different aggregations can be sent to different shards. Queries against those output collections then do not bottleneck on the primary shard of the source database.

Comment by John Butler [ 23/Apr/14 ]

One of the reasons this is very helpful is that it allows writing temp / search results to a DB that is not part of a replica set.
