[SERVER-27013] Allow results of the aggregation framework to exceed 16MB. Created: 13/Nov/16  Updated: 06/Dec/22  Resolved: 22/Dec/16

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 3.2.11
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Alireza Assignee: Backlog - Query Team (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-12305 Allow command request and response BS... Backlog
Related
is related to SERVER-5923 Increase max document size to at leas... Closed
Assigned Teams:
Query
Participants:

 Description   

We're running aggregations over tens of millions of documents roughly every 30 minutes, and most of the time they fail because one of the returned "documents" exceeds the 16MB limit.

I know that MongoDB's internals, the way it copies documents into and out of RAM, would be heavily affected if this limit were raised for stored documents, but aggregation results are a profoundly different case: most of the time we don't want to store them, we just want to use them.

My use case here is simple: I want to $group documents by an ID, $push ALL unprocessed documents that point to that ID into a sub-field of the grouped document, fetch them and run some logic on them, and finally bulk-update them to mark them as processed. (We don't care about RAM, since we've dedicated ~200GB of RAM to this.)
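
For illustration only, here is a minimal pymongo sketch of that workflow; the collection name and the "nid" / "is_aggregated" fields are assumptions based on this ticket, not the real schema:

from pymongo import MongoClient

client = MongoClient()
coll = client["reporting"]["events"]  # illustrative names only

pipeline = [
    {"$match": {"is_aggregated": {"$ne": True}}},
    # Group unprocessed documents by their parent ID and collect their _ids.
    {"$group": {"_id": "$nid", "messages": {"$push": "$_id"}}},
]

# allowDiskUse lets the $group stage spill to disk for very large inputs,
# but each returned group document is still capped at 16MB.
for group in coll.aggregate(pipeline, allowDiskUse=True):
    # ... application logic on group["messages"] ...
    coll.update_many({"_id": {"$in": group["messages"]}},
                     {"$set": {"is_aggregated": True}})

The 16MB cap bites on the pushed "messages" array of a single group document, which is exactly the failure described above.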

It's really a headache that the 16MB limit applies to aggregation results at all! Why is it required? Is it a code limitation or a design decision?

(As an alternative, we're currently evaluating ElasticSearch for this workload, but we've been really happy with MongoDB apart from the aggregation limits.)



 Comments   
Comment by David Storch [ 22/Dec/16 ]

Hi SpiXy,

After some internal discussion involving asya and the MongoDB Query Team, we have decided to fold this request into a more general existing ticket about exceeding 16MB in command responses: SERVER-12305. Please watch that ticket for updates, and let me know if you have any further questions or concerns! I'm resolving as a duplicate.

Best,
Dave

Comment by Alireza [ 20/Dec/16 ]

@Asya

Well actually, I can iterate over the result (message IDs) in chunks and update them; in my experience that's actually better, since MongoDB has been faster at updating several chunks of 100K documents than at running a single update involving millions of documents.

What I can't achieve (and hence the problem) is this: if I don't create the messages array, I have no reference to the rows I have already aggregated.
See? Doing the aggregation and then updating the documents like:

update({"is_aggregated": False}, {"$set": {"is_aggregated": True}}, multi=True)

will not be the same as doing this:

update({"_id": { "$in": messages}}, {"$set": {"is_aggregated": True}}, multi=True);

The reason: thousands of new messages are being added every second, so the set of messages that the $match stage in the pipeline found will not be the same as the set the update query's filter finds, even though the two queries look the same, because the update query probably runs minutes after the aggregation's scan started (aggregating takes about a minute).

So briefly put, I need a reference to all the rows that the current aggregation results are built on, so that I can update only those and nothing else.
Without being able to create that messages array, this is going to be much harder to implement.
Hope it's clear now.
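
For illustration, a chunked version of that update in pymongo might look like the sketch below; the 100K chunk size comes from the comment above, and the helper name is made up:

def mark_aggregated(coll, message_ids, chunk_size=100000):
    # Flag exactly the documents whose _ids the aggregation already collected,
    # in chunks, so each $in array stays far below the 16MB document limit.
    for start in range(0, len(message_ids), chunk_size):
        chunk = message_ids[start:start + chunk_size]
        coll.update_many({"_id": {"$in": chunk}},
                         {"$set": {"is_aggregated": True}})

Updating by explicit _id lists, rather than re-running the {"is_aggregated": False} filter, avoids touching documents that arrived after the aggregation's $match ran.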

Comment by Asya Kamsky [ 20/Dec/16 ]

SpiXy so you're saying that the "messages" array for some values of "$nid" is bigger than (or close to) 16MB? You won't be able to pass that into an update command then - it's subject to the same maximum document size limitation...

Comment by Alireza [ 08/Dec/16 ]

Well sure, below is roughly what the aggregation looks like:

{
    "$match": {
        'is_aggregated': {
            '$ne': True
        },
        OTHER FILTERS
    }
},
{
    "$project":{
        SOME PROJECTION           
    }
},
{
    "$group": {
        "_id": "$nid",
        'messages': {
            '$push': '$_id'
        },
        OTHER_THINGS
    }
}

The actual pipeline is much longer but mostly consists of integer sums and conditions. The problem comes from the stage where I push all _ids of the matched documents into a messages field of the grouped document, so that I can later easily update them, like:

update({"_id": { "$in": messages}}, {"$set": {"is_aggregated": True}}, multi=True);

Since the messages array is large (possibly tens of millions of _ids), it's been a headache to rework everything to operate in chunks, reduce the interval of this aggregation to something like every 5 minutes, and so on...
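
For rough scale: each ObjectId pushed into a BSON array costs 12 bytes of value plus a few bytes of per-element overhead (a type byte and the array index encoded as a string key), so a single grouped document reaches the 16MB cap at somewhere around 800,000 pushed _ids, far short of tens of millions.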

Comment by Asya Kamsky [ 08/Dec/16 ]

SpiXy can you give an example of exactly what the aggregation is that produces the larger-than-16MB document returned to your application? Feel free to redact sensitive information; I don't need to see actual documents, just roughly what they look like and what the pipeline looks like.
