[SERVER-27013] Allow results of the aggregation framework to exceed 16MB. Created: 13/Nov/16 Updated: 06/Dec/22 Resolved: 22/Dec/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | 3.2.11 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Alireza | Assignee: | Backlog - Query Team (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Query |
| Participants: | |
| Description |
|
We're doing aggregations over tens of millions of documents roughly every 30 minutes, and most of the time they fail because one of the returned "documents" exceeds the 16MB limit. I know that MongoDB's internal structure, the way it copies and moves documents into and out of RAM, would be heavily affected if this limit were raised for stored documents, but aggregation results differ from that in a profound way: most of the time we don't want to store them, we just want to use them. My use case is simple: I want to $group documents by an id, $push ALL unprocessed documents that point to that id into a sub-field of the grouped document, fetch them and do some logic with them, and finally bulk-update them to mark them as processed. (We don't care about RAM, since we've thrown ~200GB of RAM at this.) It's really a headache that the 16MB limit even applies to aggregation results! Why is it required? Is it just a code limitation or a design decision? (As a solution, we're currently evaluating ElasticSearch for what we do, but we were really happy with MongoDB other than the aggregation limits.) |
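A minimal sketch of the workflow being described, assuming hypothetical collection and field names (messages, nid, processed); the real pipeline is reportedly much longer, but the shape is: group unprocessed documents by the id they point to and push them into a sub-field of the grouped result.

```javascript
// Illustrative sketch only (hypothetical names, mongo shell).
// The pushed array is what can blow past the 16MB per-document limit
// when millions of unprocessed messages share the same nid.
db.messages.aggregate([
  { $match: { processed: false } },
  { $group: {
      _id: "$nid",
      docs: { $push: "$$ROOT" }   // or push just "$_id" to keep the group smaller
  } }
])
```

The application would then run its logic over each group and bulk-update the grouped messages to processed: true, which is the step discussed in the comments below.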
| Comments |
| Comment by David Storch [ 22/Dec/16 ] |
|
Hi SpiXy, After some internal discussion involving asya and the MongoDB Query Team, we have decided to fold this request into a more general existing ticket about exceeding 16MB in command responses: SERVER-12305. Please watch that ticket for updates, and let me know if you have any further questions or concerns! I'm resolving as a duplicate. Best, |
| Comment by Alireza [ 20/Dec/16 ] |
|
@Asya Well actually, I can iterate over the result (message ids) in chunks and update them; that's actually better in my experience, since MongoDB has been faster updating several chunks of 100K documents than running a single update involving millions of documents. What I can't achieve (and hence the problem) is this: if I don't build the messages array, I have no reference to the rows the aggregation has already covered, so updating only the ids collected during the aggregation will not be the same as re-running the same $match in the update. The reason is that thousands of new messages are being added every second, so the chunk of messages that the $match stage in the pipeline found will not be the same as the one that the $match would find in the update query, despite the two queries looking identical, since the update is probably run minutes after the aggregation's find (aggregating takes about a minute). So, briefly put, I need a reference to all rows that the current aggregation results are built on, so that I can update exactly those and nothing else. |
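A minimal sketch of the distinction being drawn here, with the same hypothetical names (messages, nid, processed): updating exactly the _ids collected during the aggregation is not equivalent to re-running the same $match at update time, because the latter also catches documents inserted after the aggregation started.

```javascript
// Illustrative only. Variant A: update exactly the documents the
// aggregation saw, via the _ids collected into the grouped arrays.
var ids = [];
db.messages.aggregate([
  { $match: { processed: false } },
  { $group: { _id: "$nid", _messages: { $push: "$_id" } } }
]).forEach(function(group) {
  ids = ids.concat(group._messages);
});
db.messages.update({ _id: { $in: ids } },
                   { $set: { processed: true } },
                   { multi: true });

// Variant B: re-run the filter at update time. This also flips messages
// inserted after the aggregation started, which is exactly what must be
// avoided here. (And with millions of ids, Variant A's $in runs into the
// same 16MB command-size ceiling mentioned in the next comment.)
db.messages.update({ processed: false },
                   { $set: { processed: true } },
                   { multi: true });
```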
| Comment by Asya Kamsky [ 20/Dec/16 ] |
|
SpiXy so you're saying that the "messages" array for some values of "$nid" is bigger than (or close to) 16MB? You won't be able to push that into an update command then - it's subject to the same maximum document size limitation... |
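A back-of-the-envelope check of why such an array approaches the limit: each ObjectId element of a BSON array costs roughly 15-20 bytes (a type byte, the array index as a key string, and 12 bytes of ObjectId), so on the order of a million ids already adds up to about 19MB. In the legacy mongo shell this can be verified with Object.bsonsize():

```javascript
// Illustrative only: measure the BSON size of a million generated ObjectIds.
var arr = [];
for (var i = 0; i < 1000000; i++) { arr.push(ObjectId()); }
print((Object.bsonsize({ _messages: arr }) / (1024 * 1024)).toFixed(1) + " MB");
// prints roughly 19 MB, already past the 16MB per-document limit
```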
| Comment by Alireza [ 08/Dec/16 ] |
|
Well sure, below is roughly what the aggregation looks like (see the sketch after this comment).
The actual pipeline is much longer, but it mostly consists of integer sums and conditions; the problem comes from the stage where I push the $_ids of all matched documents into a _messages field of the grouped document, so that I can later easily update() them.
Since the _messages array is large (possibly made up of tens of millions of ids), it's been a headache to rework everything to run in chunks, reduce the interval of this aggregation to something like every 5 minutes, and so on ... |
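A sketch of the chunked variant mentioned in this comment and above, again with hypothetical names: walk each grouped _messages array in slices (100K ids, matching the chunk size quoted earlier) so that no single update command has to carry the whole array. Note that this only helps on the update side; the grouped document itself still has to stay under 16MB inside the pipeline, which is the limit this ticket asks to relax.

```javascript
// Illustrative only: mark the collected ids as processed in chunks of 100K.
var CHUNK = 100000;
db.messages.aggregate([
  { $match: { processed: false } },
  { $group: { _id: "$nid", _messages: { $push: "$_id" } } }
]).forEach(function(group) {
  for (var i = 0; i < group._messages.length; i += CHUNK) {
    var slice = group._messages.slice(i, i + CHUNK);
    // ... application logic over this slice ...
    db.messages.update({ _id: { $in: slice } },
                       { $set: { processed: true } },
                       { multi: true });
  }
});
```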
| Comment by Asya Kamsky [ 08/Dec/16 ] |
|
SpiXy can you give an example of exactly what the aggregation is that results in the larger-than-16MB document that gets returned to your application? Feel free to redact sensitive information; I don't need to see actual documents, just roughly what they look like and what the pipeline looks like. |