[SERVER-431] Increase the 4mb BSON Object Limit to 16mb Created: 19/Nov/09  Updated: 17/Sep/21  Resolved: 09/Dec/10

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 1.7.4

Type: Improvement Priority: Minor - P4
Reporter: Damon Cortesi Assignee: Eliot Horowitz (Inactive)
Resolution: Done Votes: 31
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-1873 group can only process 10k unique keys Closed
Participants:

 Description   

Mostly for tracking who/how many others are interested in this, but it would be nice to have the option of >4MB objects.

My specific use case is the storage of Twitter social graph data. It's not too much of an issue at the moment, as it takes about a million IDs to overflow the limit, but it would be "nice to have" not to have to hack up some other solution.
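
For a rough sense of scale, here is a back-of-envelope sketch using PyMongo's bson module (PyMongo 3.9+); storing the IDs as a plain array of int32s in a single document is only a guess at the reporter's actual layout:

    # Estimate how large a follower-ID array is once BSON-encoded.
    # Assumes one document holding a plain array of int32 IDs, which is
    # only a guess at the actual schema.
    import bson  # ships with PyMongo

    follower_ids = list(range(1_000_000))  # hypothetical follower IDs
    doc = {"_id": "some_user", "followers": follower_ids}

    encoded = bson.encode(doc)  # bson.encode() is available in PyMongo 3.9+
    print(f"{len(encoded) / (1024 * 1024):.1f} MB")
    # Prints roughly 11 MB for a million int32 IDs (each array element costs
    # a type byte, its decimal index as the key, and 4 bytes of value), so
    # the exact count at which a given limit is crossed depends on how the
    # IDs are encoded.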



 Comments   
Comment by Ivan Fioravanti [ 17/Sep/21 ]

Sorry to vote on an old ticket, but 11 years after the increase from 4MB to 16MB, I think a further increase to 64MB makes sense, no? I've opened a new ticket to consider this here: https://jira.mongodb.org/browse/SERVER-60040?filter=-4

Comment by senthil [ 10/Apr/17 ]

We are in the book publishing industry. We have a lot of book metadata and feed it back into reports, and although we can afford better infrastructure (> 256 GB of RAM, quad-core multiprocessor servers, and SSDs), the limit still does not meet our requirements. Please don't restrict this limit, as it is a hindrance for MongoDB users.

Comment by Lewis Geer [ 26/Feb/15 ]

Hi,

Sorry to comment on an old ticket, but there are real-world use cases for large documents, especially in biomedical applications. Let's say we have a collection of possible drugs. Some of these drugs we know almost nothing about, perhaps a registry name, a supplier, and a chemical structure. Others, like aspirin or penicillin, we know a whole lot about: clinical studies, pharmacology, and so on. So the average document is relatively small, but there are a few documents that are huge. You can't omit these huge documents, as they are of great interest. This happens over and over again in biomedical databases; for example, you might know a lot about an organism named "human" but not a lot about "tasseled wobbegongs" and most other organisms. Of course, this can be coded around, but it would be nice not to be forced to do so, and it might help adoption of MongoDB in organizations that deal with biomedical information, like large research organizations.

Thanks,
Lewis

Comment by Roger Binns [ 05/Jun/13 ]

Is there a ticket for getting rid of this limit (or handling it the way John suggested)?

I'm now hitting the 16MB limit, which means I have to write and test two code paths - one for the majority of data and one for the outliers. We don't run MongoDB on any machine with less than 32GB of RAM, so the current arbitrary limit does not help me in any way. In fact, it wastes my time by forcing me to write and test more code.

Comment by Ron Mayer [ 23/Mar/11 ]

Eliot wrote: "There is always going to be a limit, even if it's crazy high like 2GB. So it's really a question of what it is."

If that's the question, my vote would be for "crazy high like 2GB".

Well over 99.99% of the documents I'm storing fit comfortably in 4MB. However, the source data we're bringing into MongoDB (XML docs in this format: http://www.niem.gov/index.php from hundreds of government systems) doesn't have any hard constraint on document size.

Yes, it's understandable that a huge document would be slow.

No, it's not an option to simply drop the document.

And it does kinda suck to have to code differently for the one-in-ten-thousand large documents.

Comment by Eliot Horowitz (Inactive) [ 12/Jan/11 ]

We still believe the benefits of limiting to a fixed size outweigh the benefits of no max size.

Can you open a new ticket to track interest/thoughts?

This ticket won't change for sure, and definitely not before 1.8

Comment by Roger Binns [ 12/Jan/11 ]

@Eliot: The problem is that there is no easy workaround. Any diligent developer is going to worry about these boundary conditions, and the point of putting the data in a database is that you really need the data saved. If the database rejects the data, then you have to code a plan B, which is a lot of work to foist on every application. You saw how much more work I had to do in an earlier message, and even that is far more brittle and has far more failure modes. (I also haven't written test code for it yet, but that is going to be a huge amount more.) This arbitrary limit means every client has to be coded with two ways of accessing data - regular and oversize. Solving it once at the database layer for all clients is far preferable.

I very much agree with John's list of five. Note that none of those numbers are arbitrary, whereas the current limit is. I'll also admit that I was one of those people who thought the 4MB limit was perfectly fine and that anyone going over it wasn't handling their data design well. Right up until the moment my data legitimately went over 4MB ...

Comment by John Crenshaw [ 12/Jan/11 ]

I think it is safe to say that everybody will accept any/all of the limits below without disappointment:
1. BSON objects must be smaller than the chunk size
2. BSON objects larger than 16MB may be much slower to return in a query (and/or slower to query for the portions beyond the 16MB threshold).
3. BSON objects must be smaller than 2GB on 32-bit systems (and smaller than some 64-bit limit).
4. BSON objects must be smaller than the amount of memory available to mongod.
5. Any other obvious system limits

The big problem is not whether we will normally want to store that much data in a single record, but whether it MIGHT get that large under extraordinary conditions. If we were dealing with records that were likely to get this large, we would be foolish to not restructure the code. Conversely, it seems rather silly to use a complicated model and have to send multiple queries to get the job done, just to avoid problems that might happen if somehow the structure becomes large enough to overflow the limits. The best model in this case (really) is the one that works best under 99.9% of conditions, but we can't use that model if it might overflow in the edge cases, even if it normally only overflows just a little. In real world terms, we're trying to avoid the case where that one user does something a bit strange (like writing a book in the comments), and overflows the record limits. Right now, avoiding this means restructuring the data into multiple collections and records anytime we don't have enough control over size or quantity of entries in an array.

There are two types of structure that I can think of that might overflow in the edge cases. First:
1. Collection contains an array (especially with recursive schema, which is a uniquely useful capability of document databases)
2. Entries in this array might contain large chunks of data
3. The content of the data segment might be important for query purposes

Some things that I thought of that might be like this are:
1. Storing the extracted contents of an archive (for querying or searching). (Even if the upload size is limited to just 1-2MB, there is a chance that an archive could overflow 16MB when extracted.)
2. Raw email data (mime encoded) stored as a thread (99.9% of the time doing this in a single record is no problem, but eventually some nut will directly embed a huge family Christmas photo, send to the extended family, and get 20 replies back and forth where nobody deleted the original photo from the body before replying.)

The second structure is slightly similar to the first:
1. Collection contains an array or tree structure
2. Array or tree might need to collect an unusually large number of nodes, even though nodes might be generally small in size.

Some things that I thought of that might be like this are:
1. Comments on an article (Scenario might be Digg + especially verbose commenters + especially aggressive spambots)
2. The Twitter Social Graph (actually, any social graph of sufficient popularity that someone can collect a couple hundred thousand friends)
3. Full Text Index for documents that are uploaded and stored elsewhere (Someone uploads the Enron emails.)
4. Access logs for a user (can you imagine if a user used this "limit" to hide doing "bad things"?)
5. Historical information on a record (think "history of changes" in Wikipedia on the "Health Care Bill" page)

Sure, you can work around all these cases by adjusting the schema, but the most obvious schema, and the one that works best for 99.99% of the records in these cases, can't be used, because it might overflow at just the worst time. Adjusting the schema generally requires mountains of additional application code, and is less stable. This is why people are hoping for a system that manages to "somehow" behave itself when things go beyond the "normal" limits.
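
As a concrete illustration of the first item in the second list, here is a minimal PyMongo sketch of the "obvious" embedded-comments schema; collection and field names are illustrative, not from the ticket:

    # The "obvious" schema: comments embedded in the article document.
    # Works for the 99.99% case; the outlier thread eventually makes the
    # server reject the update, forcing a second, "oversize" code path.
    from pymongo import MongoClient
    from pymongo.errors import DocumentTooLarge, WriteError

    articles = MongoClient("mongodb://localhost:27017").blog.articles
    articles.insert_one({"_id": "post-1", "title": "Hello", "comments": []})

    def add_comment(post_id, author, body):
        try:
            articles.update_one(
                {"_id": post_id},
                {"$push": {"comments": {"author": author, "body": body}}},
            )
        except (DocumentTooLarge, WriteError):
            # The featured-on-Digg edge case: the document has hit the BSON
            # size limit and a fallback schema (e.g. one document per
            # comment) is suddenly required.
            raise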

Comment by Eliot Horowitz (Inactive) [ 12/Jan/11 ]

The same argument can be made at 501MB and at 17MB; once there's a limit, there's a limit.

One hard technical limit is that an object has to fit in RAM.

Can you give an example of your schema where you'd want documents that large?

Comment by Julian Morrison [ 11/Jan/11 ]

The problem is not 500MB documents; it's situations where you can't be certain a document will never be 17MB. This will still be true of ANY fixed limit. Is there a technical reason it would be impossible for documents to just grow up to the bounds of storage, if necessary? You can still warn people that performance suffers unless documents are mostly small.

Comment by Eliot Horowitz (Inactive) [ 11/Jan/11 ]

There is always going to be a limit, even if it's crazy high like 2GB.
So it's really a question of what it is.

If you had a 500MB document, performance would be really, really bad.
We also think that at that point, it's generally better to change the schema.

So 16MB seems to be the best of both worlds.

When 1.8 is out for a while, we can look again.

Comment by John Crenshaw [ 10/Jan/11 ]

I do see the value of the increase to another arbitrary limit (4MB was feeling a little cramped in the edge cases, so the increase gives room to breathe and feels GREAT), but I also understand where Walt is coming from. Are there any plans to allow the limit to be removed entirely? "Not breaking" is always more important to me than "uniform performance", so if performance is uniform...except in the cases where it would break right now, in which case it at least works...I'm a happy camper. (Besides, anything that grows that big probably triggered a lot of "non uniform performance" long before it got to Mongo.) Now that you've already changed it once, I imagine that the "driver assumptions" reason is an acceptable loss. Am I missing something?

In any case, it's nice to be up to 16MB. Thanks!

Comment by Walt Woods [ 09/Jan/11 ]

I really don't understand the rationale for increasing this to another static limit... The whole issue is that it should be a flexible setting (e.g. a runtime configuration) dependent on the use case of MongoDB.

Comment by Roger Binns [ 09/Jan/11 ]

I'm another user bitten by this arbitrary limit. In one of my schemas, documents represent files. I generate an opaque binary blob index of each file (a pickled Python data structure behind the scenes) and stick it in the document too. A recent algorithm change means that this binary blob grew larger. (It is a temporary change and will be optimized down to a smaller size later.) For about 3 percent of my files the blob is now larger than 4MB. (The largest is 11MB; compression halves the size.) I am running on a server with 24GB of RAM, so these small sizes are trivial.

I had to write new code to do the following:

  • Divert the blob to GridFS if it is oversized
  • Change all the lookups that previously just grabbed one key to also look in GridFS when necessary
  • Drop the code that looked for dupes when the blob goes to GridFS
  • Garbage collect GridFS (remove old versions of files)
  • Write an fsck tool that makes sure every document referencing something in GridFS actually has the corresponding GridFS item present, and regenerates it otherwise
  • Make sure every item in GridFS is referenced by a document, and delete it otherwise
  • Worry about the ordering and lack of transactions, because two documents are now involved (the main one and GridFS), and behind the scenes GridFS uses multiple documents that reference each other. Since Mongo doesn't have transactions spanning more than one document, it is very important that the ordering of these operations is done right so as not to break other code, and that failures are recoverable.

In other words, this arbitrary limit forced me to do a lot more work, increased code complexity, and made everything far more brittle. I'd have no issue with the limit being in the hundreds-of-megabytes range, but a handful of megabytes really doesn't help.
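
A minimal sketch of the two code paths described above, assuming PyMongo and its GridFS API, with illustrative names; the fsck and garbage-collection machinery is omitted:

    # Two-path storage: keep the index blob inline when it fits, divert it
    # to GridFS when it does not. Names are illustrative.
    import gridfs
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017").filestore
    fs = gridfs.GridFS(db)
    INLINE_LIMIT = 4 * 1024 * 1024  # the old 4MB BSON limit, minus headroom in practice

    def save_index_blob(file_id, blob):
        if len(blob) < INLINE_LIMIT:
            db.files.update_one(
                {"_id": file_id},
                {"$set": {"index_blob": blob}, "$unset": {"index_gridfs_id": ""}},
                upsert=True,
            )
        else:
            gridfs_id = fs.put(blob, filename=str(file_id))
            db.files.update_one(
                {"_id": file_id},
                {"$set": {"index_gridfs_id": gridfs_id}, "$unset": {"index_blob": ""}},
                upsert=True,
            )

    def load_index_blob(file_id):
        doc = db.files.find_one({"_id": file_id})
        if doc is None:
            return None
        if "index_blob" in doc:
            return doc["index_blob"]
        # Second code path: the main document and the GridFS chunks are not
        # updated transactionally, which is exactly the brittleness above.
        return fs.get(doc["index_gridfs_id"]).read()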

Comment by auto [ 09/Dec/10 ]

Author: Eliot Horowitz (erh) <eliot@10gen.com>

Message: increase bson size to 16mb SERVER-431
https://github.com/mongodb/mongo/commit/29d9dab034c6f7df497b76f543ba467e687a9063

Comment by Eliot Horowitz (Inactive) [ 12/Nov/10 ]

We're going to go with 16MB for 1.8.

Comment by auto [ 12/Oct/10 ]

Author: Eliot Horowitz (erh) <eliot@10gen.com>

Message: increase bson size to 8mb SERVER-1918 SERVER-431
http://github.com/mongodb/mongo/commit/b357c3ea89ef9374dd775c326f75d404bebe7f68

Comment by auto [ 11/Oct/10 ]

Author: Eliot Horowitz (erh) <eliot@10gen.com>

Message: split bson max size into User and Internal
sometimes objects have to be bigger (oplog insertion for example)
more prep for SERVER-431
http://github.com/mongodb/mongo/commit/53a0d295e32505a12d844ffc49fa5a07c8c9081c

Comment by auto [ 11/Oct/10 ]

Author: Eliot Horowitz (erh) <eliot@10gen.com>

Message: using BSONObjMaxSize everywhere bson size comes into play
still 4mb, this is just prep for changing SERVER-431
http://github.com/mongodb/mongo/commit/d3c3b8a9032e3cd125ebb09bec967441f1451753

Comment by John Crenshaw [ 08/Oct/10 ]

Ditto what Walt said. A compilation flag is too inflexible. I'm also paranoid. With every line of code I write, I think "how could this break?", which makes data modeling for Mongo pure torture. I'd love to see the 4MB limit just disappear. I don't care if a document becomes slower to work with once it grows past the 4MB mark.

Although only slightly related, I also share Walt's frustration with being unable to return only a matching embedded document. Consider the case of comments on a blog post, which would be placed in the blog post record; but that makes it nearly impossible to create an administration section that deals with comments separately from the post. I think you can do it with MapReduce, but that seems like a really nasty way of doing it, and I expect it would slow things way down.

Returning only the embedded document would of course be really helpful for document types that may become large. It's a lot nicer to return 2KB of embedded documents than 4MB of document.
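
For reference, later MongoDB releases added projection operators that do exactly this; a PyMongo sketch with illustrative collection and field names:

    # Return only the matching embedded comment instead of the whole post.
    from pymongo import MongoClient

    posts = MongoClient("mongodb://localhost:27017").blog.posts

    # Positional projection: the array field must appear in the query.
    doc = posts.find_one(
        {"comments.author": "alice"},
        {"comments.$": 1},
    )

    # $elemMatch projection allows a richer match on the embedded documents.
    doc = posts.find_one(
        {},
        {"title": 1, "comments": {"$elemMatch": {"author": "alice"}}},
    )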

Comment by Walt Woods [ 07/Oct/10 ]

@Leon Mergen - Yeah; I really wouldn't want it to be a compilation flag, though. I'd much prefer the flexibility to specify it for different database storage points: a runtime configuration option.

Even an API call for configuring a specific collection would be a good idea, as it might help prevent overflow in collections that really shouldn't have more than a 4MB limit.

Also, unlimited would be nice.... Yes I'm paranoid...

Comment by Leon Mergen [ 07/Oct/10 ]

If we could build our own mongo server with, for example, a --max_object_size=16MB flag, with the 10gen-hosted binary staying at 4MB, that would be perfectly acceptable to me. I just want a bit more control over the max size, instead of being dictated what my max object size should be, as long as I know the consequences of "disobeying" the recommended max object size.

Comment by Walt Woods [ 07/Oct/10 ]

This is actually the reason I'm not porting my app over to MongoDB. I think MongoDB would be a better fit than CouchDB for my app, due to things like $push, $pull, and eventually virtual / indexed collections (it also pushes me away that I can filter on { 'foo.bar': 3 } but not grab only the matching embedded document), but I can't actually use these features unless I'm sure that my application won't throw errors when e.g. a related object list grows too large (these embedded documents are very small and numerous, and shouldn't be their own documents. They are in CouchDB, and they would have to be in MongoDB at the moment as well).

Maybe at the very least, there could be a mongod flag that allows overriding of the BSON size limit, "for experienced users willing to accept the risks". That would be a good compromise.

Comment by Leon Mergen [ 15/Jul/10 ]

+1 for this fix; I concur with Julian's comment. In some rare cases we might hit documents larger than 4MB. Degraded performance wouldn't be a problem; data loss/exceptions would.

Comment by Khash Sajadi [ 14/Jul/10 ]

I would love to see the limit increase. We're storing web pages in documents and this would help a lot.

Comment by David Lee [ 17/Jun/10 ]

Perhaps the 4MB limit could be taken out in favor of adding documentation that MongoDB works best with small documents. Like Julian, I also worry that the 4MB limit would cause problems in some rare cases.

Comment by Julian Morrison [ 23/May/10 ]

A hard limit is a fundamentally different kind of thing from degrading performance, even a steep degradation. What it means is that if your data might ever, ever approach the 4MB limit, even under fringe exceptional circumstances (a comment thread featured on Digg, etc.), then you are going to have to split your data across multiple objects, even if that's a lot of extra code and requires pointless extra queries and CPU work in the less-than-4MB case. Spikes of load outside the norm always do happen, and they're the worst time to have a site break. So if you find a way to make this limit go away, it makes designing an app that uses MongoDB a lot easier.

Comment by Eliot Horowitz (Inactive) [ 19/Nov/09 ]

The 4MB limit isn't a hard limit per se; it's easy to change.
The reason it's there, and why we really like it, is that it keeps performance uniform, lets drivers make some assumptions about their input, and generally prevents really horrible things from happening.

If there is a large consensus that it should change, however, we certainly could.
