Core Server / SERVER-431

Increase the 4mb BSON Object Limit to 16mb

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor - P4
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.7.4
    • Component/s: None
    • Labels: None
    • Backport: No
    • # Replies: 25
    • Last comment by Customer: true

      Description

      Mostly for tracking who/how many others are interested in this, but it would be nice to have the option of >4MB objects.

      My specific use case is the storage of Twitter social graph data. It's not too much of an issue at the moment, as it takes about a million IDs to overflow the limit, but it's a "nice to have" to not have to hack up some other solution.
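
      For illustration, a minimal sketch (assuming a recent PyMongo where the top-level bson.encode helper is available; the document shape and counts below are hypothetical) of checking how close a document like this gets to the 4MB, or the 16MB, cap before inserting it:

      # Illustrative sketch: check a document's encoded BSON size against the limit.
      # Assumes PyMongo's bson package (bson.encode is available in PyMongo 3.9+).
      import bson

      FOUR_MB = 4 * 1024 * 1024      # the limit this ticket asks to raise
      SIXTEEN_MB = 16 * 1024 * 1024  # the limit shipped in 1.7.4

      def encoded_size(doc):
          """Size in bytes of `doc` once encoded as BSON."""
          return len(bson.encode(doc))

      # Hypothetical social-graph document: one user plus an array of follower IDs.
      n_followers = 1_000_000
      doc = {"_id": "some_twitter_user", "follower_ids": list(range(n_followers))}

      size = encoded_size(doc)
      print(f"{n_followers} IDs -> {size / 2**20:.1f} MiB of BSON")
      print("fits in 4MB: ", size <= FOUR_MB)
      print("fits in 16MB:", size <= SIXTEEN_MB)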


          Activity

          John Crenshaw
          added a comment -

          I think it is safe to say that everybody will accept any/all of the limits below without disappointment:
          1. BSON objects must be smaller than the chunk size
          2. BSON objects larger than 16MB may be much slower to return in a query (and/or slower to query the portions beyond the 16MB threshold).
          3. BSON objects must be smaller than 2GB on 32-bit systems (and some 64-bit limit).
          4. BSON objects must be smaller than the amount of memory available to mongod.
          5. Any other obvious system limits

          The big problem is not whether we will normally want to store that much data in a single record, but whether it MIGHT get that large under extraordinary conditions. If we were dealing with records that were likely to get this large, we would be foolish to not restructure the code. Conversely, it seems rather silly to use a complicated model and have to send multiple queries to get the job done, just to avoid problems that might happen if somehow the structure becomes large enough to overflow the limits. The best model in this case (really) is the one that works best under 99.9% of conditions, but we can't use that model if it might overflow in the edge cases, even if it normally only overflows just a little. In real world terms, we're trying to avoid the case where that one user does something a bit strange (like writing a book in the comments), and overflows the record limits. Right now, avoiding this means restructuring the data into multiple collections and records anytime we don't have enough control over size or quantity of entries in an array.

          There are two types of structure that I can think of that might overflow in the edge cases. First:
          1. Collection contains an array (especially with recursive schema, which is a uniquely useful capability of document databases)
          2. Entries in this array might contain large chunks of data
          3. The content of the data segment might be important for query purposes

          Some things that I thought of that might be like this are:
          1. Storing the extracted contents of an archive (for querying or searching). (Even if the upload size is limited to just 1-2MB, there is a chance that an archive could overflow 16MB when extracted.)
          2. Raw email data (mime encoded) stored as a thread (99.9% of the time doing this in a single record is no problem, but eventually some nut will directly embed a huge family Christmas photo, send to the extended family, and get 20 replies back and forth where nobody deleted the original photo from the body before replying.)

          The second structure is slightly similar to the first:
          1. Collection contains an array or tree structure
          2. Array or tree might need to collect an unusually large number of nodes, even though nodes might be generally small in size.

          Some things that I thought of that might be like this are:
          1. Comments on an article (Scenario might be Digg + especially verbose commenters + especially aggressive spambots)
          2. The Twitter Social Graph (actually, any social graph of sufficient popularity that someone can collect a couple hundred thousand friends)
          3. Full Text Index for documents that are uploaded and stored elsewhere (Someone uploads the Enron emails.)
          4. Access logs for a user (can you imagine if a user used this "limit" to hide doing "bad things"?)
          5. Historical information on a record (think "history of changes" in Wikipedia on the "Health Care Bill" page)

          Sure, you can work around all these cases by adjusting the schema, but the most obvious schema, and the one that works best for 99.99% of the records in these cases, can't be used, because it might overflow at just the worst time. Adjusting the schema generally requires mountains of additional application code, and is less stable. This is why people are hoping for a system that manages to "somehow" behave itself when things go beyond the "normal" limits.
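
          For illustration, a minimal sketch (assuming PyMongo; the database, collection, and field names are hypothetical) of the kind of restructuring described above, contrasting the "obvious" embedded model with the split-collection workaround:

          # Illustrative sketch: embedded vs. split schema for article comments.
          from pymongo import MongoClient

          db = MongoClient()["example_blog"]

          # Schema A: the "obvious" model -- comments embedded in the article document.
          # One write and one read, but the whole article is bounded by the BSON limit.
          def add_comment_embedded(article_id, comment):
              db.articles.update_one({"_id": article_id}, {"$push": {"comments": comment}})

          def load_article_embedded(article_id):
              return db.articles.find_one({"_id": article_id})  # comments come along for free

          # Schema B: the workaround -- comments split into their own collection.
          # No per-article size cap, but every caller now needs a second query,
          # plus the ordering and consistency logic the embedded model gave for free.
          def add_comment_split(article_id, comment):
              db.comments.insert_one(dict(comment, article_id=article_id))

          def load_article_split(article_id):
              article = db.articles.find_one({"_id": article_id})
              article["comments"] = list(
                  db.comments.find({"article_id": article_id}).sort("created_at", 1)
              )
              return article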

          Roger Binns
          added a comment -

          @Eliot: The problem is that there is no easy workaround. Any diligent developer is going to worry about these boundary conditions, and the point of putting the data in a database is that you really need the data saved. If the database rejects the data, then you have to code a plan B, which is a lot of work to foist on every application. You saw how much more work I had to do in an earlier message, and even that is far more brittle and has far more failure modes. (I also haven't written test code for it yet, but that is going to be a huge amount more.) This arbitrary limit means every client has to be coded with two ways of accessing data - regular and oversize. Solving it once at the database layer for all clients is far preferable.

          I very much agree with John's list of five. Note that none of those numbers are arbitrary whereas the current limit is. I'll also admit that I was one of those people thinking that the 4MB limit is perfectly fine and anyone going over it wasn't dealing with their data design well. Right up till the moment my data legitimately went over 4MB ...
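
          For illustration, a minimal sketch (assuming PyMongo 3.9+, with GridFS standing in as the overflow store; the database, collection, and field names are hypothetical) of the regular-plus-oversize double code path described above:

          # Illustrative sketch: regular insert with an oversize fallback to GridFS.
          import gridfs
          from bson import decode, encode
          from pymongo import MongoClient
          from pymongo.errors import DocumentTooLarge

          db = MongoClient()["example_app"]
          fs = gridfs.GridFS(db)  # overflow storage for oversize payloads

          def save_record(doc, big_field):
              """Regular path: plain insert. Oversize path: spill big_field to GridFS."""
              try:
                  db.records.insert_one(doc)
              except DocumentTooLarge:
                  payload = encode({big_field: doc.pop(big_field)})
                  doc[big_field + "_gridfs_id"] = fs.put(payload)
                  db.records.insert_one(doc)

          def load_record(record_id, big_field):
              """Every reader needs both paths too."""
              doc = db.records.find_one({"_id": record_id})
              gridfs_key = big_field + "_gridfs_id"
              if gridfs_key in doc:
                  doc[big_field] = decode(fs.get(doc.pop(gridfs_key)).read())[big_field]
              return doc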

          Eliot Horowitz
          added a comment -

          We still believe the benefits of limiting to a fixed size outweigh the benefits of no max size.

          Can you open a new ticket to track interest/thoughts?

          This ticket won't change for sure, and definitely not before 1.8.

          Ron Mayer
          added a comment -

          Eliot wrote: "There is always going to be a limit, even if its crazy high like 2gb. So its really a question of what it is."

          If that's the question, my vote would be for "crazy high like 2gb".

          Well over 99.99% of the documents I'm storing fit comfortably in 4MB. However, the source data we're bringing into MongoDB (XML docs in this format: http://www.niem.gov/index.php, from hundreds of government systems) doesn't have any hard constraints on document size.

          Yes, it's understandable that a huge document would be slow.

          No, it's not an option to simply drop the document.

          And it does kinda suck to have to code differently for the one-in-ten-thousand large documents.

          Roger Binns
          added a comment -

          Is there a ticket for getting rid of this limit (or having it like John suggested)?

          I'm now hitting the 16MB limit, which means I have to write and test two code paths - one for the majority of the data and one for the outliers. We don't run MongoDB on any machine with less than 32GB of RAM, so the current arbitrary limit does not help me in any way. In fact it just makes me waste time writing and testing more code.


            People

            • Votes: 31
            • Watchers: 19

            Dates

            • Created:
            • Updated:
            • Resolved:
            • Days since reply: 44 weeks, 6 days ago
            • Date of 1st Reply: