  Core Server / SERVER-43021

MongoS server crashes when attempting to update a single record > 16MB (16793648)

    Details

    • Type: Bug
    • Status: Investigating
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: 3.6.6
    • Fix Version/s: None
    • Component/s: Stability
    • Labels:
      None
    • Operating System:
      ALL
    • Steps To Reproduce:
      Try to update an existing record that is very close to 16MB with some more data, bringing it over the 16MB mark, using a single user update.
    • Sprint:
      Sharding 2019-09-09, Sharding 2019-09-23, Sharding 2019-10-07, Sharding 2019-12-02, Sharding 2019-12-16

      Description

      I'm not certain if this would happen every time, but it did happen to us in production.

      We had an object that was very close to 16MB (15.99MB according to bsonsize()), and our application went to update the record with a little more data.

      The mongos that was being used then crashed with the following message:

      2019-08-11T08:10:25.814+0000 F ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Uncaught exception in NetworkInterfaceASIO IO worker thread of type: Location10334: BSONObj size: 16794106 (0x10041FA) is invalid. Size must be between 0 and 16793600(16MB) First element: update: "<COLLECTION_NAME>"
      

      FYI: in the log line above and in the full crash logs, the collection name is redacted to "<COLLECTION_NAME>".
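
The sizes in the log line above can be sanity-checked with a little arithmetic. The sketch below uses only the two numbers from the log message plus one assumption: that the user-facing document limit is 16 MiB (the documented `maxBsonObjectSize`). The variable names are illustrative and are not taken from the server source.

```python
# Values copied from the mongos log line above.
reported_size = 16794106   # "BSONObj size: 16794106 (0x10041FA)"
internal_limit = 16793600  # "Size must be between 0 and 16793600(16MB)"

# Assumption: the user-facing document limit is 16 MiB (maxBsonObjectSize).
user_doc_limit = 16 * 1024 * 1024

# How much larger the limit in the log is than the 16 MiB document limit.
headroom = internal_limit - user_doc_limit

# How far the rejected object exceeded the limit in the log.
overshoot = reported_size - internal_limit

print(hex(reported_size))  # matches the hex value in the log
print(headroom)            # bytes of headroom above 16 MiB
print(overshoot)           # bytes over the logged limit
```

The limit in the log works out to 16 MiB plus 16 KiB, and the rejected object was 506 bytes over it. Since the first element of that object is `update`, it appears to be the whole update command rather than the stored document alone, which may explain how a document just under 16MB could still exceed the limit.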

      Then our application, which tries to re-write this data periodically if the initial write fails, tried to write it a little later, and went to a different mongos server, which also crashed.  This caused our cluster to be effectively unavailable since both mongos nodes had crashed.

      I've attached both stack traces.

      Obviously we don't want to run with DB objects at or close to 16MB, so we reduced the size of the object in question. This isn't something that happens all the time, but it does happen occasionally, and going forward we need our production servers to fail gracefully when an oversized (16MB) object cannot be saved.
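
Until the server fails gracefully, one possible stopgap is an application-side guard that refuses to send an update expected to push a document past the limit. This is a minimal sketch, not our production code: `check_projected_size`, `DocumentTooLargeError`, and the 64 KiB safety margin are all hypothetical, and it assumes the application already knows the current encoded size (for example from `bsonsize()`) and a rough size for the data being added.

```python
# Documented user-facing BSON document limit (16 MiB).
MAX_BSON_DOCUMENT_SIZE = 16 * 1024 * 1024


class DocumentTooLargeError(ValueError):
    """Hypothetical error raised instead of sending an oversized update."""


def check_projected_size(current_size, added_size, safety_margin=64 * 1024):
    """Return the projected size, or raise if it would come too close to the limit.

    current_size: encoded size of the stored document, in bytes.
    added_size: estimated encoded size of the data being added, in bytes.
    safety_margin: hypothetical buffer, since the estimate is only approximate.
    """
    projected = current_size + added_size
    limit = MAX_BSON_DOCUMENT_SIZE - safety_margin
    if projected > limit:
        raise DocumentTooLargeError(
            f"projected size {projected} exceeds guarded limit {limit}"
        )
    return projected


# Usage: a document at roughly 15.99MB plus 100 KiB of new data is rejected locally.
try:
    check_projected_size(int(15.99 * 1024 * 1024), 100 * 1024)
except DocumentTooLargeError as exc:
    print("skipping update:", exc)
```

The additive estimate is approximate, because the final encoded size depends on how the update is applied, so the margin is a judgment call rather than a guarantee.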

      Our version is technically 3.6.6-evg1, a custom build branched directly off 3.6.6, which you can find here: https://github.com/evergage/mongo/commits/v3.6.6-evg1. The only difference is the last 3 commits you see there, which just quiet some overly verbose metadata logging that was producing a practically endless stream of log entries and that we had to silence in order to run this in production. Since the changes are so minor, hopefully the stack trace line numbers and such are still usable for you. That bug (https://jira.mongodb.org/browse/SERVER-30841?filter=21888) has since been fixed in 3.6.8; assuming the fix silences everything we silenced in our custom build (3 different files), we might be able to stop running a custom build in the future.

        Attachments

        1. mongos_crash_1st.txt
          4 kB
        2. mongos_crash_2nd.txt
          8 kB
        3. mongos_crash_3rd.txt
          4 kB
