[SERVER-43021] MongoS server crashes when attempting to update single record > 16MB (16793648) Created: 23/Aug/19  Updated: 27/Dec/20  Resolved: 18/Feb/20

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 3.6.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Scott Glajch Assignee: Matthew Saltz (Inactive)
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongos_crash_1st.txt     Text File mongos_crash_2nd.txt     Text File mongos_crash_3rd.txt    
Issue Links:
Duplicate
is duplicated by SERVER-44345 MongoS crash with "BufBuilder attempt... Closed
Related
related to SERVER-29109 Client metadata log message verbosity... Open
related to SERVER-27663 Informational Network component log m... Open
related to SERVER-30841 Lower the amount of metadata refresh ... Closed
Operating System: ALL
Steps To Reproduce:

Try to update an existing record that is very close to 16MB with some more data, bringing it over the 16MB mark, using a single update.
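
A minimal mongo shell sketch of these steps (the collection and field names here are illustrative, not taken from the affected deployment): insert a document just under the 16MB BSON limit through a mongos, then issue a single update that pushes it over.

    // build a payload that leaves the document just under the 16777216-byte BSON limit
    var big = "x".repeat(16 * 1024 * 1024 - 4096);
    db.repro.insertOne({ _id: 1, payload: big });
    // confirm the document is just under the limit
    Object.bsonsize(db.repro.findOne({ _id: 1 }));
    // a single update that would push the document over 16MB; normally this is
    // rejected with a size error, but in the incident reported here it crashed the mongos
    db.repro.updateOne({ _id: 1 }, { $set: { extra: "y".repeat(8192) } });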

Sprint: Sharding 2019-09-09, Sharding 2019-09-23, Sharding 2019-10-07, Sharding 2019-12-02, Sharding 2019-12-16, Sharding 2019-12-30, Sharding 2020-02-10, Sharding 2020-02-24
Participants:
Case:

 Description   

I'm not certain if this would happen every time, but it did happen to us in production.

We had an object that was very close to 16MB (15.99MB according to bsonsize()), and our application went to update the record with a little more data.

The mongos that was being used then crashed with the following message:

2019-08-11T08:10:25.814+0000 F ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Uncaught exception in NetworkInterfaceASIO IO worker thread of type: Location10334: BSONObj size: 16794106 (0x10041FA) is invalid. Size must be between 0 and 16793600(16MB) First element: update: "<COLLECTION_NAME>"

FYI: in the above message and in the full crash logs, the collection name is redacted to "<COLLECTION_NAME>".

Then our application, which retries this write periodically if the initial attempt fails, tried again a little later against a different mongos server, which also crashed. This left our cluster effectively unavailable, since both mongos nodes had crashed.

I've attached both stack traces.

 

Obviously we don't want to be running with DB objects at or close to 16MB, so we fixed the object in question to be smaller. But even though this isn't something that happens all the time, it does happen occasionally, and going forward we need our production servers to let writes of ~16MB objects fail to save gracefully rather than crash.
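
For reference, this is roughly the failure mode we'd expect (a shell-level sketch against the same illustrative collection as above, not our actual application code): the oversized write should come back as an error to the client while the mongos stays up.

    try {
        db.repro.updateOne({ _id: 1 }, { $set: { extra: "y".repeat(8192) } });
    } catch (e) {
        // expected: a write error along the lines of "Resulting document after update
        // is larger than 16777216" -- the update is rejected, the router keeps serving
        print("update rejected: " + e);
    }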

Our version is technically 3.6.6-evg1, a custom build branched directly off of 3.6.6, which you can find here: https://github.com/evergage/mongo/commits/v3.6.6-evg1. The only difference is the last 3 commits you see there, which quiet some extra-verbose metadata logging that was producing essentially unbounded log volume and had to be silenced for us to run this in production. Since the changes are so minor, hopefully the stack trace line numbers and such are still usable for you. That logging bug (https://jira.mongodb.org/browse/SERVER-30841?filter=21888) has since been fixed in 3.6.8, and assuming it silences everything we silenced in our custom build (3 different files), we may be able to stop running a custom build in the future.

 



 Comments   
Comment by Matthew Saltz (Inactive) [ 18/Feb/20 ]

We haven't heard back from you for some time, so I'm going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by Danny Hatcher (Inactive) [ 31/Jan/20 ]

Hey Scott, have you had a chance to upload the files that Matthew asked for?

Comment by Matthew Saltz (Inactive) [ 06/Nov/19 ]

The logs for the router, shards, and config server for a short period (5 minutes I'm guessing would be sufficient) prior to the crash would be great. I don't think I need other diagnostic data, but for your reference there's a diagnostic data collection guide here.

For logging levels, if you're able to set the verbosity to log level 3, that would be great - hopefully that won't create an overwhelming amount of output. If it does, let me know, and I can try to make a finer grained suggestion.
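
For example, to raise the global verbosity to 3 on the running mongos (and drop it back to 0 once we have what we need):

    db.setLogLevel(3)
    // equivalently, via setParameter:
    db.adminCommand({ setParameter: 1, logLevel: 3 })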

Thanks!

Comment by Scott Glajch [ 05/Nov/19 ]

Sure, I'll work on getting that data over to you now.

We actually had this crash happen again today, so I'll see if I can grab logs for both sets of events.

Is there a doc somewhere about the various types of logs and diagnostic data I should be sending you?  I can just grab the logs from the mongos server in question and the shard replica set members in question around the time period of the crash, but I'm not sure how far to go.  Do I give you the config server logs too?
I know there are those "diagnostic.data" files, but I'm unsure if I'm allowed to upload those, as I don't think they're in a format I can easily redact.

Finally, if there are any logging levels you'd like me to set on the mongos going forward, I'll happily set those now. Since there have been 2 crashes within a week's time, I'm hopeful I can get yet another crash to happen with increased logging data.

 

Comment by Danny Hatcher (Inactive) [ 05/Nov/19 ]

glajchs, you can use our Secure Uploader instead of JIRA if the file sizes for the requested logs are too big.

Comment by Matthew Saltz (Inactive) [ 04/Nov/19 ]

Hi glajchs, I've reopened the issue. Would it be possible for you to provide us with the log output from that mongos leading up to the crash? Having logs from the shards might help too, if it's possible to trace exactly which operation caused this. This might help with diagnosis.

Comment by Scott Glajch [ 02/Nov/19 ]

I should note that we had just finished migrating our main (important) mongo cluster from 3.4 to 3.6, and just to be extra careful, for the last step we migrated only 1 of our 5 mongos nodes to 3.6. I think if we had migrated all 5 nodes, all of the mongos nodes would have crashed and our system would have been down.

This is super critical to root-cause and fix; otherwise we will have to abort our 3.6 upgrade (for the 2nd year in a row) and stay with 3.4.

Is there a way I can re-open this bug?

Comment by Scott Glajch [ 02/Nov/19 ]

I just saw this issue happen again, this time about a day after moving our main cluster from 3.4.17-evg1 to 3.6.14.  Third stack trace attached.

Comment by Matthew Saltz (Inactive) [ 25/Sep/19 ]

I've filed a ticket (SERVER-43623) to make issues like the one in mongos_crash_1st easier to diagnose. Unfortunately, without more information or a more specific reproducer, I don't think we'll be able to make any definitive conclusions about that issue. Closing the issue; glajchs, please feel free to update and reopen if you have more information that you think would be helpful.

Comment by Matthew Saltz (Inactive) [ 30/Aug/19 ]

I noticed for the file mongos_crash_2nd.txt, the invariant you hit has since been removed by SERVER-40052, which was backported all the way to 3.6 (though I'm not sure which tag it was released in). So upgrading to include that commit should fix that issue.

The mongos_crash_1st is a bit harder to diagnose, since it just indicates that an unexpected exception was thrown somewhere in a networking thread, which doesn't really narrow things down, unfortunately. A deterministic reproducer would help, and some of the logs preceding the crash could also be useful. kaloian.manassiev do you have a better idea of how to diagnose it?

Comment by Scott Glajch [ 26/Aug/19 ]

I understand that the mongos server is likely not crashing every time someone attempts to write a >16MB record, otherwise there would be a lot more crashes out in the wild. There has to be something specific about the event that happened to us, and once that is narrowed down, feel free to change the title to something (slightly) less scary. Unfortunately I don't have the full raw object as it was attempted to be written at the time, so I'm not sure how I'd be able to help reproduce this one better.

 

Thanks for the very helpful synopsis of those changes @Kevin Pulo! I think for us the spam that was too much for our system was mostly concentrated in changes #1 and #3 there. Change #2, though a bit aggressive at times, is actually something we might still be logging in our custom build. As stated in the bug for that one, many customers find the connection logging useful for troubleshooting issues, and I can certainly agree with that. We actually have our logging service set up to help us analyze which node might be at fault (or had a blip) using those log messages.

We're currently using 3.6.6 for our raw event cluster and 3.4.19 for our core usage (much more important and real-time sensitive) cluster. We're likely going to be replacing the raw event cluster with a more purpose-tailored log-structured storage database soonish, but the core cluster we need to get over to 3.6. We tried once but had to back out of the 3.6 upgrade due to performance issues and bugs; it's about time for us to address those issues and get our cluster upgraded, at which point I'll look into using the flags available to help manage the logging levels, and will definitely use your links.

Comment by Kevin Pulo [ 26/Aug/19 ]

Hi glajchs, regarding your 3 logging patches:

Comment by Danny Hatcher (Inactive) [ 23/Aug/19 ]

Thanks for creating this ticket. This should have been resolved in 3.6.1 via SERVER-32372, but it does appear to be the same error. I'll pass this along to the appropriate team to take a look.
