[SERVER-32254] mongod crashes at higher application loads (multiple mongod version) Created: 11/Dec/17 Updated: 21/Mar/18 Resolved: 12/Feb/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.2.16, 3.4.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Jesse Beard | Assignee: | Mark Agarunov |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 14.04.5 LTS |
||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
We are experiencing crashes in both MongoDB v3.2.16 and v3.4.5 when operating at a higher application load. We believe the crashes are similar or the same, but mongod logging is slightly different across the specified versions. I can provide any additional information you may require beyond what is in the log files. Logs attached. |
| Comments |
| Comment by Mark Agarunov [ 12/Feb/18 ] | ||||||||||||||
|
Hello jesse.beard@captiveaire.com, Thank you for the additional information, and my apologies for the delay in response. Unfortunately, we have not been able to diagnose a root cause for the errors you are seeing. The errors in the most recent files indicate data corruption at the disk or filesystem level; however, as Bruce mentioned, the fact that the errors seem unrelated to each other points to a hardware or other underlying issue outside of MongoDB. If this is still an issue for you, we can take another look if additional crashes have been logged, but as I don't see anything to indicate a bug in MongoDB, I've closed this ticket for the time being. Thanks, | ||||||||||||||
| Comment by Jesse Beard [ 15/Dec/17 ] | ||||||||||||||
|
Bruce, We have experienced the crash again and I have uploaded full logs and diagnostic.data as requested. The logs cover the period from the server restart after initial data sync up to the crash event, and they stop at the crash. The instance ran for ~48 hours before crashing. Uploaded file: captiveaire-crash-data-logs-2017-12-14_ The only environmental difference is that we upgraded the replica set to 3.4.10. Please let me know what else I can provide for you. | ||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 12/Dec/17 ] | ||||||||||||||
|
Thanks for mentioning those tickets. I don't see any relationship at the moment, but I will keep them in mind.
I don't have any specific recommendations. I would look for errors or other unusual messages correlated in time with the crashes. If you would like to upload the guest syslog and the host VMWare system log files, I can take a look as well. | ||||||||||||||
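One way to look for messages correlated in time with a crash is sketched below. It is an illustration only, not something posted in this ticket: the function name `lines_near_crash` and the five-minute window are hypothetical. It assumes the traditional syslog timestamp prefix (for example `Dec  5 10:23:01`, which carries no year) and English month names.

```python
from datetime import datetime, timedelta


def lines_near_crash(lines, crash_time, window=timedelta(minutes=5), year=None):
    """Return syslog lines whose timestamp lies within `window` of `crash_time`.

    Assumption: each line starts with a 15-character traditional syslog
    timestamp such as "Dec  5 10:23:01". The year is not in the log, so it
    is taken from `crash_time` unless given explicitly.
    """
    year = year or crash_time.year
    hits = []
    for line in lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parseable timestamp
        if abs(stamp - crash_time) <= window:
            hits.append(line.rstrip("\n"))
    return hits
```

The crash time itself would come from the last lines of the mongod log before the backtrace; the same filter can then be run over both the guest syslog and the host log.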
| Comment by Jesse Beard [ 11/Dec/17 ] | ||||||||||||||
|
Bruce, Thank you for the response. I will upload additional logs and diagnostic data for you. The reason I am posting logs from two different versions of Mongo is that we thought the issue was isolated to the 3.4.x version range, so we slowly started backing down the replica set thinking the issue would subside once we reached what used to be a stable version for us which was 3.2.16. However, 3.2.16 is also crashing as indicated in the logs, but with a different crash log vs. what is logged by 3.4.5. I assumed maybe the difference was due to logging differences between the versions, but you are saying it is not. Do you see possible similarities to the below? https://jira.mongodb.org/browse/SERVER-31121 I do realize the linked issues are dealing with zlib compression. Our replica set is not running any network compression, obviously not on 3.2.16, and is using the default of "none" when running 3.4.5 on the same replica set. Do you know of any specific items to look at for VMWare ESXi beyond what is listed in the mongo best practices for production that may cause issues such as what I have posted? | ||||||||||||||
| Comment by Bruce Lucas (Inactive) [ 11/Dec/17 ] | ||||||||||||||
|
Hi Jesse, The three failures in the logs you attached have very different signatures, and none of them match known problems:
In mongo1-backtrace-v3.4.5-2017-12-05.log: WT panic due to encountering bad data
In mongo1-backtrace-v3.2.16-2017-12-05.log: seg fault in __wt_page_in_func
In mongo2-backtrace-v3.2.16-2017-12-11.log: seg fault in QueryPlanner::cacheDataFromTaggedTree
Given that you are seeing multiple different types of failures in two different versions of mongod, I would first suspect a hardware issue, or an issue with the virtualization software. Can you check system logs on both the host and guest operating systems for errors? Also, if you can upload all available mongod log files and the archived content of the $dbpath/diagnostic.data directory to this secure upload portal, we can take a look to see if we can spot some commonality between the failures. Thanks, | ||||||||||||||
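A minimal sketch of the archiving step mentioned above, assuming a standard layout in which `diagnostic.data` sits directly under the dbpath. The helper name `archive_diagnostic_data` and the output path are hypothetical; this is not a MongoDB tool, only the Python standard library `tarfile` module.

```python
import tarfile
from pathlib import Path


def archive_diagnostic_data(dbpath, out_file):
    """Pack <dbpath>/diagnostic.data into a gzip-compressed tar archive."""
    diag_dir = Path(dbpath) / "diagnostic.data"
    if not diag_dir.is_dir():
        raise FileNotFoundError(f"no diagnostic.data directory under {dbpath}")
    with tarfile.open(str(out_file), "w:gz") as tar:
        # Store entries under a relative "diagnostic.data/" prefix so the
        # archive does not expose the full dbpath.
        tar.add(str(diag_dir), arcname="diagnostic.data")
    return Path(out_file)
```

Writing the archive somewhere outside the dbpath avoids mixing it with the data files.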
| Comment by Jesse Beard [ 11/Dec/17 ] | ||||||||||||||
|
This is a 3 data node replica set. | ||||||||||||||
| Comment by Jesse Beard [ 11/Dec/17 ] | ||||||||||||||
|
Mongo VMs have host affinity in VMWare and have not been vMotioned during the time of these crashes. |