[SERVER-32254] mongod crashes at higher application loads (multiple mongod version) Created: 11/Dec/17  Updated: 21/Mar/18  Resolved: 12/Feb/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.2.16, 3.4.5
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Jesse Beard Assignee: Mark Agarunov
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 14.04.5 LTS
XFS filesystem for data, local, and journal
VMWare ESXi 6.0.0, build 5224934
48GB ram
6 cores


Attachments: Text File mongo1-backtrace-v3.2.16-2017-12-05.log     Text File mongo1-backtrace-v3.4.5-2017-12-05.log     Text File mongo2-backtrace-v3.2.16-2017-12-11.log    
Operating System: ALL
Participants:

 Description   

We are experiencing crashes in both MongoDB v3.2.16 and v3.4.5 when operating at higher application load.

We believe the crashes are similar or the same, but mongod logging is slightly different across the specified versions.

I can specify any additional information you may require beyond what is in the log files.

Logs attached.



 Comments   
Comment by Mark Agarunov [ 12/Feb/18 ]

Hello jesse.beard@captiveaire.com,

Thank you for the additional information, and my apologies for the delay in responding. Unfortunately, we have not been able to diagnose a root cause for the errors you are seeing. The errors in the most recent files indicate data corruption at the disk or filesystem level; however, as Bruce mentioned, the fact that the errors seem unrelated to each other points to a hardware or other underlying issue outside of MongoDB. If this is still an issue for you, we can take another look if additional crashes have been logged, but as I don't see anything to indicate a bug in MongoDB, I've closed this ticket for the time being.

Thanks,
Mark

Comment by Jesse Beard [ 15/Dec/17 ]

Bruce,

We have experienced the crash again, and I have uploaded full logs and diagnostic.data as requested. The logs cover the period from server restart after the initial data sync through the crash event; they stop at the crash. The instance ran for ~48 hours before crashing.

Uploaded file: captiveaire-crash-data-logs-2017-12-14_SERVER-32254.zip

The only environmental difference is that we did upgrade the replica set to 3.4.10.

Please let me know what else I can provide for you.

Comment by Bruce Lucas (Inactive) [ 12/Dec/17 ]

Thanks for mentioning those tickets. I don't see any relationship at the moment, but I will keep them in mind.

> Do you know of any specific items to look at for VMWare ESXi beyond what is listed in the mongo best practices for production that may cause issues such as what I have posted?

I don't have any specific recommendations. I would look for errors or other unusual messages correlated in time with the crashes. If you would like to upload the guest syslog and the host VMware system log files, I can take a look as well.

Comment by Jesse Beard [ 11/Dec/17 ]

Bruce,

Thank you for the response. I will upload additional logs and diagnostic data for you. The reason I am posting logs from two different versions of MongoDB is that we initially thought the issue was isolated to the 3.4.x line, so we gradually downgraded the replica set, expecting the issue to subside once we reached what had previously been a stable version for us, 3.2.16. However, 3.2.16 is also crashing, as the logs show, but with a different crash log than what 3.4.5 produces. I assumed the difference might be due to logging changes between the versions, but you are saying it is not.

Do you see possible similarities to the below?

https://jira.mongodb.org/browse/SERVER-31121
https://jira.mongodb.org/browse/SERVER-17713

I do realize the linked issues deal with zlib compression. Our replica set is not running any network compression: it cannot on 3.2.16, which predates the feature, and it uses the default of "none" when running 3.4.5 on the same replica set.

Do you know of any specific items to look at for VMWare ESXi beyond what is listed in the mongo best practices for production that may cause issues such as what I have posted?

Comment by Bruce Lucas (Inactive) [ 11/Dec/17 ]

Hi Jesse,

The three failures in the logs you attached have very different signatures, and none of them match known problems:

In mongo1-backtrace-v3.4.5-2017-12-05.log: WT panic due to encountering bad data

2017-12-05T03:50:35.095Z E STORAGE  [conn81395] WiredTiger error (0) [1512445835:92730][5812:0x7f9764a42700], file:local/collection/12-387512412807676176.wt, WT_CURSOR.insert: encountered an illegal file format or internal value
2017-12-05T03:50:35.100Z E STORAGE  [conn81395] WiredTiger error (-31804) [1512445835:100374][5812:0x7f9764a42700], file:local/collection/12-387512412807676176.wt, WT_CURSOR.insert: the process must exit and restart: WT_PANIC: WiredTiger library panic
***aborting after fassert() failure

In mongo1-backtrace-v3.2.16-2017-12-05.log: seg fault in __wt_page_in_func

2017-12-08T22:34:17.019Z F -        [conn33489] Got signal: 11 (Segmentation fault).
 mongod(__wt_page_in_func+0x1B7) [0x1a0a5a7]
 mongod(__wt_row_search+0x68F) [0x1a2ce7f]
 mongod(__wt_btcur_insert+0xB33) [0x19fa423]

In mongo2-backtrace-v3.2.16-2017-12-11.log: seg fault in QueryPlanner::cacheDataFromTaggedTree

2017-12-11T14:44:16.793Z F -        [conn42405] Got signal: 11 (Segmentation fault).
 mongod(_ZN5mongo12QueryPlanner23cacheDataFromTaggedTreeEPKNS_15MatchExpressionERKSt6vectorINS_10IndexEntryESaIS5_EEPPNS_18PlanCacheIndexTreeE+0x42) [0xe5f1c2]
 mongod(_ZN5mongo12QueryPlanner23cacheDataFromTaggedTreeEPKNS_15MatchExpressionERKSt6vectorINS_10IndexEntryESaIS5_EEPPNS_18PlanCacheIndexTreeE+0x34D) [0xe5f4cd]
 mongod(_ZN5mongo12QueryPlanner4planERKNS_14CanonicalQueryERKNS_18QueryPlannerParamsEPSt6vectorIPNS_13QuerySolutionESaIS9_EE+0x19DF) [0xe60fff]
 mongod(+0xA13C20) [0xe13c20]
 mongod(_ZN5mongo11getExecutorEPNS_16OperationContextEPNS_10CollectionESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS5_EENS_12PlanExecutor11YieldPolicyEm+0x74) [0xe15124]
 mongod(_ZN5mongo15getExecutorFindEPNS_16OperationContextEPNS_10CollectionERKNS_15NamespaceStringESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS8_EENS_12PlanExecutor11YieldPolicyE+0x7B) [0xe15cbb]

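The `mongod(...)` frames above are C++-mangled symbols that can be demangled with `c++filt`. A minimal sketch (assuming the frame format shown in these backtraces; not an official MongoDB tool) for pulling the symbol, offset, and return address out of each line, e.g. to compare stacks across crashes:

```python
import re

# Frame lines look like:  mongod(__wt_page_in_func+0x1B7) [0x1a0a5a7]
# The symbol may be absent for unexported frames:  mongod(+0xA13C20) [0xe13c20]
FRAME_RE = re.compile(
    r'mongod\(([^+)]*)(?:\+0x([0-9A-Fa-f]+))?\)\s+\[0x([0-9A-Fa-f]+)\]')

def parse_frame(line):
    """Return (symbol, offset, address) for one backtrace line, or None."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    sym, off, addr = m.groups()
    return sym, int(off, 16) if off else 0, int(addr, 16)

# Example frame taken from the mongo1 v3.2.16 backtrace above:
print(parse_frame(" mongod(__wt_page_in_func+0x1B7) [0x1a0a5a7]"))
```

Piping the extracted symbol through `c++filt` then yields the readable C++ signature.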
Given that you are seeing multiple different types of failures in two different versions of mongod, I would first suspect a hardware issue, or an issue with the virtualization software. Can you check system logs on both the host and guest operating systems for errors?

Also, if you can upload all available mongod log files and the archived content of the $dbpath/diagnostic.data directory to this secure upload portal, we can take a look to see if we can spot some commonality between the failures.
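Archiving the diagnostic.data directory for upload can be done with `tar` or with a few lines of Python. A small sketch (the dbpath in the usage comment is an assumed Linux default; substitute your own):

```python
import tarfile
from pathlib import Path

def archive_diagnostic_data(dbpath, out_path):
    """Create a gzipped tarball of <dbpath>/diagnostic.data for upload."""
    src = Path(dbpath) / "diagnostic.data"
    if not src.is_dir():
        raise FileNotFoundError(f"no diagnostic.data under {dbpath}")
    with tarfile.open(out_path, "w:gz") as tar:
        tar.add(src, arcname="diagnostic.data")
    return out_path

# Typical usage (default Linux dbpath assumed):
# archive_diagnostic_data("/var/lib/mongodb", "diagnostic-data.tar.gz")
```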

Thanks,
Bruce

Comment by Jesse Beard [ 11/Dec/17 ]

This is a 3 data node replica set.

Comment by Jesse Beard [ 11/Dec/17 ]

Mongo VMs have host affinity in VMware and have not been vMotioned during the time of these crashes.

Generated at Thu Feb 08 04:29:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.