[SERVER-10722] MongoDB periodically segfaults, no error messages in mongo logs Created: 09/Sep/13  Updated: 18/Mar/15  Resolved: 18/Mar/15

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Dharshan Rangegowda Assignee: Samantha Ritter (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Linux
Participants:

 Description   

My mongod process dies periodically without writing any error messages to its logs. It has happened a couple of times over the last few weeks, and each time I had to restart the mongod server manually. I checked the system logs and it was not an OOM kill.

I am on AWS, an Amazon Linux XLarge instance.

Here are the mongod logs from when it died. I would like to understand if I can turn on some logging to determine the root cause if it happens again.

Mon Sep  9 10:38:28.554 [conn895640] end connection 10.139.25.254:42073 (246 connections now open)
Mon Sep  9 10:38:28.554 [conn895642] end connection 10.139.25.254:42074 (248 connections now open)
Mon Sep  9 10:38:28.594 [conn895644] end connection 10.139.25.254:42076 (242 connections now open)
Mon Sep  9 10:38:28.594 [conn895648] end connection 10.139.25.254:42080 (241 connections now open)
Mon Sep  9 10:38:28.594 [conn895652] end connection 10.139.25.254:42084 (240 connections now open)
Mon Sep  9 10:38:28.595 [conn895646] end connection 10.139.25.254:42079 (239 connections now open)
Mon Sep  9 10:38:28.595 [conn895650] end connection 10.139.25.254:42082 (239 connections now open)
Mon Sep  9 10:38:28.595 [conn895653] end connection 10.139.25.254:42086 (237 connections now open)
Mon Sep  9 10:38:28.595 [initandlisten] connection accepted from 10.139.25.254:42110 #895718 (239 connections now open)
Mon Sep  9 10:38:28.595 [initandlisten] connection accepted from 10.139.25.254:42109 #895719 (239 connections now open)
Mon Sep  9 10:38:28.596 [conn895647] end connection 10.139.25.254:42078 (238 connections now open)
Mon Sep  9 10:38:28.596 [conn895651] end connection 10.139.25.254:42083 (238 connections now open)
Mon Sep  9 10:38:28.596 [conn895654] end connection 10.139.25.254:42087 (236 connections now open)
Mon Sep  9 10:38:28.596 [conn895649] end connection 10.139.25.254:42081 (235 connections now open)
Mon Sep  9 10:38:28.596 [conn895645] end connection 10.139.25.254:42077 (235 connections now open)
Mon Sep  9 10:38:28.596 [conn895655] end connection 10.139.25.254:42085 (233 connections now open)
Mon Sep  9 10:38:28.601 [initandlisten] connection accepted from 10.139.25.254:42111 #895720 (234 connections now open)
Mon Sep  9 10:38:28.601 [initandlisten] connection accepted from 10.139.25.254:42112 #895721 (235 connections now open)
Mon Sep  9 10:38:28.603 [initandlisten] connection accepted from 10.139.25.254:42113 #895722 (236 connections now open)
Mon Sep  9 10:38:28.604 [initandlisten] connection accepted from 10.139.25.254:42114 #895723 (237 connections now open)
 
 
***** SERVER RESTARTED *****
 
 
Mon Sep  9 14:12:11.108 [initandlisten] MongoDB starting : pid=29285 port=27017 dbpath=/mongodb_data 64-bit host=ip-10-166-29-94
Mon Sep  9 14:12:11.109 [initandlisten] db version v2.4.1
Mon Sep  9 14:12:11.109 [initandlisten] git version: 1560959e9ce11a693be8b4d0d160d633eee75110
Mon Sep  9 14:12:11.109 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Mon Sep  9 14:12:11.109 [initandlisten] allocator: tcmalloc
Mon Sep  9 14:12:11.109 [initandlisten] options: { auth: "true", config: "/etc/mongod.conf", dbpath: "/mongodb_data", fork: "true", keyFile: "/var/lib/mongo/rskey", logappend: "true", logpath: "/var/log/mongo/mongod.log", pidfilepath: "/var/run/mongodb/mongod.pid", replSet: "RS-aviarymongo-571" }
Mon Sep  9 14:12:11.117 [initandlisten] journal dir=/mongodb_data/journal
Mon Sep  9 14:12:11.117 [initandlisten] recover begin
Mon Sep  9 14:12:11.117 [initandlisten] recover lsn: 299400579
Mon Sep  9 14:12:11.117 [initandlisten] recover /mongodb_data/journal/j._3
Mon Sep  9 14:12:11.132 [initandlisten] recover cleaning up
Mon Sep  9 14:12:11.132 [initandlisten] removeJournalFiles
Mon Sep  9 14:12:11.310 [initandlisten] recover done
Mon Sep  9 14:12:11.310 [initandlisten] preallocating a journal file /mongodb_data/journal/prealloc.0
Mon Sep  9 14:12:40.763 [websvr] admin web console waiting for connections on port 28017
Mon Sep  9 14:12:40.763 [initandlisten] waiting for connections on port 27017



 Comments   
Comment by Dharshan Rangegowda [ 21/Dec/13 ]

Looks like "ulimit -c unlimited" will do the trick. Thanks for pointing me in the right direction.
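For reference, a minimal sketch of what that change looks like in practice (the mongod user name and the limits.conf approach are assumptions for a typical RPM install; adjust for your init setup):

```shell
# Check the current core file size limit; 0 means no core dump is written.
ulimit -c

# Enable unlimited core dumps for the current shell and any mongod
# started from it.
ulimit -c unlimited

# To persist the limit across reboots (assuming pam_limits is active),
# the equivalent lines in /etc/security/limits.conf would be:
#   mongod soft core unlimited
#   mongod hard core unlimited
```

Note that if mongod is started via an init script, the limit must be raised in the environment the script runs in, not just in an interactive shell.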

Comment by Dharshan Rangegowda [ 21/Dec/13 ]

Hi Samantha,

Sorry for the delayed answer. I am using the Amazon Linux AMI. I tried the prlimit --help command and it does not work. What are my options for increasing the core file size?

Comment by Samantha Ritter (Inactive) [ 05/Nov/13 ]

If you have not explicitly set your logging level, I believe it will be 0. You can use

 db.runCommand({ getParameter: 1, logLevel: 1 }) 

against your admin database to check your current logging level, and then you should set it higher if it's low. Can you also tell me what your logpath is?
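For completeness, checking and then raising the level from the mongo shell might look like the following (the value 3 is only an example; legacy verbosity runs from 0 to 5):

```javascript
// Run against the admin database; getParameter reports the current level.
db.runCommand({ getParameter: 1, logLevel: 1 })

// setParameter raises it at runtime, without restarting mongod.
db.runCommand({ setParameter: 1, logLevel: 3 })
```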

Unfortunately the libc error isn't very helpful on its own. The addresses it reports are local to your machine. Also, an error from that library indicates a possible misuse of the library calls somewhere in mongodb. Where that happens is much more important in figuring out this problem.

We could really use a core dump here. Because your core file size is set to 0, though, the system won't leave you a core dump even in the case of a segfault. Do you know what version of linux your server is running? Can you try typing 'prlimit --help' into your terminal to see if that command is installed?

Also, can you give me as much information about your environment as possible? What exact AMI are you using, and is it the same on both clusters where you saw the issue? Is this a standalone mongod, or are you running a replica set or a sharded cluster? Generally, what kind of data and load is this machine handling?

Comment by Dharshan Rangegowda [ 04/Nov/13 ]

Is the segfault line in libc-2.12.so not useful enough for debugging? What is the default logging level? If it happens again I can update the bug.

Here are my ulimit values
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 63876
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 63876
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Comment by Samantha Ritter (Inactive) [ 04/Nov/13 ]

Hi Dharshan, I'm very sorry for the delayed response. Are you still having this same problem? Has this bug appeared again?

What level of logging are you running with currently? You can follow the instructions here to set the log level as high as 5: http://docs.mongodb.org/manual/reference/parameters/#param.logLevel

And as for the core dump, what are your ulimit values set to? Can you run "ulimit -a" ?

Comment by Dharshan Rangegowda [ 09/Oct/13 ]

Hi - I had a repro of this bug again on the same box where I reported the earlier segfault. I've now had about 3 repros of this bug over the course of a month. All the crashes have been on the secondary, and there are no log entries in the syslog or in the mongod log. Is there some setting I can turn on to enable more verbose debugging?

Comment by Dharshan Rangegowda [ 03/Oct/13 ]

Scott - I was able to repro the issue on a different cluster. The process crashed with no logs. Here is what I found in the system log

Sep 29 03:27:01 SG-edSpringProdSSD-1285 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="996" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Oct 1 21:30:09 SG-edSpringProdSSD-1285 kernel: hrtimer: interrupt took 3055698 ns
Oct 2 10:15:33 SG-edSpringProdSSD-1285 auditd[980]: Audit daemon rotating log files
Oct 3 14:12:29 SG-edSpringProdSSD-1285 kernel: mongod[12754]: segfault at 7fabc8fc6000 ip 00007fad23c6697b sp 00007fabc8fc3168 error 6 in libc-2.12.so[7fad23bdd000+18a000]

Shouldn't there be a crash dump for a segfault? I looked in /usr/bin where my mongod is installed and did not find any crash dumps.
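On most Linux systems the core file destination is governed by kernel.core_pattern, not by the binary's install directory; a quick way to check (a sketch, output varies per machine):

```shell
# Where the kernel writes core files. A bare name like "core" means
# "relative to the crashing process's current working directory".
cat /proc/sys/kernel/core_pattern

# If the pattern starts with a pipe ("|/usr/libexec/..."), a crash
# handler such as abrt intercepts the dump instead of writing a file.

# For a daemonized mongod the working directory is usually "/" or the
# directory it was started from, not /usr/bin where the binary lives.
```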

Comment by Dharshan Rangegowda [ 13/Sep/13 ]

Not yet. I am going to give it a few more days and will report back if it occurs.

Comment by Scott Hernandez (Inactive) [ 13/Sep/13 ]

Has this happened again? If so, when?

I didn't see anything in MMS that jumped out. There is no easy way I know of to detect EBS issues, other than logging on the client with dmesg or in /var/log.

Comment by Dharshan Rangegowda [ 10/Sep/13 ]

Here you go - https://mms.mongodb.com/host/detail/5215392c7ec5df2d7bff16d8/addd82321ed56deb3046c432819dc3ec

Comment by Scott Hernandez (Inactive) [ 10/Sep/13 ]

If you click on the host in MMS you will get a url like this: https://mms.mongodb.com/host/detail/<id>

That would be what we are looking for; or the group name in mms in the upper left hand corner of the MMS page.

Comment by Dharshan Rangegowda [ 10/Sep/13 ]

Can you tell me what you mean by MMS link? Here is the name of the machine in MMS

SG-aviarymongo-1101.servers.mongodirector.com

Btw the crash happens at time of zero load on the system.

Comment by Gregor Macadam [ 10/Sep/13 ]

Hi

Can you include the MMS link as Scott requested?

thx
Gregor

Comment by Dharshan Rangegowda [ 10/Sep/13 ]

Hi Scott,

It is a production cluster. So immediate upgrade of the version is not possible. The OS is Amazon linux (variant of CentOS).

Comment by Scott Hernandez (Inactive) [ 10/Sep/13 ]

I'd suggest upgrading to 2.4.6 – unrelated to this incident.

Can you post the MMS link or group name so I can take a look at the stats? You might want to grep the whole /var/log/ directory for mongod to see if anything else has related messages. What AMI/distro are you using (some redhat/centos distro)?

Comment by Dharshan Rangegowda [ 09/Sep/13 ]

1. No messages about mongod in /var/log/messages. I think this is the syslog log in Amazon linux. Please let me know otherwise.
2. No messages about mongod in dmesg
3. Mongo process is installed in /usr/bin. Doesn't seem to be any dmp files in this location
4. I installed using the rpm. So I use "service mongod start" to start the process.
5. Do you know how I can check if the amazon EBS volume went offline at this time?

Another note - I have munin and the MMS agent installed, and I am running version 2.4.1. Is there anything else I can configure so that we have more information the next time it happens?

Comment by Scott Hernandez (Inactive) [ 09/Sep/13 ]

I'd suggest looking for "mongod" to see if there are any messages about the process, and look at dmesg as well. Do you have a syslog log? If so, it might contain some indication around the same time.

Can you look for a dump file in the working directory where you started the process, to see if it crashed?

It is very unusual, and not something I've seen, for there to be no message in the system logs, no dump file, and no logging beyond what you provided.

Is there any chance that the device that was being logged to went offline or read-only at that time?

How is the mongod started and what is starting it?

Comment by Dharshan Rangegowda [ 09/Sep/13 ]

Hi Scott,

I already checked /var/log/messages and I see no messages from the OOM killer around the time mongo died. Are there any other logs I need to check?

Comment by Scott Hernandez (Inactive) [ 09/Sep/13 ]

Please check the system logs and make sure that the OOM killer did not terminate the process.

If it did, please read more here: http://docs.mongodb.org/manual/administration/production-notes/#swap

Generated at Thu Feb 08 03:23:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.