[SERVER-10722] MongoDB periodically segfaults, no error messages in mongo logs Created: 09/Sep/13 Updated: 18/Mar/15 Resolved: 18/Mar/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Dharshan Rangegowda | Assignee: | Samantha Ritter (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | Linux |
| Participants: |
| Description |
|
My mongod process dies at random without writing anything to the logs. It has happened a couple of times over the last few weeks, and each time I had to restart the mongod server manually. I looked at the system logs, and it was not an OOM kill. I am on AWS, on an Amazon Linux XLarge instance. Here are the mongod logs from when it died. I would like to know whether I can turn on some additional logging to determine the root cause if it happens again.
|
| Comments |
| Comment by Dharshan Rangegowda [ 21/Dec/13 ] | |
|
It looks like "ulimit -c unlimited" will do the trick. Thanks for pointing me in the right direction. | |
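As a sketch of what `ulimit -c unlimited` does, assuming Python 3 on Linux: the standard `resource` module can raise the soft `RLIMIT_CORE` limit, and a process started afterwards (for example mongod) inherits that limit. Unlike the shell builtin, this sketch only raises the soft limit up to the existing hard limit, because going past the hard limit requires root. `allow_core_dumps` and `launch_with_core_dumps` are hypothetical helper names.

```python
import resource
import subprocess
from typing import Sequence, Tuple


def allow_core_dumps() -> Tuple[int, int]:
    """Raise this process's soft core-file-size limit to its hard limit.

    An unprivileged process may raise its soft limit up to (but not past)
    the hard limit. When the hard limit is already RLIM_INFINITY, this has
    the same effect as `ulimit -c unlimited`.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)


def launch_with_core_dumps(argv: Sequence[str]) -> subprocess.Popen:
    """Start a command whose core-file limit is raised before it executes.

    preexec_fn runs in the child after fork() and before exec(), so only
    the launched process (and its own children) get the raised limit.
    """
    return subprocess.Popen(list(argv), preexec_fn=allow_core_dumps)
```

For example, `launch_with_core_dumps(["mongod", "--config", "/etc/mongod.conf"])`, where the config path is an assumption. Because limits are inherited from the parent, if mongod is started by an init script, the limit has to be raised in that script's environment before mongod launches.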
| Comment by Dharshan Rangegowda [ 21/Dec/13 ] | |
|
Hi Samantha, sorry for the delayed answer. I am using the Amazon Linux AMI. I tried the prlimit --help command and it does not work. What are my options for increasing the core file size? | |
| Comment by Samantha Ritter (Inactive) [ 05/Nov/13 ] | |
|
If you have not explicitly set your logging level, I believe it will be 0. You can use
Unfortunately, the libc error isn't very helpful on its own. The addresses it reports are local to your machine. Also, an error from that library suggests that mongodb may be misusing a library call somewhere, and where that happens is much more important for figuring out this problem. We could really use a core dump here. Because your core file size is set to 0, though, the system won't leave you a core dump even in the case of a segfault. Do you know what version of Linux your server is running? Can you try typing 'prlimit --help' into your terminal to see if that command is installed? Also, can you give me as much information about your environment as possible? What exact AMI are you using, and is it the same on both clusters where you saw the issue? Is this a standalone mongod, or are you running a replica set or a sharded cluster? Generally, what kind of data and load is this machine handling? | |
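Where `prlimit` is not installed, a running process's core-file limit can still be read from the Linux `/proc/<pid>/limits` file, whose "Max core file size" row lists the soft and hard limits. The sketch below, assuming Python 3 on Linux, parses that row; `parse_core_limit` and `core_limit_of` are hypothetical helper names.

```python
from pathlib import Path
from typing import Optional, Tuple, Union

Limit = Union[int, str]  # a byte count, or the string "unlimited"
ROW = "Max core file size"


def _to_limit(value: str) -> Limit:
    return value if value == "unlimited" else int(value)


def parse_core_limit(limits_text: str) -> Optional[Tuple[Limit, Limit]]:
    """Return (soft, hard) from the text of a /proc/<pid>/limits file."""
    for line in limits_text.splitlines():
        if line.startswith(ROW):
            # Remaining columns are: soft limit, hard limit, units.
            soft, hard = line[len(ROW):].split()[:2]
            return _to_limit(soft), _to_limit(hard)
    return None


def core_limit_of(pid: int) -> Optional[Tuple[Limit, Limit]]:
    """Read the core-file limit of a running process, e.g. mongod's PID."""
    return parse_core_limit(Path(f"/proc/{pid}/limits").read_text())
```

A soft value of 0, as described in this ticket, means no core file is written. On Python 3.4+ on Linux, `resource.prlimit(pid, resource.RLIMIT_CORE)` returns the same pair as integers (with `resource.RLIM_INFINITY` for unlimited), but querying another process that way may require running as the same user as mongod or as root.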
| Comment by Dharshan Rangegowda [ 04/Nov/13 ] | |
|
Is the segfault line in lib-2.212.so not useful enough for debugging? What is the default logging level? If it happens again, I can update the bug. Here are my ulimit values: | |
| Comment by Samantha Ritter (Inactive) [ 04/Nov/13 ] | |
|
Hi Dharshan, I'm very sorry for the delayed response. Are you still having this same problem? Has this bug appeared again? What logging level are you currently running with? You can follow the instructions here to set the log level as high as 5: http://docs.mongodb.org/manual/reference/parameters/#param.logLevel As for the core dump, what are your ulimit values set to? Can you run "ulimit -a"? | |
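For reference, the runtime change covered by the linked parameter documentation is an admin command with the document `{setParameter: 1, logLevel: <n>}`; in the mongo shell that is `db.adminCommand({setParameter: 1, logLevel: 5})`. The Python sketch below only builds and range-checks that document; `log_level_command` is a hypothetical name, and sending it with PyMongo is shown as usage, assuming the driver is installed.

```python
from typing import Dict

MAX_LOG_LEVEL = 5  # the highest level mentioned in this ticket


def log_level_command(level: int) -> Dict[str, int]:
    """Build the admin command document that changes mongod's log level at runtime."""
    if not 0 <= level <= MAX_LOG_LEVEL:
        raise ValueError(f"logLevel must be between 0 and {MAX_LOG_LEVEL}, got {level}")
    # The command name must be the first key; Python 3.7+ dicts keep insertion order.
    return {"setParameter": 1, "logLevel": level}
```

With PyMongo, against a hypothetical local instance: `MongoClient("localhost", 27017).admin.command(log_level_command(5))`. Level 5 is very verbose, so it is worth setting it back to the default of 0 once the crash has been captured.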
| Comment by Dharshan Rangegowda [ 09/Oct/13 ] | |
|
Hi - I had another repro of this bug on the same box where I reported the earlier segfault. I've now had about three repros of this bug over the course of a month. All the crashes have been on the secondary, and there are no entries in the syslog or in the mongod log. Is there some setting I can turn on to get more verbose logging? | |
| Comment by Dharshan Rangegowda [ 03/Oct/13 ] | |
|
Scott - I was able to repro the issue on a different cluster. The process crashed with no logs. Here is what I found in the system log: Sep 29 03:27:01 SG-edSpringProdSSD-1285 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="996" x-info="http://www.rsyslog.com"] rsyslogd was HUPed. Shouldn't there be a crash dump for a segfault? I looked in /usr/bin, where my mongod is installed, and did not find any crash dumps. | |
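On Linux, where a core file lands is controlled by `/proc/sys/kernel/core_pattern`, not by where the binary is installed: a pattern without a leading `/` is resolved against the crashing process's current working directory, a pattern starting with `/` is an absolute path, and a pattern starting with `|` hands the dump to a helper program. A minimal Python sketch that classifies the pattern, with `classify_core_pattern` and `current_core_pattern` as hypothetical helper names:

```python
from pathlib import Path
from typing import Tuple

CORE_PATTERN_FILE = "/proc/sys/kernel/core_pattern"


def classify_core_pattern(pattern: str) -> Tuple[str, str]:
    """Return (kind, target) for a kernel core_pattern value.

    kind is "pipe" (target is the helper program), "absolute" (target is
    the directory the core is written to) or "relative" (target is the
    pattern, resolved against the crashing process's working directory).
    """
    pattern = pattern.strip()
    if pattern.startswith("|"):
        words = pattern[1:].split()
        return "pipe", words[0] if words else ""
    if pattern.startswith("/"):
        return "absolute", str(Path(pattern).parent)
    return "relative", pattern


def current_core_pattern() -> Tuple[str, str]:
    """Classify this machine's core_pattern setting (Linux only)."""
    return classify_core_pattern(Path(CORE_PATTERN_FILE).read_text())
```

With the common default pattern `core`, the dump would be written to mongod's working directory, and only if the core file size limit is not 0.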
| Comment by Dharshan Rangegowda [ 13/Sep/13 ] | |
|
Not yet. I am going to give it a few more days and will report back if it occurs. | |
| Comment by Scott Hernandez (Inactive) [ 13/Sep/13 ] | |
|
Has this happened again? If so, when? I didn't see anything in MMS that jumped out. There is no easy way I know of to detect EBS issues, other than checking the kernel logs on the client with dmesg or in /var/log. | |
| Comment by Dharshan Rangegowda [ 10/Sep/13 ] | |
|
Here you go - https://mms.mongodb.com/host/detail/5215392c7ec5df2d7bff16d8/addd82321ed56deb3046c432819dc3ec | |
| Comment by Scott Hernandez (Inactive) [ 10/Sep/13 ] | |
|
If you click on the host in MMS, you will get a URL like https://mms.mongodb.com/host/detail/<id>; that is what we are looking for. Alternatively, send the group name shown in the upper left-hand corner of the MMS page. | |
| Comment by Dharshan Rangegowda [ 10/Sep/13 ] | |
|
Can you tell me what you mean by MMS link? Here is the name of the machine in MMS: SG-aviarymongo-1101.servers.mongodirector.com. By the way, the crash happens at a time of zero load on the system. | |
| Comment by Gregor Macadam [ 10/Sep/13 ] | |
|
Hi, can you include the MMS link as Scott requested? Thanks. | |
| Comment by Dharshan Rangegowda [ 10/Sep/13 ] | |
|
Hi Scott, it is a production cluster, so an immediate version upgrade is not possible. The OS is Amazon Linux (a variant of CentOS). | |
| Comment by Scott Hernandez (Inactive) [ 10/Sep/13 ] | |
|
I'd suggest upgrading to 2.4.6, although that is unrelated to this incident. Can you post the MMS link or group name so I can take a look at the stats? You might want to grep the whole /var/log/ directory for mongod to see if anything else has related messages. What AMI/distro are you using (some Red Hat/CentOS distro)? | |
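A rough Python 3 equivalent of grepping the whole `/var/log/` directory for `mongod`, assuming read permission on the files (usually root): it walks the tree, also reads gzip-compressed rotated logs, and skips the binary login-accounting files. `grep_logs` and `SKIP_PREFIXES` are hypothetical names; `grep -r mongod /var/log/` from a shell does much the same for uncompressed files.

```python
import gzip
import os
from typing import Iterator, Tuple

# Binary login-accounting files commonly found under /var/log; not text logs.
SKIP_PREFIXES = ("lastlog", "wtmp", "btmp")


def grep_logs(root: str = "/var/log", needle: str = "mongod") -> Iterator[Tuple[str, int, str]]:
    """Yield (path, line number, line) for every log line containing `needle`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.startswith(SKIP_PREFIXES):
                continue
            path = os.path.join(dirpath, name)
            # Rotated logs are often gzip-compressed.
            opener = gzip.open if name.endswith(".gz") else open
            try:
                with opener(path, "rt", errors="replace") as fh:
                    for lineno, line in enumerate(fh, start=1):
                        if needle in line:
                            yield path, lineno, line.rstrip("\n")
            except (OSError, EOFError):
                # Unreadable, truncated, or not actually gzip: skip this file.
                continue
```

Usage: `for path, lineno, line in grep_logs(): print(f"{path}:{lineno}: {line}")`, which prints results in the same `path:line:text` shape as `grep -rn`.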
| Comment by Dharshan Rangegowda [ 09/Sep/13 ] | |
|
1. There are no messages about mongod in /var/log/messages. I think this is the syslog file on Amazon Linux; please let me know otherwise. Another note: I have munin and the MMS agent installed, and I am running version 2.4.1. Is there anything else I can configure so that we have more information the next time it happens? | |
| Comment by Scott Hernandez (Inactive) [ 09/Sep/13 ] | |
|
I'd suggest searching for "mongod" to see if there are any messages about the process, and looking at dmesg as well. Do you have a syslog log? If so, it might contain some indication from around the same time. Can you look for a dump file in the working directory where you started the process, to see if it crashed? It is very unusual, and not something I've seen, for there to be no message in the system logs, no dump file, and nothing in the mongod log you provided. Is there any chance that the device being logged to went offline or became read-only at that time? How is mongod started, and what is starting it? | |
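To act on the dump-file suggestion, the sketch below (Python 3 on Linux) reads a running process's working directory from `/proc/<pid>/cwd` and lists files whose names match the kernel's default core names, `core` or `core.<pid>` (the latter when `kernel.core_uses_pid` is 1). It assumes the default `core_pattern`; `process_cwd` and `find_core_files` are hypothetical names, and reading another user's `/proc/<pid>/cwd` usually requires root.

```python
import os
import re
from typing import List

# Assumed default core file names: "core", or "core.<pid>" with core_uses_pid=1.
CORE_NAME = re.compile(r"^core(\.\d+)?$")


def process_cwd(pid: int) -> str:
    """Return the current working directory of a running process via /proc."""
    return os.readlink(f"/proc/{pid}/cwd")


def find_core_files(directory: str) -> List[str]:
    """Return paths of regular files in `directory` named like default core dumps."""
    found = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if CORE_NAME.match(name) and os.path.isfile(path):
            found.append(path)
    return found
```

Because a crashed mongod no longer has a `/proc` entry, `process_cwd` is only useful on a running instance started the same way, for example `find_core_files(process_cwd(mongod_pid))`, where `mongod_pid` is a placeholder for the actual PID.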
| Comment by Dharshan Rangegowda [ 09/Sep/13 ] | |
|
Hi Scott, I already checked /var/log/messages and I see no messages from the OOM killer around the time mongo died. Are there any other logs I need to check? | |
| Comment by Scott Hernandez (Inactive) [ 09/Sep/13 ] | |
|
Please check the system logs and make sure that the OOM killer did not terminate the process. If it did, please read more here: http://docs.mongodb.org/manual/administration/production-notes/#swap |
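A minimal Python 3 sketch of that check, scanning syslog text for OOM-killer lines that mention mongod. The marker strings are an assumption: the kernel's exact wording (for example "Out of memory: Kill process ..." or "Killed process ...") varies between kernel versions, so an empty result only means nothing matched these markers, not that an OOM kill is ruled out. `find_oom_kills` and `scan_syslog` are hypothetical names.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed lowercase markers; kernel OOM message wording differs by version.
# "invoked oom-killer" means the named process triggered the OOM killer,
# which is not necessarily the same as that process being killed.
OOM_MARKERS = ("out of memory", "killed process", "invoked oom-killer")


def find_oom_kills(lines: Iterable[str], process: str = "mongod") -> List[str]:
    """Return lines that mention `process` alongside an OOM-killer marker."""
    process = process.lower()
    hits = []
    for line in lines:
        lowered = line.lower()
        if process in lowered and any(marker in lowered for marker in OOM_MARKERS):
            hits.append(line.rstrip("\n"))
    return hits


def scan_syslog(path: str = "/var/log/messages", process: str = "mongod") -> List[str]:
    """Scan one syslog file; it is typically readable only by root."""
    with Path(path).open(errors="replace") as fh:
        return find_oom_kills(fh, process)
```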