[SERVER-14647] killed by SEGV signal Created: 22/Jul/14  Updated: 10/Dec/14  Resolved: 15/Aug/14

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.6.3
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Joel Moss Assignee: J Rassi
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive Archive.zip     PNG File Screen Shot 2014-07-22 at 21.01.00.png     File log.tar.gz    
Operating System: ALL
Participants:

 Description   

We have a three-member replica set. Since upgrading to 2.6.3 a few weeks ago, at least two of the three members have experienced this error on two separate occasions:

    init: mongod main process (28391) killed by SEGV signal

A mongo restart fixes it.

The mongo logs show nothing between the time this error appeared in the syslog and the time we restarted mongo.

2014-07-22T06:56:37.343+0000 [initandlisten] connection accepted from 10.93.168.17:58455 #457833 (1073 connections now open)
2014-07-22T07:05:14.485+0000 ***** SERVER RESTARTED *****
2014-07-22T07:05:14.490+0000 [initandlisten] MongoDB starting : pid=20739 port=27017 dbpath=/data 64-bit host=mongo4
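
The gap shown in the excerpt above can be located for every restart in a log by printing the line that immediately precedes each restart marker. This is a minimal sketch, assuming a grep that supports `-B` (GNU or BSD) and a log path passed as the first argument; the function name `last_line_before_restart` is illustrative and not part of MongoDB.

```shell
#!/bin/sh
# Print each "SERVER RESTARTED" marker together with the one line logged
# just before it, so the silent gap before every restart is visible.
# -F treats the asterisks in the marker as literal text, not a regex.
last_line_before_restart() {
    grep -B 1 -F -e '***** SERVER RESTARTED *****' "$1"
}
```

Calling `last_line_before_restart /path/to/mongod.log` (the path is a placeholder) prints pairs of lines; the difference between the two timestamps in each pair is the window in which nothing was logged.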

This has happened too many times now. Can anyone help, or give me some ideas about where to look for more data? thx



 Comments   
Comment by J Rassi [ 15/Aug/14 ]

I haven't heard back in some time, so I'm resolving this ticket as "cannot reproduce". Please re-open the ticket if you have since been able to reproduce this issue with the verbose log.

Comment by J Rassi [ 25/Jul/14 ]

jmoss@codio.com: just checking in – have you encountered the issue since adding "-v"?

Comment by J Rassi [ 22/Jul/14 ]

Noted. I suppose we'll wait and see if the verbose server log and/or mongostat.log contain any leads.

Comment by Joel Moss [ 22/Jul/14 ]

It's definitely not RAM. I have attached the RAM usage over the last 24 hours.

This is the upstart script:

#
# Automatically Generated by Chef, do not edit directly!
#
 
limit as unlimited unlimited
limit cpu unlimited unlimited
limit fsize unlimited unlimited
limit nofile 64000 64000
limit nproc 32000 32000
limit rss unlimited unlimited
 
kill timeout 300 # wait 300s between SIGTERM and SIGKILL.
 
start on runlevel [2345]
stop on runlevel [06]
 
script
  NAME=mongod
  ENABLE_MONGODB="yes"
  if [ -f /etc/default/mongodb ]; then
    . /etc/default/mongodb;
  fi
  if [ "x$ENABLE_MONGOD" = "xyes" ]; then
  exec start-stop-daemon --start --quiet --chuid $DAEMON_USER --exec $DAEMON -- $DAEMON_OPTS
  fi
end script

Comment by Joel Moss [ 22/Jul/14 ]

RAM usage

Comment by J Rassi [ 22/Jul/14 ]

Thanks. Don't see any smoking gun yet. Can you also run "mongostat > mongostat.log" in the background on this machine, and upload the contents of this file after observing the crash once more? I'd like to see if the crash is correlated to high memory usage (perhaps it's a NULL dereference after an out-of-memory condition, which could explain the lack of crash-related log output – your mongod startup script doesn't modify /proc/<pid>/oom_adj, does it?), or a spike of a certain type of operation, etc.
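
The two checks requested above could be scripted roughly as follows. This is only a sketch under stated assumptions: `mongostat` is on the PATH, the function names are illustrative, and the second argument to `read_oom_adj` is a hypothetical hook for pointing at a different proc root (it defaults to `/proc`).

```shell
#!/bin/sh
# Start mongostat in the background and capture its output in a file
# (default: mongostat.log). Prints the PID so it can be stopped later.
start_mongostat() {
    log_file="${1:-mongostat.log}"
    nohup mongostat > "$log_file" 2>&1 &
    echo "$!"
}

# Print the OOM-killer adjustment recorded for a process.
# $1 = PID, $2 = proc root (hypothetical override, defaults to /proc).
read_oom_adj() {
    pid="$1"
    proc_root="${2:-/proc}"
    if [ -r "$proc_root/$pid/oom_adj" ]; then
        cat "$proc_root/$pid/oom_adj"
    else
        echo "unavailable"
        return 1
    fi
}
```

For example, `read_oom_adj "$(pidof mongod)"` reports the current setting for the running server; a value of 0 is the kernel default and means nothing has adjusted it.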

Comment by Joel Moss [ 22/Jul/14 ]

Attached dmesg (where we saw the segfault) and syslog.

Also, this just happened again. I have since appended -v to mongod, so we should see more if it happens again. thx

Comment by J Rassi [ 22/Jul/14 ]

A couple more requests for information:

  • Could you upload a file containing the output of running the "dmesg" command on this machine, and upload the contents of /var/log/syslog?
  • And, assuming /var/log/messages exists on your system and you observed the "killed by SEGV signal" message in that file, could you upload that file as well? If not, where did you observe this message?
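
To answer the question of where the message was logged, the candidate files can be searched for it directly. A minimal sketch, assuming a grep that supports `-H`; the function name `find_segv_messages` is illustrative.

```shell
#!/bin/sh
# Search the given log files for the init "killed by SEGV signal" message.
# Unreadable or missing files are skipped. Each match is prefixed with its
# file name. Returns 0 if at least one match was found, 1 otherwise.
find_segv_messages() {
    found=1
    for f in "$@"; do
        [ -r "$f" ] || continue
        if grep -H "killed by SEGV signal" "$f"; then
            found=0
        fi
    done
    return "$found"
}
```

A typical call would be `find_segv_messages /var/log/syslog /var/log/messages`. The kernel ring buffer can be saved for upload separately with `dmesg > dmesg.txt`.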

Thanks.

Comment by Joel Moss [ 22/Jul/14 ]

mongo log file

Comment by J Rassi [ 22/Jul/14 ]

I understand that you mentioned in the original ticket description that the log does not contain output at the time of crash. That being said, note that the log still provides a wealth of valuable context (startup information, warnings, replica set election information, an idea of what "normal usage" looks like), so please do reconsider my request for uploading it to the ticket. The log of the member that crashed while in state primary would be the most helpful of the three.

In addition:

  • Could you upload the output of running "db.adminCommand('getCmdLineOpts')" and "rs.conf()" at the mongo shell?
  • Are you using the "--syslog" command-line option? If yes, are you willing to restart your cluster with the "--logpath" option instead (I suspect that the "syslog" option could be related to the lack of output in the log file during the crash), in the hopes of generating a stack trace in the log when you next reproduce this issue? If no, are you willing to restart your server with the log level 1 (the "-v" option), in the hopes of helping narrow down the set of operations related to the problem when you next reproduce this issue?
  • Could you provide the timestamp of the "init: mongod main process (28391) killed by SEGV signal" message? This will be helpful for correlating the event to an entry in the primary's oplog, if in fact a write operation was the cause of the crash.
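
The first request above can be captured non-interactively so the output is easy to attach. A sketch only, assuming the `mongo` shell is on the PATH and connects to the local member by default; the function name and the `MONGO_BIN` override variable are hypothetical and exist only for illustration.

```shell
#!/bin/sh
# Write the output of getCmdLineOpts and rs.conf() to a file
# (default: mongod-config.txt) using the mongo shell's --eval option.
# MONGO_BIN is a hypothetical override for the shell binary to run.
dump_mongod_config() {
    out_file="${1:-mongod-config.txt}"
    mongo_bin="${MONGO_BIN:-mongo}"
    "$mongo_bin" --quiet --eval \
        'printjson(db.adminCommand("getCmdLineOpts")); printjson(rs.conf());' \
        > "$out_file"
}
```

Running `dump_mongod_config` on each member produces one file per member. The getCmdLineOpts output also answers the "--syslog" versus "--logpath" question, since it lists the options mongod was started with.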

Comment by Joel Moss [ 22/Jul/14 ]

  • The log is empty from the time it crashed to the time I restarted it, so no data is shown.
  • Version before was 2.4.x, but that was on different servers. This is a completely new replica set.
  • The primary failed first, then one of the secondaries, but I'm not sure whether that secondary was the newly elected primary. The third member stayed up.
  • Nothing was happening other than normal usage.

thx

Comment by J Rassi [ 22/Jul/14 ]

Hi,

I'll need additional information to further diagnose this issue:

  • Could you upload the full mongod log for a member that experienced one of these crashes?
  • What version of MongoDB were you running before this upgrade?
  • When you encountered this issue, did you experience crashes of the primary member, or only secondary members? Did multiple members ever crash at the same time?
  • Do you know what action was being performed on the cluster during each of the crashes?

Thanks.

~ Jason Rassi
