[SERVER-13134] Mongo shutting down randomly after upgrade to Ubuntu 12.04 Created: 11/Mar/14 Updated: 05/May/14 Resolved: 03/Apr/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Justin Lintz | Assignee: | Stennie Steneker (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | replicaset | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
|
We have a few replicasets and we've seen on a few secondary servers the mongodb process shutting down randomly with this error message right before the shutdown (replaced hostname with PRIMARY).
The issue did not occur on Ubuntu 10.04. We are using version 2.4.7, the debian from the mongodb repo. The issue has occurred on a couple of different replicasets. |
| Comments |
| Comment by Thomas Rueckstiess [ 03/Apr/14 ] | |
|
Thanks for letting us know, Justin. I'm closing this as requested. Thomas | |
| Comment by Justin Lintz [ 27/Mar/14 ] | |
|
Ok, We sort of tracked down this issue, but not really sure how it's exactly occurring. On the servers where mongod was being stopped, puppet had stale .pid files on a few of the servers for the puppet agent under /var/run/puppet/agent.pid. Puppet was set to ensure the service was not running. We tracked down the time mongod was being stopped to being a few seconds after a puppet run would finish and in the puppet logs we saw a line for ensuring the puppet service was not running. First thought was the agent.pid file somehow had the same process id of mongo running, but they did not match, nor were they even close. It also wasn't consistent in occurring. Puppet wasn't always ensuring the puppet service was stopped, and could not really determine why it was sometimes running and sometimes wasnt, but everytime it did, somehow mongod got shutdown. I'm at a loss for how it's occurring, it could be an issue in puppet, or in upstart. Anyway, once we removed the stale agent.pid files, the issue stopped occurring. We can close out this issue | |
| Comment by Justin Lintz [ 20/Mar/14 ] | |
|
Is there any additional information that we could provide that would help troubleshoot this? | |
| Comment by Justin Lintz [ 18/Mar/14 ] | |
|
Stephen, Thanks, just sent the log to the scp dumpsite. | |
| Comment by Stennie Steneker (Inactive) [ 18/Mar/14 ] | |
|
Hi Justin, If you're concerned about privacy of the logs, you can upload them to a private scp dropbox (accessible only to MongoDB team members) with:
Just hit enter when prompted for a password. If you prefer to filter before uploading, that is certainly your prerogative .. however this may remove some useful context from the logs. Thanks, | |
| Comment by Justin Lintz [ 17/Mar/14 ] | |
|
In a lot of cases, right before the signal was received we saw this in the logs [rsSyncNotifier] DBClientCursor::init call() failed | |
| Comment by Justin Lintz [ 17/Mar/14 ] | |
|
Stephen,
legacydata:SECONDARY> getBuildInfo()
The issue occurs about once a day on one of the clusters , on another it's less frequent. The exits occur at different times each time. | |
| Comment by Stennie Steneker (Inactive) [ 17/Mar/14 ] | |
|
Hi Justin, Can you please:
How frequently does this issue occur (and is it still occurring)? The log information in the issue description isn't complete; was there also a "got signal 15" message prior to the shutdown? As you've noted, "got signal 15" (SIGTERM) normally indicates the process was terminated by a user. Hopefully there will be more context in the full mongod log. Thanks, | |
| Comment by Justin Lintz [ 12/Mar/14 ] | |
|
Just occurred again today on the same server, this time the log on the secondary had this Wed Mar 12 13:13:16.590 [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends Can't seem to find anything that would be sending a SIGTERM to mongod. | |
| Comment by Justin Lintz [ 12/Mar/14 ] | |
| |
| Comment by Daniel Pasette (Inactive) [ 12/Mar/14 ] | |
|
A few questions:
|