[SERVER-13134] Mongo shutting down randomly after upgrade to Ubuntu 12.04 Created: 11/Mar/14  Updated: 05/May/14  Resolved: 03/Apr/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.7
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Justin Lintz Assignee: Stennie Steneker (Inactive)
Resolution: Done Votes: 0
Labels: replicaset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

We have a few replica sets, and on several secondary servers we've seen the mongod process shut down randomly with this error message right before the shutdown (hostname replaced with PRIMARY):

Mon Mar 10 17:43:41.949 [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server:PRIMARY:27017
Mon Mar 10 17:43:41.953 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server:PRIMARY:27017

Mon Mar 10 17:43:43.073 [signalProcessingThread]                Progress: 43/241        17%    (File Closing Progress)

Mon Mar 10 17:43:45.050 [signalProcessingThread]                Progress: 192/241      79%    (File Closing Progress)

Mon Mar 10 17:43:45.599 [signalProcessingThread] closeAllFiles() finished

Mon Mar 10 17:43:45.599 [signalProcessingThread] shutdown: removing fs lock...

The issue did not occur on Ubuntu 10.04. We are using version 2.4.7, installed from the Debian package in the MongoDB repo. The issue has occurred on a couple of different replica sets.



 Comments   
Comment by Thomas Rueckstiess [ 03/Apr/14 ]

Thanks for letting us know, Justin. I'm closing this as requested.

Thomas

Comment by Justin Lintz [ 27/Mar/14 ]

OK, we've more or less tracked down this issue, but we're not entirely sure how exactly it's occurring.

On the servers where mongod was being stopped, Puppet had stale .pid files for the Puppet agent under /var/run/puppet/agent.pid, and Puppet was configured to ensure the puppet service was not running. We traced each mongod stop to a few seconds after a Puppet run finished, and in the Puppet logs we saw a line for ensuring the puppet service was not running. Our first thought was that the agent.pid file somehow contained the process ID of the running mongod, but the PIDs did not match, nor were they even close. The behaviour also wasn't consistent: Puppet wasn't always stopping the puppet service, and we could not determine why it sometimes did and sometimes didn't, but every time it did, mongod somehow got shut down.

I'm at a loss as to how it's happening; it could be an issue in Puppet or in upstart. In any case, once we removed the stale agent.pid files, the issue stopped occurring. We can close out this issue.
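
For anyone hitting something similar, a quick sanity check along these lines (just a sketch; the pidfile path is the one mentioned above) shows whether the PID recorded in a stale pidfile is dead or has been reused by an unrelated process:

 # Sketch: inspect the stale Puppet agent pidfile and report what the
 # recorded PID actually belongs to now (it may have been reused).
 PIDFILE=/var/run/puppet/agent.pid
 if [ -f "$PIDFILE" ]; then
     pid=$(cat "$PIDFILE")
     if [ -d "/proc/$pid" ]; then
         echo "PID $pid from $PIDFILE is currently running as:"
         tr '\0' ' ' < "/proc/$pid/cmdline"; echo
     else
         echo "PID $pid from $PIDFILE is not running; the pidfile is stale"
     fi
 fi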

Comment by Justin Lintz [ 20/Mar/14 ]

Is there any additional information that we could provide that would help troubleshoot this?

Comment by Justin Lintz [ 18/Mar/14 ]

Stephen,

Thanks, just sent the log to the scp dumpsite.

Comment by Stennie Steneker (Inactive) [ 18/Mar/14 ]

Hi Justin,

If you're concerned about privacy of the logs, you can upload them to a private scp dropbox (accessible only to MongoDB team members) with:

 scp -P 722 <filename> SERVER-13134@www.mongodb.com:

Just hit enter when prompted for a password.

If you prefer to filter before uploading, that is certainly your prerogative; however, this may remove some useful context from the logs.
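
If filtering is needed, something along these lines (a sketch only; the log path and output filename are just examples) masks IPv4 addresses while leaving the rest of each line intact:

 # Mask dotted-quad IP addresses in the mongod log before uploading.
 # /var/log/mongodb/mongodb.log and the output name are example paths.
 sed -r 's/([0-9]{1,3}\.){3}[0-9]{1,3}/x.x.x.x/g' /var/log/mongodb/mongodb.log > mongodb-redacted.log
 scp -P 722 mongodb-redacted.log SERVER-13134@www.mongodb.com: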

Thanks,
Stephen

Comment by Justin Lintz [ 17/Mar/14 ]

In a lot of cases, right before the signal was received, we saw this in the logs:

[rsSyncNotifier] DBClientCursor::init call() failed

Comment by Justin Lintz [ 17/Mar/14 ]

Stephen,

  • We have a lot of IPs and queries in the logs, which we aren't comfortable attaching to the ticket. Can I filter those out?

legacydata:SECONDARY> getBuildInfo()
{
"version" : "2.4.7",
"gitVersion" : "0161738abf06c1f067b56a465b706efd6f4bf2aa",
"sysInfo" : "Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49",
"loaderFlags" : "-fPIC -pthread -rdynamic",
"compilerFlags" : "-Wnon-virtual-dtor -Woverloaded-virtual -fPIC -fno-strict-aliasing -ggdb -pthread -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -Werror -pipe -fno-builtin-memcmp -O3",
"allocator" : "tcmalloc",
"versionArray" : [
2,
4,
7,
0
],
"javascriptEngine" : "V8",
"bits" : 64,
"debug" : false,
"maxBsonObjectSize" : 16777216
}

  • These are standalone replica sets, with no sharding enabled

The issue occurs about once a day on one of the clusters; on another it's less frequent. The exits occur at different times, with no consistent pattern.

Comment by Stennie Steneker (Inactive) [ 17/Mar/14 ]

Hi Justin,

Can you please:

  • upload the full log file from one of the secondaries including a period where you received the original error
  • comment with the output of getBuildInfo() from the mongo shell
  • confirm if these are standalone replica sets, or part of a sharded cluster

"The issue has occurred on a couple of different replicasets."

How frequently does this issue occur (and is it still occurring)?

The log information in the issue description isn't complete; was there also a "got signal 15" message prior to the shutdown?

As you've noted, "got signal 15" (SIGTERM) normally indicates the process was terminated by a user. Hopefully there will be more context in the full mongod log.

Thanks,
Stephen

Comment by Justin Lintz [ 12/Mar/14 ]

Just occurred again today on the same server, this time the log on the secondary had this

Wed Mar 12 13:13:16.590 [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
Wed Mar 12 13:13:16.590 [signalProcessingThread] now exiting
Wed Mar 12 13:13:16.590 dbexit:

We can't seem to find anything that would be sending a SIGTERM to mongod.
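
One way to catch the sender next time (a sketch only, assuming the auditd package is installed on these 64-bit hosts) is to audit the kill() syscall and then search the audit log after the next unexpected shutdown; the audit record includes the pid, uid, and executable of whatever sent the signal:

 # Log every kill() syscall; the key "mongod_sigterm" is just an example name.
 auditctl -a exit,always -F arch=b64 -S kill -k mongod_sigterm
 # After the next unexpected shutdown, look up who called kill():
 ausearch -k mongod_sigterm --interpret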

Comment by Justin Lintz [ 12/Mar/14 ]
  • This is during normal replication
  • The only thing on the primary logs at that time was a message about it losing contact with the secondary server
  • Intermittent, and only on secondaries. Talking to a coworker, it appears it's only happened on 2 of our clusters, which are each > 2 TB of data; our smaller ones haven't seen any issues
Comment by Daniel Pasette (Inactive) [ 12/Mar/14 ]

A few questions:

  • Is this during initial sync or during normal replication on the secondary?
  • Are there any messages in the primary which line up?
  • Is this intermittent or happening on each new secondary you bring up?