[SERVER-6839] SECONDARY keeps crashing Created: 24/Aug/12  Updated: 21/Sep/12  Resolved: 21/Sep/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kerim Satirli Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: mongod-shardsvr.log, mongod-shardsvr.log
Operating System: Linux
Participants:

 Description   

A SECONDARY in our replica set keeps crashing for an unknown reason.

Our setup looks like this:

  • PRIMARY (2.0.6)
  • SECONDARY (2.0.6)
  • SECONDARY (2.2RC0)
  • ARBITER (2.0.6)

The first secondary (2.0.6) keeps crashing at seemingly random moments. The log file shows only this as the final part:

set1
Thu Aug 23 13:18:44 [PeriodicTask::Runner] task: WriteBackManager::cleaner took: 5ms
Thu Aug 23 13:18:45 [PeriodicTask::Runner] task: DBConnectionPool-cleaner took: 48ms
Thu Aug 23 13:18:45 [conn295] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db3.set1:27018", $auth: {} } ntoreturn:1 reslen:120 130ms
Thu Aug 23 13:18:45 [conn296] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db1.set1:27018" } ntoreturn:1 reslen:120 130ms
Thu Aug 23 13:18:45 [PeriodicTask::Runner] task: DBConnectionPool-cleaner took: 5ms
Thu Aug 23 13:18:50 [conn297] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db0.set1:27018" } ntoreturn:1 reslen:120 229ms
Thu Aug 23 13:18:50 [conn295] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db3.set1:27018", $auth: {} } ntoreturn:1 reslen:120 229ms
Thu Aug 23 13:18:50 [conn296] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db1.set1:27018" } ntoreturn:1 reslen:120 229ms
Thu Aug 23 13:18:52 [initandlisten] connection accepted from 10.59.62.168:47834 #298
Thu Aug 23 13:18:53 [conn296] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db1.set1:27018" } ntoreturn:1 reslen:120 340ms
Thu Aug 23 13:18:53 [conn297] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db0.set1:27018" } ntoreturn:1 reslen:120 340ms
Thu Aug 23 13:18:53 [conn295] end connection 10.59.62.168:47831
Thu Aug 23 13:18:56 [conn296] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db1.set1:27018" } ntoreturn:1 reslen:120 372ms
Thu Aug 23 13:18:56 [conn297] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db0.set1:27018" } ntoreturn:1 reslen:120 372ms
Thu Aug 23 13:18:57 [conn298] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db3.set1:27018", $auth: {} } ntoreturn:1 reslen:120 326ms
Thu Aug 23 13:19:00 [conn296] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db1.set1:27018" } ntoreturn:1 reslen:120 522ms
Thu Aug 23 13:19:00 [conn297] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db0.set1:27018" } ntoreturn:1 reslen:120 522ms
Thu Aug 23 13:19:00 [conn298] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db3.set1:27018", $auth: {} } ntoreturn:1 reslen:120 522ms
		42625600/44589575	95%
Thu Aug 23 13:19:04 [conn296] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db1.set1:27018" } ntoreturn:1 reslen:120 142ms
Thu Aug 23 13:19:04 [conn298] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db3.set1:27018", $auth: {} } ntoreturn:1 reslen:120 142ms
Thu Aug 23 13:19:04 [conn297] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db0.set1:27018" } ntoreturn:1 reslen:120 142ms
Thu Aug 23 13:19:07 [initandlisten] connection accepted from 10.230.41.239:52398 #299
Thu Aug 23 13:19:07 [conn296] end connection 10.230.41.239:52393
Thu Aug 23 13:19:07 [conn298] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db3.set1:27018", $auth: {} } ntoreturn:1 reslen:120 101ms
Thu Aug 23 13:19:07 [conn297] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db0.set1:27018" } ntoreturn:1 reslen:120 101ms
Thu Aug 23 13:19:08 [conn299] command admin.$cmd command: { replSetHeartbeat: "set1", v: 9, pv: 1, checkEmpty: false, from: "db1.set1:27018" } ntoreturn:1 reslen:120 326ms

Here's some more information:

  • all instances run on EC2 w/ a couple of extra volumes each for storage
  • all installs of MongoDB (except for 2.2) were done through the official YUM repo
  • nothing in our config changed in the past few days
  • our processed data volume did not increase in the past few days


 Comments   
Comment by Ian Whalen (Inactive) [ 21/Sep/12 ]

Closing, but please reopen if you have the core dumps (or other info) available to attach.

Comment by Jeremy Mikola [ 05/Sep/12 ]

Is there a core dump available for either crash?

Comment by Kerim Satirli [ 27/Aug/12 ]

full log of the failing SECONDARY.

Disregard the filename; it stems from a time when we used to use replica sets. We have not yet gotten around to updating the names of the config file and log file.

Comment by Kerim Satirli [ 27/Aug/12 ]

Sure can.

Here's rs.conf();

 
rs.conf();
{
	"_id" : "set1",
	"version" : 9,
	"members" : [
		{
			"_id" : 1,
			"host" : "db1.set1:27018",
			"priority" : 100
		},
		{
			"_id" : 2,
			"host" : "db2.set1:27018",
			"priority" : 95
		},
		{
			"_id" : 3,
			"host" : "db0.set1:27018",
			"priority" : 0,
			"arbiterOnly" : true
		},
		{
			"_id" : 4,
			"host" : "db3.set1:27018",
			"priority" : 0
		}
	]
}

I will attach another log from when it crashed.

Comment by Jeremy Mikola [ 24/Aug/12 ]

Can you share rs.conf() as well?

Based on the log you shared, it appears that index creation (for the collection with 44589575 documents) runs from 1-99% and presumably completes, but then restarts, ultimately hanging at 95% from "Thu Aug 23 13:00:07" until the time of the crash at "Thu Aug 23 13:19:08".
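A minimal sketch of how to watch that build from the shell on the affected member (illustrative only; the "msg" filter and the output formatting are assumptions, not taken from the report):

// sketch only: list in-progress operations whose progress message mentions an
// index build; the "msg" field carries the same x/y counter seen in the log
db.currentOp(true).inprog.forEach(function (op) {
    if (op.msg && /index/.test(op.msg)) {
        print(op.opid + "  " + (op.ns || "") + "  " + op.msg);
    }
});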

Was there a core dump from the crash, or is the log the only evidence we have?

Comment by Kerim Satirli [ 24/Aug/12 ]

Hey Scott,

thanks for picking this up. What I mean by "crash" is an actual application crash: mongod no longer shows up in the process list and, naturally, is marked as unhealthy / unreachable in the rs.status() output.
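For reference, a rough sketch of what that looks like from a surviving member's shell (the output formatting is just illustrative):

// sketch only: from a member that is still up, print how the set sees each node;
// the crashed secondary shows health 0 and an unreachable state string
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  health=" + m.health);
});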

Here's the output from "free -ltm":

             total       used       free     shared    buffers     cached
Mem:          7455        103       7352          0          4         26
Low:          7455        103       7352
High:            0          0          0
-/+ buffers/cache:         72       7383
Swap:            0          0          0
Total:        7455        103       7352

I was indeed building an index on the primary, and the process finished there just fine. Some test queries also ran faster afterwards, which leads me to believe that everything worked out.
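A quick sketch of the kind of check I mean, with hypothetical collection and field names:

// sketch with hypothetical names: verify the new index exists and is chosen
db.mycollection.getIndexes();
db.mycollection.find({ myfield: "some value" }).explain().cursor;   // e.g. "BtreeCursor myfield_1" when the index is used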

Comment by Scott Hernandez (Inactive) [ 24/Aug/12 ]

Can you please include the full logs showing the time the "crash" happens and the restart? Can you define what you mean by "crash" if there is no error in the logs?

Also, what does free -ltm show?

From the logs it looks like you are building an index. Is this something you ran on the primary that is now replicating to the secondaries and being applied at the time of the "crash"?

Comment by Kerim Satirli [ 24/Aug/12 ]

For the sake of completeness, here's the output of rs.status():

 
rs.status();
{
	"set" : "set1",
	"date" : ISODate("2012-08-24T08:38:01Z"),
	"myState" : 2,
	"syncingTo" : "db3.set1:27018",
	"members" : [
		{
			"_id" : 1,
			"name" : "db1.set1:27018",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 189,
			"optime" : {
				"t" : 1345797478000,
				"i" : 14
			},
			"optimeDate" : ISODate("2012-08-24T08:37:58Z"),
			"lastHeartbeat" : ISODate("2012-08-24T08:38:00Z"),
			"pingMs" : 1
		},
		{
			"_id" : 2,
			"name" : "db2.set1:27018",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"optime" : {
				"t" : 1345715003000,
				"i" : 1
			},
			"optimeDate" : ISODate("2012-08-23T09:43:23Z"),
			"self" : true
		},
		{
			"_id" : 3,
			"name" : "db0.set1:27018",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			"uptime" : 191,
			"optime" : {
				"t" : 0,
				"i" : 0
			},
			"optimeDate" : ISODate("1970-01-01T00:00:00Z"),
			"lastHeartbeat" : ISODate("2012-08-24T08:38:00Z"),
			"pingMs" : 1
		},
		{
			"_id" : 4,
			"name" : "db3.set1:27018",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 191,
			"optime" : {
				"t" : 1345797469000,
				"i" : 7023
			},
			"optimeDate" : ISODate("2012-08-24T08:37:49Z"),
			"lastHeartbeat" : ISODate("2012-08-24T08:38:00Z"),
			"pingMs" : 7
		}
	],
	"ok" : 1
}

I also forgot to mention that the 2.2 SECONDARY is set up, via priorities, to never become primary. The reason is that we want to test 2.2 mainly in terms of how well it plays with the other members.
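A sketch of how that is expressed in the config (illustrative only; member index 3 matches db3.set1:27018 in the rs.conf() above, which already has priority 0):

// sketch only: run on the primary; pins the 2.2 member's priority to 0 so it
// can never be elected (index 3 corresponds to db3.set1:27018 above)
cfg = rs.conf();
cfg.members[3].priority = 0;
rs.reconfig(cfg);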
