[SERVER-15417] Arbiter didn't elect primary if OS is unreachable (except ping) Created: 26/Sep/14  Updated: 23/Jan/15  Resolved: 23/Jan/15

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.6.3
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Xavier Vdb
Assignee: Ramon Fernandez Marina
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-14139 Disk failure on one node can (eventua... Closed
Operating System: ALL
Participants:

 Description   

My VMs are hosted on ESX.

The virtual machine that hosts my master is unreachable (ack fail),
but a classic ping still works.

MongoDB shell version: 2.6.3
connecting to: XXXX:27017/admin
Socket recv() errno:104 Connection reset by peer X.X.X.X:27017

My master as seen from the secondary:

	"members" : [
		{
			"_id" : 0,
			"name" : "X.X.X.X:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 23194,
			"optime" : Timestamp(1411574324, 1),
			"optimeDate" : ISODate("2014-09-24T15:58:44Z"),
			"lastHeartbeat" : ISODate("2014-09-26T14:53:02Z"),
			"lastHeartbeatRecv" : ISODate("2014-09-26T14:53:10Z"),
			"pingMs" : 10109,
			"electionTime" : Timestamp(1411720008, 1),
			"electionDate" : ISODate("2014-09-26T08:26:48Z")
		},

I can't even force the master! (I can't connect over SSH to modify the priority.)

I'm stuck
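
A status fragment like the one above can be scanned for members that still report "health" : 1 but answer heartbeats very slowly. The following is a minimal sketch in plain JavaScript. The helper name findSuspectMembers and the 5000 ms threshold are assumptions chosen for illustration and are not part of MongoDB; the function only reads the name, health, stateStr and pingMs fields shown in the output.

```javascript
// Hypothetical helper (not a MongoDB API): list members that look healthy
// to the replica set but answer heartbeats very slowly.
function findSuspectMembers(status, maxPingMs) {
  // maxPingMs is an assumed threshold; pick one that fits your network.
  return (status.members || [])
    .filter(function (m) {
      // Members without a numeric pingMs (such as the local member) are skipped.
      return m.health === 1 && typeof m.pingMs === "number" && m.pingMs > maxPingMs;
    })
    .map(function (m) {
      return { name: m.name, stateStr: m.stateStr, pingMs: m.pingMs };
    });
}

// Sample input reduced to the fields from the status output in this ticket.
var sampleStatus = {
  members: [
    { _id: 0, name: "X.X.X.X:27017", health: 1, state: 1, stateStr: "PRIMARY", pingMs: 10109 }
  ]
};

var suspects = findSuspectMembers(sampleStatus, 5000);
console.log(JSON.stringify(suspects));
```

In the mongo shell on a reachable member, the same function could presumably be called as findSuspectMembers(rs.status(), 5000).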



 Comments   
Comment by Ramon Fernandez Marina [ 23/Jan/15 ]

xavier.vdb@gmail.com, the scenario you describe is contained in SERVER-14139, so I'm going to close this ticket as a duplicate. Please feel free to tune into SERVER-14139 for updates and vote for it.

Regards,
Ramón.

Comment by Xavier Vdb [ 02/Dec/14 ]

Hey Ramon,

"did the database files become read-only?"

Yes, journal + data

Comment by Ramon Fernandez Marina [ 02/Dec/14 ]

Apologies for the late response, xavier.vdb@gmail.com. I think the behavior you're observing is very similar to the one described in tickets like SERVER-12793, SERVER-14214 or SERVER-14139, where the primary continues to send heartbeats but is otherwise unable to operate normally for external reasons (e.g., network partitions or inaccessible storage).

You mention an NFS mount that went read-only; was MongoDB hosted there? In other words, did the database files become read-only? Or was this NFS mount used for something else, but mounted with the "hard" option?
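
The secondary's log in this ticket ("just heartbeated us, but our heartbeat failed: , not changing state") suggests why no election starts in this situation. The sketch below is a simplified model of that logged behavior in plain JavaScript, not mongod source code. The function names and the graceSeconds parameter are assumptions made for illustration.

```javascript
// Simplified model of the logged behavior; this is NOT mongod source code.
// graceSeconds is an assumed parameter, not a documented MongoDB setting.
function updateView(currentView, ourHeartbeatOk, secondsSinceTheirHeartbeat, graceSeconds) {
  if (ourHeartbeatOk) {
    return "up";
  }
  if (secondsSinceTheirHeartbeat <= graceSeconds) {
    // Mirrors "just heartbeated us, but our heartbeat failed ... not changing state".
    return currentView;
  }
  return "down";
}

// In this model a member only seeks an election once it sees the primary as down.
function seesPrimaryDown(view) {
  return view === "down";
}

// Stuck primary: our heartbeats fail, but it keeps heartbeating us.
var stuck = updateView("up", false, 2, 10);
// Fully dead primary: no recent heartbeat in either direction.
var dead = updateView("up", false, 60, 10);

console.log(stuck, seesPrimaryDown(stuck)); // up false
console.log(dead, seesPrimaryDown(dead));   // down true
```

Under these assumptions, a primary that can still send heartbeats is never marked down, so neither the secondary nor the arbiter has a reason to trigger an election.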

Comment by Xavier Vdb [ 01/Oct/14 ]

I have new information: the NFS mount was switched to read-only.

Comment by Xavier Vdb [ 29/Sep/14 ]

Secondary log:

2014-09-26T15:56:34.152+0200 [rsHealthPoll] can't authenticate to X.X.X.X_crashed:27017 (X.X.X.X) failed as internal user, error: DBClientBase::findN: transport error: X.X.X.X_crashed:27017 ns: local.$cmd query: { getnonce: 1 }

2014-09-26T15:56:44.154+0200 [rsHealthPoll] DBClientCursor::init call() failed
2014-09-26T15:56:44.175+0200 [rsHealthPoll] replset info X.X.X.X_crashed:27017 just heartbeated us, but our heartbeat failed: , not changing state

Arbiter log (same traces as on the secondary):

2014-09-26T15:56:17.520+0200 [rsHealthPoll] DBClientCursor::init call() failed
2014-09-26T15:56:17.520+0200 [rsHealthPoll] can't authenticate to X.X.X.X_crashed:27017 (X.X.X.X) failed as internal user, error: DBClientBase::findN: transport error: X.X.X.X_crashed:27017 ns: local.$cmd query: { getnonce: 1 }

...

There are no logs on the master from just before the crash.

Comment by Xavier Vdb [ 29/Sep/14 ]

rsa:PRIMARY> cfg = rs.conf()
{
	"_id" : "rsa",
	"version" : 5,
	"members" : [
		{
			"_id" : 0,
			"host" : "X.X.X.X_crashed:27017"
		},
		{
			"_id" : 1,
			"host" : "X.X.X.X:27017",
			"priority" : 0.5
		},
		{
			"_id" : 2,
			"host" : "X.X.X.X:27017",
			"arbiterOnly" : true
		}
	]
}
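
For a configuration like the one above, a priority change can be prepared as a plain object before it is applied anywhere. The sketch below is plain JavaScript; the helper name withPriority is hypothetical and not part of MongoDB, and the hosts are the anonymized values from this ticket. It selects the member by _id because the anonymized host strings are not unique.

```javascript
// Hypothetical helper (not a MongoDB API): return a copy of a replica set
// config in which one member, selected by _id, has a new priority.
function withPriority(cfg, memberId, priority) {
  // Deep copy via JSON; assumes the config holds only plain JSON-compatible values.
  var copy = JSON.parse(JSON.stringify(cfg));
  var found = false;
  copy.members.forEach(function (m) {
    if (m._id === memberId) {
      m.priority = priority;
      found = true;
    }
  });
  if (!found) {
    throw new Error("no member with _id " + memberId);
  }
  return copy;
}

// Configuration as posted in this ticket (hosts are anonymized placeholders).
var cfg = {
  _id: "rsa",
  version: 5,
  members: [
    { _id: 0, host: "X.X.X.X_crashed:27017" },
    { _id: 1, host: "X.X.X.X:27017", priority: 0.5 },
    { _id: 2, host: "X.X.X.X:27017", arbiterOnly: true }
  ]
};

// Give the unreachable member priority 0 so it is not eligible to be primary.
var newCfg = withPriority(cfg, 0, 0);
console.log(JSON.stringify(newCfg.members[0]));
```

In the mongo shell on a surviving member, a config built this way could be passed to rs.reconfig(newCfg, { force: true }), which allows reconfiguration when no primary is reachable. The ticket does not say whether this was tried or whether it would have helped while the old primary was still sending heartbeats.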

Comment by Ramon Fernandez Marina [ 26/Sep/14 ]

xavier.vdb@gmail.com, can you please provide the details of your replica set? Number of members, types, and priorities if applicable. Also, can you upload logs for all members going at least as far back as the moment when the master became unresponsive?

Thanks,
Ramón.
