[SERVER-21062] A REMOVED node that is ahead of the other nodes in the set can prevent a primary from being elected Created: 21/Oct/15  Updated: 27/Oct/15  Resolved: 23/Oct/15

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.2.0-rc0
Fix Version/s: 3.2.0-rc1

Type: Bug Priority: Major - P3
Reporter: Timothy Olsen (Inactive) Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File SERVER-21062.tar.gz     Text File new-configsvr1.log     Text File new-configsvr2.log     Text File new-configsvr3.log     Text File original-configsvr1-post-second-restart.log     Text File original-configsvr1-pre-second-restart.log    
Issue Links:
Depends
Related
related to SERVER-21110 pv1 should not call TopologyCoordinat... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Following an upgrade of mmapv1 SCCC config servers to CSRS, I sometimes (about 60% of the time) see the new replica set get stuck without a primary after the first config server is restarted without --configsvrMode=sccc set and enters the REMOVED state. The remaining 3 replica set members stay in SECONDARY state.

This is with commit dbbc9a2e3d8c4d7fe1748fa980ba7d01b9489dbe.

rs.status():

csrs:REMOVED> rs.status()
{
	"set" : "csrs",
	"date" : ISODate("2015-10-21T21:51:22.697Z"),
	"myState" : 10,
	"term" : NumberLong(1),
	"configsvr" : true,
	"heartbeatIntervalMillis" : NumberLong(2000),
	"members" : [
		{
			"_id" : 0,
			"name" : "neurofunk.local:9007",
			"health" : 1,
			"state" : 10,
			"stateStr" : "REMOVED",
			"uptime" : 53,
			"optime" : {
				"ts" : Timestamp(1445464229, 1),
				"t" : NumberLong(1)
			},
			"optimeDate" : ISODate("2015-10-21T21:50:29Z"),
			"infoMessage" : "could not find member to sync from",
			"configVersion" : 3,
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "neurofunk.local:53836",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 52,
			"optime" : {
				"ts" : Timestamp(1445464217, 1),
				"t" : NumberLong(1)
			},
			"optimeDate" : ISODate("2015-10-21T21:50:17Z"),
			"lastHeartbeat" : ISODate("2015-10-21T21:51:22.161Z"),
			"lastHeartbeatRecv" : ISODate("2015-10-21T21:51:22.111Z"),
			"pingMs" : NumberLong(0),
			"configVersion" : 3
		},
		{
			"_id" : 2,
			"name" : "neurofunk.local:53835",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 52,
			"optime" : {
				"ts" : Timestamp(1445464217, 1),
				"t" : NumberLong(1)
			},
			"optimeDate" : ISODate("2015-10-21T21:50:17Z"),
			"lastHeartbeat" : ISODate("2015-10-21T21:51:22.161Z"),
			"lastHeartbeatRecv" : ISODate("2015-10-21T21:51:22.111Z"),
			"pingMs" : NumberLong(0),
			"configVersion" : 3
		},
		{
			"_id" : 4,
			"name" : "neurofunk.local:53834",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 52,
			"optime" : {
				"ts" : Timestamp(1445464217, 1),
				"t" : NumberLong(1)
			},
			"optimeDate" : ISODate("2015-10-21T21:50:17Z"),
			"lastHeartbeat" : ISODate("2015-10-21T21:51:22.161Z"),
			"lastHeartbeatRecv" : ISODate("2015-10-21T21:51:22.111Z"),
			"pingMs" : NumberLong(0),
			"configVersion" : 3
		}
	],
	"ok" : 1,
	"$gleStats" : {
		"lastOpTime" : Timestamp(0, 0),
		"electionId" : ObjectId("000000000000000000000000")
	}
}
csrs:REMOVED> 

I will attach logs.



 Comments   
Comment by Githook User [ 23/Oct/15 ]

Author:

{u'username': u'scotthernandez', u'name': u'Scott Hernandez', u'email': u'scotthernandez@gmail.com'}

Message: SERVER-21062: removed nodes should not delay elections
Branch: master
https://github.com/mongodb/mongo/commit/24617d5f502245efed45de78634826e3f1e36b9c

Comment by Scott Hernandez (Inactive) [ 23/Oct/15 ]

Yes, different case which we can create an issue for.

For now, removed nodes should not block elections.

Comment by Andy Schwerin [ 23/Oct/15 ]

A priority 0 node that is in state SECONDARY could become the sync source for the electable secondaries, which is different from this case, maybe.

Comment by Eric Milkie [ 23/Oct/15 ]

Good catch. We call that function as part of the pv0 priority election algorithm, as well as in getUnelectableReason() and getMyUnelectableReason(). Those latter two should only take effect under pv0 and not pv1. They should be removed for Raft.

Comment by Siyuan Zhou [ 23/Oct/15 ]

If TopologyCoordinatorImpl::_isOpTimeCloseEnoughToLatestToElect is the reason of this problem, it could happen on a node with priority 0 but having a uncommitted oplog ahead of other secondaries. If this node is the only one that has the uncommitted oplog after the previous primary crashes, it will prevents others from electing themselves.

Shall we remove this rule for protocol version 1, as it is not compatible with RAFT? Generally, protocol version 1 should not rely on the absolute timestamp.

Comment by Timothy Olsen (Inactive) [ 22/Oct/15 ]

waiting for replication works! thanks!

Comment by Timothy Olsen (Inactive) [ 22/Oct/15 ]

yes, thanks!

Comment by Spencer Brody (Inactive) [ 22/Oct/15 ]

Yes, assuming you meant to include "are the same" somewhere in that statement

Comment by Timothy Olsen (Inactive) [ 22/Oct/15 ]

spencer Would checking that the optimes reported from rs.status() across all the nodes do the trick?

Comment by Spencer Brody (Inactive) [ 22/Oct/15 ]

Thanks tim.olsen.
From the oplogs you attached we've determined that the last replset reconfig (performed against the first SCCC node which becomes the REMOVED CSRS) generates an oplog entry that wasn't replicated to the other CSRS config servers before it was shut down and restarted without the --configsvrMode=SCCC flag. Can you try waiting for replication as the last thing you do before restarting the first config server without the configsvrMode flag and see if that fixes the problem?

Comment by Timothy Olsen (Inactive) [ 22/Oct/15 ]

Ok. I had to reproduce the problem again as I don't currently have access to the original reproduction of the problem. I've attached a tarball with everything relevant for the new reproduction: rs.status(), rs.conf(), logfiles, and oplogs

Comment by Andy Schwerin [ 22/Oct/15 ]

tim.olsen , can you attach the oplogs from the 4 servers to the ticket? I'm curious if there are two problems, here.

Comment by Spencer Brody (Inactive) [ 21/Oct/15 ]

Okay, I think I know what's going on:
TopologyCoordinatorImpl::_isOpTimeCloseEnoughToLatestToElect will prevent any node from running for election if it believes there is another node that is up and more than 10 seconds ahead of it. The problem is that nodes in REMOVED state are still counted as "up". We need to exclude REMOVED nodes from this check.

Comment by Timothy Olsen (Inactive) [ 21/Oct/15 ]

Note that this was not happening with 3.1.8

Comment by Timothy Olsen (Inactive) [ 21/Oct/15 ]

Log files attached. I have tried to make the filenames of the logs self descriptive. Let me know if you have any questions.

Comment by Timothy Olsen (Inactive) [ 21/Oct/15 ]

Sure, here's the rs.conf():

csrs:REMOVED> rs.conf()
{
	"_id" : "csrs",
	"version" : 3,
	"configsvr" : true,
	"protocolVersion" : NumberLong(1),
	"members" : [
		{
			"_id" : 0,
			"host" : "neurofunk.local:9007",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 4,
			"host" : "neurofunk.local:53834",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 1,
			"host" : "neurofunk.local:53836",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 2,
			"host" : "neurofunk.local:53835",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 5000,
		"getLastErrorModes" : {
			
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		}
	}
}
csrs:REMOVED> 

Comment by Andy Schwerin [ 21/Oct/15 ]

Can you attach the replica set configuration document to the ticket, please?

Generated at Thu Feb 08 03:56:10 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.