[SERVER-12020] Removing and adding RS member fails with code 13144
Created: 09/Dec/13  Updated: 19/Feb/15  Resolved: 19/Feb/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.8
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: A. Jesse Jiryu Davis
Assignee: Unassigned
Resolution: Done
Votes: 0
Labels: elections
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL

 Description   

Remove a member from a replica set and promptly re-add it. The first attempt to re-add it fails with code 13144; the second succeeds:

rs:PRIMARY> rs.remove('localhost:27018')
2013-12-09T17:30:05.241-0500 DBClientCursor::init call() failed
2013-12-09T17:30:05.241-0500 Error: error doing query: failed at src/mongo/shell/query.js:81
2013-12-09T17:30:05.243-0500 trying reconnect to 127.0.0.1:27017
2013-12-09T17:30:05.243-0500 reconnect 127.0.0.1:27017 ok
rs:PRIMARY> var config = rs.conf()
rs:PRIMARY> config.members.push({_id: 1, host: 'localhost:27018'})
2
rs:PRIMARY> rs.reconfig(config)
{
	"errmsg" : "exception: need most members up to reconfigure, not ok : localhost:27018",
	"code" : 13144,
	"ok" : 0
}
rs:PRIMARY> rs.reconfig(config)
{ "ok" : 1 }

The primary logs:

replSet cmufcc requestHeartbeat localhost:27018 : 9001 socket exception [SEND_ERROR] server [127.0.0.1:27018]
replSet replSetReconfig exception: need most members up to reconfigure, not ok : localhost:27018

I think the offending code is in rs_initiate.cpp:98. It seems the primary still holds a cached connection to the removed member, but the member closed its side of that connection when it was removed. The first attempt to use the stale connection fails and clears the cache; the second attempt creates a new connection and succeeds.
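
Until this is fixed server-side, a script can work around it by retrying once when the reconfig fails with code 13144. The following is only a sketch: reconfigWithRetry is a hypothetical helper, not part of the shell, and it assumes the reconfig function returns a result document shaped like the output above rather than throwing.

```javascript
// Hypothetical helper (not a shell built-in): retry a reconfig that fails
// with code 13144, on the assumption that the failed attempt cleared the
// stale cached connection and the next attempt will open a fresh one.
// `reconfig` is any function that takes a config document and returns a
// result document, e.g. function (c) { return rs.reconfig(c); }.
function reconfigWithRetry(reconfig, config, maxAttempts) {
  var attempts = maxAttempts || 2;
  var result = null;
  for (var i = 0; i < attempts; i++) {
    result = reconfig(config);
    // Stop on success, or on any failure other than 13144.
    if (result.ok === 1 || result.code !== 13144) {
      return result;
    }
  }
  // Every attempt failed with 13144; hand back the last result.
  return result;
}
```

In the shell this would be called as reconfigWithRetry(function (c) { return rs.reconfig(c); }, config). Errors other than 13144 are returned immediately and are not retried.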



 Comments   
Comment by A. Jesse Jiryu Davis [ 14/Aug/14 ]

This appears to have been fixed between 2.4.8 and 2.6.3. The same sequence of commands in 2.6.3 no longer produces an error response from rs.reconfig():

repl0:PRIMARY> rs.remove('localhost:27018')
2014-08-13T22:38:26.549-0400 DBClientCursor::init call() failed
2014-08-13T22:38:26.550-0400 Error: error doing query: failed at src/mongo/shell/query.js:81
2014-08-13T22:38:26.551-0400 trying reconnect to 127.0.0.1:27017
2014-08-13T22:38:26.551-0400 reconnect 127.0.0.1:27017 ok
repl0:PRIMARY> config = rs.conf(); config.members.push({_id: 1, host: 'localhost:27018'}); rs.reconfig(config)
{ "down" : [ "localhost:27018" ], "ok" : 1 }

Although the response lists localhost:27018 as "down", the member successfully rejoins the set and becomes a secondary.
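
One way to confirm the rejoin from the shell is to look the member up in the rs.status() output. This is a minimal sketch: memberState is a hypothetical helper, and it assumes the status document carries a members array whose entries have name and stateStr fields.

```javascript
// Hypothetical helper: return the stateStr of `host` from an rs.status()
// document, or null if the host is not listed among the members.
function memberState(status, host) {
  var members = status.members || [];
  for (var i = 0; i < members.length; i++) {
    if (members[i].name === host) {
      return members[i].stateStr;
    }
  }
  return null;
}
```

For example, memberState(rs.status(), 'localhost:27018') can be polled until it reports the expected state.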

Comment by Eric Milkie [ 11/Dec/13 ]

This will most likely be fixed as part of the work for elections and reconfiguration, so we're not prepared to make changes to the codebase as it currently stands.
