[SERVER-3159] Re-adding member in replica set fails Created: 27/May/11  Updated: 12/Jul/16  Resolved: 02/Jun/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.6.5
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Pieter Ennes Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Replica Set on Ubuntu 10.04 LTS on Amazon EC2


Attachments: File mongodb-3159.log.gz    
Issue Links:
Related
is related to SERVER-4344 Re-adding former replica set member h... Closed
Operating System: Linux
Participants:

 Description   

Removing and re-adding a member of a replica set seems to work:

> rs.remove("m2:27017")
{ "ok" : 1 }

> rs.status();
{
    "set" : "wm1",
    "date" : "Fri May 27 2011 13:56:52 GMT+0000 (UTC)",
    "myState" : 1,
    "members" : [
        { "_id" : 1, "name" : "web2:27017", "health" : 1, "state" : 7, "uptime" : 2, "lastHeartbeat" : "Fri May 27 2011 13:56:52 GMT+0000 (UTC)" },
        { "_id" : 2, "name" : "web3:27017", "health" : 1, "state" : 7, "uptime" : 2, "lastHeartbeat" : "Fri May 27 2011 13:56:52 GMT+0000 (UTC)" },
        { "_id" : 3, "name" : "m5:27017", "health" : 1, "state" : 1, "self" : true },
        { "_id" : 4, "name" : "m3:27017", "health" : 1, "state" : 2, "uptime" : 2, "lastHeartbeat" : "Fri May 27 2011 13:56:52 GMT+0000 (UTC)" }
    ],
    "ok" : 1
}

> rs.add("m2:27017")
{ "ok" : 1 }

> rs.status();
{
    "set" : "wm1",
    "date" : "Fri May 27 2011 13:57:01 GMT+0000 (UTC)",
    "myState" : 1,
    "members" : [
        { "_id" : 1, "name" : "web2:27017", "health" : 1, "state" : 7, "uptime" : 11, "lastHeartbeat" : "Fri May 27 2011 13:57:00 GMT+0000 (UTC)" },
        { "_id" : 2, "name" : "web3:27017", "health" : 1, "state" : 7, "uptime" : 11, "lastHeartbeat" : "Fri May 27 2011 13:57:00 GMT+0000 (UTC)" },
        { "_id" : 3, "name" : "m5:27017", "health" : 1, "state" : 1, "self" : true },
        {
            "_id" : 4,
            "name" : "m3:27017",
            "health" : 1,
            "state" : 2,
            "uptime" : 11,
            "lastHeartbeat" : "Fri May 27 2011 13:57:00 GMT+0000 (UTC)",
            "errmsg" : "syncThread: 13106 nextSafe(): { $err: \"cursor dropped during query\", code: 13338 }"
        },
        { "_id" : 5, "name" : "m2:27017", "health" : 1, "state" : 4, "uptime" : 2, "lastHeartbeat" : "Fri May 27 2011 13:56:59 GMT+0000 (UTC)" }
    ],
    "ok" : 1
}

But the logs on the re-added node show this repeating message:

Fri May 27 14:03:14 [rs Manager] replSet error unexpected exception in haveNewConfig() : 0 assertion db/repl/rs.cpp:315
Fri May 27 14:03:14 [rs Manager] replSet error fatal, stopping replication
Fri May 27 14:03:15 [initandlisten] connection accepted from 10.254.238.86:57200 #107
Fri May 27 14:03:16 [conn107] end connection 10.254.238.86:57200
Fri May 27 14:03:16 [rs Manager] replset msgReceivedNewConfig version: version: 7
Fri May 27 14:03:17 [rs Manager] replSet info saving a newer config version to local.system.replset
Fri May 27 14:03:23 [rs Manager] replSet 0 5
Fri May 27 14:03:23 [rs Manager] Assertion failure false db/repl/rs.cpp 315
0x534a81 0x54163f 0x66a6aa 0x66e97e 0x66f5a6 0x695073 0x549473 0x547821 0x546f83 0x52b028 0x83a4b0 0x7f097a4939ca 0x7f0979a4270d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xb1) [0x534a81]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0x10f) [0x54163f]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl14initFromConfigERNS_13ReplSetConfigEb+0x17a) [0x66a6aa]
/usr/bin/mongod(_ZN5mongo7ReplSet13haveNewConfigERNS_13ReplSetConfigEb+0xfe) [0x66e97e]
/usr/bin/mongod(_ZN5mongo7Manager20msgReceivedNewConfigENS_7BSONObjE+0x256) [0x66f5a6]
/usr/bin/mongod(_ZN5boost6detail8function26void_function_obj_invoker0INS_3_bi6bind_tIvNS_4_mfi3mf1IvN5mongo7ManagerENS7_7BSONObjEEENS3_5list2INS3_5valueIPS8_EENSC_IS9_EEEEEEvE6invokeERNS1_15function_bufferE+0x63) [0x695073]
/usr/bin/mongod(_ZNK5boost9function0IvEclEv+0x243) [0x549473]
/usr/bin/mongod(_ZN5mongo4task6Server6doWorkEv+0x141) [0x547821]
/usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x33) [0x546f83]
/usr/bin/mongod(_ZN5mongo13BackgroundJob3thrEv+0x88) [0x52b028]
/usr/bin/mongod(thread_proxy+0x80) [0x83a4b0]
/lib/libpthread.so.0(+0x69ca) [0x7f097a4939ca]
/lib/libc.so.6(clone+0x6d) [0x7f0979a4270d]



 Comments   
Comment by Kristina Chodorow (Inactive) [ 02/Jun/11 ]

Good idea, I've added a section in http://www.mongodb.org/display/DOCS/Adding+a+New+Set+Member

Comment by Pieter Ennes [ 02/Jun/11 ]

Ah, that's a good tip; maybe it's worth a note in the Wiki somewhere? (Or did I overlook it there?)

We managed to add the node back after upgrading the cluster from 1.6.5 to 1.8.1. It seems the situation has changed a little in that release, for the better!

This can be closed I think, thanks.

Comment by Kristina Chodorow (Inactive) [ 01/Jun/11 ]

Perfect, I see what happened.

Each member of the set has an _id (0, 1, 2, etc.). The error you're getting is that the _id of the member you're adding has changed. It looks like you removed the member with _id : 0 and then tried to re-add it with _id : 5. This is where the command line helper's abstraction breaks down: you have to make sure that if you removed a replica set member with a certain _id, you add it back with the same _id. So, instead of rs.add("m2:27017"), you'd have to do:

rs.add({_id : 0, host : "m2:27017"})

Then you should be able to call remove/add back and forth.
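
A minimal mongo shell sketch of that sequence (illustrative only, assuming the member previously had _id 0 as described above):

// On the primary: check the member's current _id before removing it.
cfg = rs.conf()
// Suppose cfg.members contains { "_id" : 0, "host" : "m2:27017" }.

rs.remove("m2:27017")

// Re-add the host with the same _id it had before, instead of
// letting rs.add("m2:27017") assign the next free _id.
rs.add({ _id : 0, host : "m2:27017" })

Keeping the _id stable is what avoids the initFromConfig() assertion shown in the log above, since the error comes from the member's _id having changed.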

Comment by Pieter Ennes [ 29/May/11 ]

Kristina, please find it attached.

This may be relevant too: please note the change of server version between the first run (1.6.5, starting at Fri May 27 13:54:15 in the log) and the second run (1.8.1, starting at Fri May 27 14:11:33); we upgraded in an attempt to bypass the error.

Comment by Kristina Chodorow (Inactive) [ 27/May/11 ]

It's very doubtful SERVER-2981 will be backported; we try to only backport critical bug fixes.

Comment by Kristina Chodorow (Inactive) [ 27/May/11 ]

What version are you running and can you paste a bigger chunk of the logs?

Comment by Pieter Ennes [ 27/May/11 ]

Forgot to note that the re-added node was started with a clean data directory before issuing the rs.remove/add() sequence on the primary.

Comment by Pieter Ennes [ 27/May/11 ]

Doing the same thing again leads to https://jira.mongodb.org/browse/SERVER-2981:

> rs.remove("m2:27017")
{ "ok" : 1 }

> rs.add("m2:27017")
{
    "assertion" : "need most members up to reconfigure, not ok : m2:27017",
    "assertionCode" : 13144,
    "errmsg" : "db assertion failure",
    "ok" : 0
}

What happened in the first place?
Should this action not be idempotent?
Any chance SERVER-2981 will be back-ported to the stable branch?
