[SERVER-16808] rs.reconfig makes whole cluster as Secondary Created: 12/Jan/15  Updated: 14/Apr/16  Resolved: 20/Mar/15

Status: Closed
Project: Core Server
Component/s: Admin, Replication
Affects Version/s: 2.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sarbamangal Choudhury Assignee: Ramon Fernandez Marina
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Set up a cluster with 4 nodes.
Make one node hidden with a 24-hour slaveDelay.
Put a lot of load on the cluster.
Change the 24-hour delay back to 0 (see the shell sketch below).
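
For reference, a minimal mongo shell sketch of this kind of configuration change, assuming the delayed member sits at array index 5 as in the config below; the index and exact sequence are illustrative, not taken from the ticket:

// On the primary: add a 24-hour delay to a hidden, priority-0 member.
cfg = rs.conf()
cfg.members[5].priority = 0        // delayed members must not be electable
cfg.members[5].hidden = true       // keep the member invisible to clients
cfg.members[5].slaveDelay = 86400  // apply the oplog 24 hours behind the primary
rs.reconfig(cfg)

// Later, remove the delay again:
cfg = rs.conf()
cfg.members[5].slaveDelay = 0
rs.reconfig(cfg)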

Participants:

 Description   

We have the following config for our replica set:

ecset01:PRIMARY> cfg = rs.conf()
{
	"_id" : "ecset01",
	"version" : 25,
	"members" : [
		{
			"_id" : 1,
			"host" : "10.98.2.21:27017",
			"priority" : 2
		},
		{
			"_id" : 2,
			"host" : "10.98.2.22:27017",
			"priority" : 3
		},
		{
			"_id" : 3,
			"host" : "10.98.2.20:27017",
			"priority" : 4
		},
		{
			"_id" : 4,
			"host" : "10.98.141.170:27017",
			"priority" : 0,
			"hidden" : true
		},
		{
			"_id" : 5,
			"host" : "10.98.141.120:27017",
			"priority" : 0,
			"slaveDelay" : 86400,
			"hidden" : true
		},
		{
			"_id" : 6,
			"host" : "10.10.14.158:27017",
			"priority" : 0,
			"slaveDelay" : 86400,
			"hidden" : true
		}
	]
}

We try to reset slaveDelay from 86400 to 0:

cfg = rs.conf()
cfg.members[5].slaveDelay = 0
rs.reconfig(cfg)
Mon Jan 12 11:30:15.802 DBClientCursor::init call() failed
Mon Jan 12 11:30:15.804 trying reconnect to 127.0.0.1:27017
Mon Jan 12 11:30:15.804 reconnect 127.0.0.1:27017 ok
Mon Jan 12 11:30:16.007 DBClientCursor::init call() failed
Mon Jan 12 11:30:16.008 JavaScript execution failed: Error: DBClientBase::findN: transport error: 127.0.0.1:27017 ns: admin.$cmd query: { authenticate: 1, nonce: "fe555b6fcb676ba7", user: "admin", key: "a2d59cbc51cf8c61b4cb45b7f4f8db80" } at src/mongo/shell/query.js:L78
> 
Mon Jan 12 11:30:20.139 trying reconnect to 127.0.0.1:27017
Mon Jan 12 11:30:20.139 reconnect 127.0.0.1:27017 ok
ecset01:SECONDARY> 

This makes the whole cluster go Secondary. Error in the mongod log:

Mon Jan 12 11:30:15.803 [conn5988790] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.98.0.30:51319]
Mon Jan 12 11:30:15.803 [conn6128487] replSet replSetReconfig new config saved locally
Mon Jan 12 11:30:15.803 [conn6124807] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1421066044000|13 } } cursorid:6909070915536108841 ntoreturn:0 keyUpdates:0 numYields: 1 locks(micros) r:132 nreturned:1 reslen:117 1054ms
Mon Jan 12 11:30:15.803 [conn6128487] command admin.$cmd command: { replSetReconfig: { _id: "ecset01", version: 26, members: [ { _id: 1, host: "10.98.2.21:27017", priority: 2.0 }, { _id: 2, host: "10.98.2.22:27017", priority: 3.0 }, { _id: 3, host: "10.98.2.20:27017", priority: 4.0 }, { _id: 4, host: "10.98.141.170:27017", priority: 0.0, hidden: true }, { _id: 5, host: "10.98.141.120:27017", priority: 0.0, slaveDelay: 86400, hidden: true }, { _id: 6, host: "10.10.14.158:27017", priority: 0.0, slaveDelay: 0, hidden: true } ] } } ntoreturn:1 keyUpdates:0 locks(micros) W:977929 reslen:37 1065ms
Mon Jan 12 11:30:15.803 BackgroundJob starting: rsHealthPoll
Mon Jan 12 11:30:15.803 [conn6128487] Socket say send() errno:9 Bad file descriptor 127.0.0.1:35581
Mon Jan 12 11:30:15.803 [conn6121755] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1421065202000|5 } } cursorid:6905521952664444159 ntoreturn:0 keyUpdates:0 numYields: 1 locks(micros) r:141 nreturned:1 reslen:37 1018ms
Mon Jan 12 11:30:15.803 [conn6116610] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1421062683000|18 } } cursorid:6894912989969242768 ntoreturn:0 keyUpdates:0 numYields: 1 locks(micros) r:133 nreturned:1 reslen:37 1038ms
Mon Jan 12 11:30:15.803 [conn6128487] SocketException handling request, closing client connection: 9001 socket exception [2] server [127.0.0.1:35581]
Mon Jan 12 11:30:15.803 [conn6116610] Socket say send() errno:9 Bad file descriptor 10.98.141.170:56737
Mon Jan 12 11:30:15.803 [conn6124807] Socket say send() errno:9 Bad file descriptor 10.98.2.22:52034
Mon Jan 12 11:30:15.803 [conn6121755] Socket say send() errno:9 Bad file descriptor 10.98.2.21:43302
Mon Jan 12 11:30:15.803 [slaveTracking] User Assertion: 11000:E11000 duplicate key error index: local.slaves.$_id_  dup key: { : ObjectId('53aa302c71e48a32f08ebb5d') }
Mon Jan 12 11:30:15.804 [conn6124807] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.98.2.22:52034]
Mon Jan 12 11:30:15.804 [conn6121755] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.98.2.21:43302]
Mon Jan 12 11:30:15.804 [conn6116610] SocketException handling request, closing client connection: 9001 socket exception [2] server [10.98.141.170:56737]
Mon Jan 12 11:30:15.804 [slaveTracking] update local.slaves query: { _id: ObjectId('53aa302c71e48a32f08ebb5d'), config: { _id: 1, host: "10.98.2.21:27017", priority: 2.0 }, ns: "local.oplog.rs" } update: { $set: { syncedTo: Timestamp 1421080214000|28 } } nscanned:1 fastmodinsert:1 keyUpdates:0 exception: E11000 duplicate key error index: local.slaves.$_id_  dup key: { : ObjectId('53aa302c71e48a32f08ebb5d') } code:11000 locks(micros) w:416 0ms

We were able to get a Primary back after shutting down "host" : "10.10.14.158:27017". We would like to know why MongoDB behaves like this.



 Comments   
Comment by Ramon Fernandez Marina [ 02/Mar/15 ]

sarbamangal, we haven't heard back from you for a while. If this is still an issue for you, can you please send the full logs for all the servers above from startup until the problem appears?

Thanks,
Ramón.

Comment by Ramon Fernandez Marina [ 15/Jan/15 ]

I don't see the "E11000 duplicate key error" message in the logs above, so this appears to be a separate problem. Can you please send full logs for all the servers above from startup until the problem appears?

Note that other nodes may also contain a "local.slaves" collection if they were the primary node of your replica set at some point, so you may want to flush those as well.
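
For illustration only, a minimal mongo shell sketch of inspecting and dropping local.slaves on every member; the direct-connection loop is an assumption (and with auth enabled each connection would also need db.auth() first), not part of the original comment:

// Run from a shell connected to any member of the replica set.
rs.conf().members.forEach(function (m) {
    var local = new Mongo(m.host).getDB("local");    // direct connection to this member
    var slaves = local.getCollection("slaves");
    var n = slaves.count();
    if (n > 0) {
        print(m.host + ": dropping " + n + " local.slaves document(s)");
        slaves.drop();                               // mongod rebuilds the collection as needed
    } else {
        print(m.host + ": no local.slaves documents");
    }
});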

Comment by Sarbamangal Choudhury [ 13/Jan/15 ]

I did this, but we still have the same issue. It is happening on one node only.

Error on this node :

Tue Jan 13 09:35:14.919 [rsHealthPoll] couldn't connect to 10.98.2.20:27017: couldn't connect to server 10.98.2.20:27017
Tue Jan 13 09:35:14.919 BackgroundJob starting: ConnectBG
Tue Jan 13 09:35:15.575 [rsHealthPoll] replset info 10.98.2.22:27017 heartbeat failed, retrying
Tue Jan 13 09:35:15.575 BackgroundJob starting: ConnectBG
Tue Jan 13 09:35:15.758 [rsHealthPoll] couldn't connect to 10.98.2.21:27017: couldn't connect to server 10.98.2.21:27017
Tue Jan 13 09:35:15.927 [rsHealthPoll] couldn't connect to 10.98.2.20:27017: couldn't connect to server 10.98.2.20:27017
Tue Jan 13 09:35:15.927 BackgroundJob starting: ConnectBG
Tue Jan 13 09:35:16.026 [conn7440] run command admin.$cmd { replSetHeartbeat: "ecset01", v: 29, pv: 1, checkEmpty: false, from: "10.98.2.21:27017" }
Tue Jan 13 09:35:16.026 [conn7440] command admin.$cmd command: { replSetHeartbeat: "ecset01", v: 29, pv: 1, checkEmpty: false, from: "10.98.2.21:27017" } ntoreturn:1 keyUpdates:0  reslen:178 0ms
Tue Jan 13 09:35:16.098 [conn7437] run command admin.$cmd { replSetHeartbeat: "ecset01", v: 29, pv: 1, checkEmpty: false, from: "10.98.141.120:27017" }
Tue Jan 13 09:35:16.098 [conn7437] command admin.$cmd command: { replSetHeartbeat: "ecset01", v: 29, pv: 1, checkEmpty: false, from: "10.98.141.120:27017" } ntoreturn:1 keyUpdates:0  reslen:158 0ms
Tue Jan 13 09:35:16.284 [conn7441] run command admin.$cmd { replSetHeartbeat: "ecset01", v: 29, pv: 1, checkEmpty: false, from: "10.98.2.22:27017" }
Tue Jan 13 09:35:16.284 [conn7441] command admin.$cmd command: { replSetHeartbeat: "ecset01", v: 29, pv: 1, checkEmpty: false, from: "10.98.2.22:27017" } ntoreturn:1 keyUpdates:0  reslen:178 0ms
Tue Jan 13 09:35:16.593 [rsHealthPoll] couldn't connect to 10.98.2.22:27017: couldn't connect to server 10.98.2.22:27017
Tue Jan 13 09:35:16.593 BackgroundJob starting: ConnectBG
Tue Jan 13 09:35:16.795 [conn7439] run command admin.$cmd { replSetHeartbeat: "ecset01", v: 29, pv: 1, checkEmpty: false, from: "10.98.141.170:27017" }
Tue Jan 13 09:35:16.795 [conn7439] command admin.$cmd command: { replSetHeartbeat: "ecset01", v: 29, pv: 1, checkEmpty: false, from: "10.98.141.170:27017" } ntoreturn:1 keyUpdates:0  reslen:158 0ms
Tue Jan 13 09:35:16.834 [conn7438] run command admin.$cmd { replSetHeartbeat: "ecset01", v: 29, pv: 1, checkEmpty: false, from: "10.98.2.20:27017" }
Tue Jan 13 09:35:16.834 [conn7438] command admin.$cmd command: { replSetHeartbeat: "ecset01", v: 29, pv: 1, checkEmpty: false, from: "10.98.2.20:27017" } ntoreturn:1 keyUpdates:0  reslen:178 0ms
Tue Jan 13 09:35:16.940 BackgroundJob starting: ConnectBG
Tue Jan 13 09:35:17.601 [rsHealthPoll] couldn't connect to 10.98.2.22:27017: couldn't connect to server 10.98.2.22:27017
Tue Jan 13 09:35:17.601 BackgroundJob starting: ConnectBG

Comment by Ramon Fernandez Marina [ 12/Jan/15 ]

sarbamangal, please review the documentation on local.slaves for more details on what happens when the collection is dropped.

Comment by Sarbamangal Choudhury [ 12/Jan/15 ]

Here is the output of:
ecset01:PRIMARY> db.slaves.find().pretty();
{
	"_id" : ObjectId("539639a230fdaa2e0a1a6252"),
	"config" : { "_id" : 4, "host" : "10.98.141.170:27017", "priority" : 0, "hidden" : true },
	"ns" : "local.oplog.rs",
	"syncedTo" : { "t" : 1403479862, "i" : 20 }
}
{
	"_id" : ObjectId("502377d7973506627217ad2c"),
	"config" : { "_id" : 1, "host" : "10.98.2.21:27017" },
	"ns" : "local.oplog.rs",
	"syncedTo" : { "t" : 1403640972, "i" : 48 }
}
{
	"_id" : ObjectId("50237f40811578c8407c4013"),
	"config" : { "_id" : 2, "host" : "10.98.2.22:27017" },
	"ns" : "local.oplog.rs",
	"syncedTo" : { "t" : 1403434502, "i" : 6 }
}
{
	"_id" : ObjectId("53a7ffeff9dec7f137283a2c"),
	"config" : { "_id" : 4, "host" : "10.98.141.170:27017", "priority" : 0, "hidden" : true },
	"ns" : "local.oplog.rs",
	"syncedTo" : { "t" : 1403687295, "i" : 1 }
}
{
	"_id" : ObjectId("53a8d6f696ad4ac732a931d8"),
	"config" : { "_id" : 2, "host" : "10.98.2.22:27017" },
	"ns" : "local.oplog.rs",
	"syncedTo" : { "t" : 1420920928, "i" : 1 }
}
{
	"_id" : ObjectId("53aa302c71e48a32f08ebb5d"),
	"config" : { "_id" : 1, "host" : "10.98.2.21:27017" },
	"ns" : "local.oplog.rs",
	"syncedTo" : { "t" : 1403687283, "i" : 27 }
}
{
	"_id" : ObjectId("53ab0302ec35a48fd0312dfc"),
	"config" : { "_id" : 4, "host" : "10.98.141.170:27017", "priority" : 0, "hidden" : true },
	"ns" : "local.oplog.rs",
	"syncedTo" : { "t" : 1421102614, "i" : 2 }
}
{
	"_id" : ObjectId("53d6e1e4405531a76ae81db2"),
	"config" : { "_id" : 5, "host" : "10.98.141.120:27017", "priority" : 0, "slaveDelay" : 86400, "hidden" : true },
	"ns" : "local.oplog.rs",
	"syncedTo" : { "t" : 1418881657, "i" : 50 }
}
{
	"_id" : ObjectId("549478f8e2a0e87c18167a7f"),
	"config" : { "_id" : 5, "host" : "10.98.141.120:27017", "priority" : 0, "slaveDelay" : 86400, "hidden" : true },
	"ns" : "local.oplog.rs",
	"syncedTo" : { "t" : 1420804683, "i" : 18 }
}
{
	"_id" : ObjectId("54b1c8274385133608f1c691"),
	"config" : { "_id" : 6, "host" : "10.10.14.158:27017", "priority" : 0, "slaveDelay" : 86400, "hidden" : true },
	"ns" : "local.oplog.rs",
	"syncedTo" : { "t" : 1421080215, "i" : 2 }
}
ecset01:PRIMARY>

But rs.conf() shows:

ecset01:PRIMARY> rs.conf()
{
	"_id" : "ecset01",
	"version" : 26,
	"members" : [
		{ "_id" : 1, "host" : "10.98.2.21:27017", "priority" : 2 },
		{ "_id" : 2, "host" : "10.98.2.22:27017", "priority" : 3 },
		{ "_id" : 3, "host" : "10.98.2.20:27017", "priority" : 4 },
		{ "_id" : 4, "host" : "10.98.141.170:27017", "priority" : 0, "hidden" : true },
		{ "_id" : 5, "host" : "10.98.141.120:27017", "priority" : 0, "slaveDelay" : 86400, "hidden" : true },
		{ "_id" : 6, "host" : "10.10.14.158:27017", "priority" : 0, "hidden" : true }
	]
}

This is a production database. Please confirm that dropping the "slaves" collection on the Primary will not cause any issues.

Comment by Ramon Fernandez Marina [ 12/Jan/15 ]

The last line in the logs above points to SERVER-9004. Can you please drop the "slaves" collection from the local db as described here and try again?
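
For context, a minimal shell sketch of that drop on the primary, assuming nothing beyond what the ticket shows (the procedure referenced in SERVER-9004 and the documentation is authoritative):

ecset01:PRIMARY> use local
ecset01:PRIMARY> db.slaves.find().pretty()   // inspect the stale tracking entries first
ecset01:PRIMARY> db.slaves.drop()            // mongod recreates local.slaves as secondaries report progress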
