Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: 1.8.0-rc0, 1.8.0-rc1
Component/s: Admin, Replication, Usability
Environment: Ubuntu 10 64 bit, 8gig memory... too much disk to worry about
Firstly... as a new user... brilliant package... thanks. (And stupidly I posted this on the Ubuntu/mongo log as well... sorry... Monday morning syndrome.)
Now... I have 6 instances in a replica set, spread over 2 physical machines. All works fine. If I then take down one of the machines, I end up with 3 instances, all of them secondaries. This is a basic setup with default voting rights, and no arbiter.
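For reference, a rough sketch of the sort of config used to initiate the set (host names and ports as they appear in the rs.status() output below; everything else left at the defaults, so no arbiter and one vote per member):

cfg = {
    _id : "mycache",
    members : [
        { _id : 0, host : "n.n.n.1:27017" },
        { _id : 1, host : "n.n.n.2:27018" },
        { _id : 2, host : "n.n.n.3:27019" },
        { _id : 3, host : "n.n.1.1:27017" },
        { _id : 4, host : "n.n.1.2:27018" },
        { _id : 5, host : "n.n.1.3:27019" }
    ]
}
rs.initiate(cfg)    // every member gets the default single vote, no arbiter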
The result of a rs.status() is below:
mycache:SECONDARY> rs.status()
{
    "set" : "mycache",
    "date" : ISODate("2011-03-04T15:49:01Z"),
    "myState" : 2,
    "members" : [
        {
            "_id" : 0,
            "name" : "n.n.n.1:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 202,
            "optime" : ,
            "optimeDate" : ISODate("2011-03-04T14:50:55Z"),
            "lastHeartbeat" : ISODate("2011-03-04T15:49:01Z")
        },
        {
            "_id" : 1,
            "name" : "n.n.n.2:27018",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "optime" : ,
            "optimeDate" : ISODate("2011-03-04T14:50:55Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "n.n.n.3:27019",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 202,
            "optime" : ,
            "optimeDate" : ISODate("2011-03-04T14:50:55Z"),
            "lastHeartbeat" : ISODate("2011-03-04T15:49:01Z")
        },
        {
            "_id" : 3,
            "name" : "n.n.1.1:27017",
            "health" : 0,
            "state" : 2,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : ,
            "optimeDate" : ISODate("2011-03-04T14:50:55Z"),
            "lastHeartbeat" : ISODate("2011-03-04T15:46:45Z"),
            "errmsg" : "socket exception"
        },
        {
            "_id" : 4,
            "name" : "n.n.1.2:27018",
            "health" : 0,
            "state" : 1,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : ,
            "optimeDate" : ISODate("2011-03-04T14:50:55Z"),
            "lastHeartbeat" : ISODate("2011-03-04T15:46:45Z"),
            "errmsg" : "socket exception"
        },
        {
            "_id" : 5,
            "name" : "n.n.1.3:27019",
            "health" : 0,
            "state" : 2,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : ,
            "optimeDate" : ISODate("2011-03-04T14:50:55Z"),
            "lastHeartbeat" : ISODate("2011-03-04T15:46:45Z"),
            "errmsg" : "socket exception"
        }
    ],
    "ok" : 1
}
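(Side note: a quick way to count how many members this node can still see, just a shell convenience based on the output above, not part of any fix:)

var s = rs.status()
var up = s.members.filter(function (m) { return m.health == 1 })
print(up.length + " of " + s.members.length + " members reachable")    // prints "3 of 6 members reachable" here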
1. I tried reconfig, but that needs a primary, which I don't have.
2. I tried taking an instance down, freezing the other two, and bringing the third back up... it came back as a secondary.
3. I am going to try creating a new instance and setting it up as an arbiter, to see if that helps elect a primary. However, this is not a long-term solution (see 4 below).
4. If I have more than one machine taking part in a replica set, then in theory, for a resilient system, each machine would need to host an arbiter, in case another machine got taken out. With an even number of machines, that gives us an even number of arbiters, which doesn't help if they are all in play (unless I am missing something obvious... not for the first time).
If, however, we assign bitwise voting rights to each instance in a replica set (1, 2, 4, 8, 16...), then any single instance, or a whole machine, can be downed and a definite primary will still be voted in. This removes the need for an arbiter, and also gives the admins a chance to prioritise the servers taking part... but I need a primary to change the config. (A rough sketch of what I mean is below.)
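To illustrate, something along these lines is what I have in mind. Only a sketch: it assumes the config accepts per-member "votes" values greater than one, and of course rs.reconfig() has to be run against a primary, which is exactly what I don't have:

cfg = rs.conf()
cfg.members[0].votes = 1
cfg.members[1].votes = 2
cfg.members[2].votes = 4
cfg.members[3].votes = 8
cfg.members[4].votes = 16
cfg.members[5].votes = 32
rs.reconfig(cfg)    // needs a primary to accept the new config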
Thanks in advance for any help