- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 2.4.6
- Component/s: Replication
- Environment: Ubuntu 12.04 LTS
- Replication
- Fully Compatible
- ALL
- (copied to CRM)
This is a continuation of SERVER-9464, which was (apparently) not really fixed; see the original case.
This appears to be a deadlock between priority and freshness.
When running the original case's repro script, after db0 is killed (while the script waits for db1 to become primary):
db1 log is filled with:
[rsMgr] not electing self, we are not freshest
db2 log is filled with:
[rsMgr] not electing self, 127.0.0.1:30001 would veto with '127.0.0.1:30002 has lower priority than 127.0.0.1:30001'
I guess the load generator timing is important for reproducibility. If db2 replicates from db0 faster than db1 does, it always gets into this deadlock; a sketch of the configuration involved is below.
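For context, here is a minimal sketch of the kind of setup that gets into this state. The ports match the output below; the priority values and the way the set is initiated are my assumptions, not the actual script from SERVER-9464:
// Run in a mongo shell connected to 127.0.0.1:30000, assuming three mongods
// are already listening on ports 30000-30002 (priorities here are hypothetical).
cfg = {
    _id: "test",
    members: [
        { _id: 0, host: "127.0.0.1:30000", priority: 3 },  // db0, initial primary
        { _id: 1, host: "127.0.0.1:30001", priority: 2 },  // db1, expected to take over
        { _id: 2, host: "127.0.0.1:30002", priority: 1 }   // db2, lowest priority
    ]
};
rs.initiate(cfg);
// ... run the load generator against db0, then kill -9 the db0 mongod ...
// If db2 happens to replicate the last writes before db1 does, db1 refuses to
// stand for election ("we are not freshest") while db2's attempts are vetoed
// because of db1's higher priority, so the set is left without a primary.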
test:SECONDARY> rs.status()
{
    "set" : "test",
    "date" : ISODate("2013-08-26T12:43:18Z"),
    "myState" : 2,
    "members" : [
        {
            "_id" : 0,
            "name" : "127.0.0.1:30000",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : Timestamp(1377520302, 6340),
            "optimeDate" : ISODate("2013-08-26T12:31:42Z"),
            "lastHeartbeat" : ISODate("2013-08-26T12:43:17Z"),
            "lastHeartbeatRecv" : ISODate("2013-08-26T12:31:42Z"),
            "pingMs" : 0
        },
        {
            "_id" : 1,
            "name" : "127.0.0.1:30001",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 730,
            "optime" : Timestamp(1377520303, 6044),
            "optimeDate" : ISODate("2013-08-26T12:31:43Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "127.0.0.1:30002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 718,
            "optime" : Timestamp(1377520303, 6045),
            "optimeDate" : ISODate("2013-08-26T12:31:43Z"),
            "lastHeartbeat" : ISODate("2013-08-26T12:43:17Z"),
            "lastHeartbeatRecv" : ISODate("2013-08-26T12:43:17Z"),
            "pingMs" : 0,
            "syncingTo" : "127.0.0.1:30000"
        }
    ],
    "ok" : 1
}
Also, while this might be unrelated, there are the following entries in the log of db1:
[rsSyncNotifier] replset setting oplog notifier to 127.0.0.1:30000
[rsSyncNotifier] Socket flush send() errno:9 Bad file descriptor 127.0.0.1:30000
[rsSyncNotifier] caught exception (socket exception [SEND_ERROR] for 127.0.0.1:30000) in destructor (~PiggyBackData)
I'm not sure what your policy on exceptions in destructors is, of course, but this might be an indicator of an improperly cleaned-up connection.
- related to: SERVER-9464 On election, relax priority restriction when another member is fresher (Closed)