Details
Type: Bug
Status: Closed
Priority: Major - P3
Resolution: Done
Affects Version/s: 2.4.3
Environment: AMD Turion(tm) II Dual-Core Mobile M540, 4GB RAM, Ubuntu 13.04, kernel 3.5.0-28-generic
Backwards Compatibility: Fully Compatible
Operating System: ALL
Description
Hello,
I hope this is not a duplicate issue.
Let's assume there are 3 nodes in a replica set (#1, #2, #3).
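For anyone trying to reproduce this, here is a minimal sketch of the set configuration; the set name, member ids, hosts and ports are taken from the rs.status() output below, while the per-node mongod options (dbpath, port, replSet) are assumed and omitted:

// Run in the mongo shell against one of the three nodes.
// Set name "set0" and members krogon:27001-27003 match rs.status() below.
rs.initiate({
    _id: "set0",
    members: [
        { _id: 0, host: "krogon:27001" },
        { _id: 1, host: "krogon:27002" },
        { _id: 2, host: "krogon:27003" }
    ]
})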
> rs.status()
{
    "set" : "set0",
    "date" : ISODate("2013-05-23T12:48:07Z"),
    "myState" : 2,
    "syncingTo" : "krogon:27001",
    "members" : [
        {
            "_id" : 0,
            "name" : "krogon:27001",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 29,
            "optime" : {
                "t" : 1369313180,
                "i" : 2
            },
            "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
            "lastHeartbeat" : ISODate("2013-05-23T12:48:06Z"),
            "lastHeartbeatRecv" : ISODate("2013-05-23T12:48:07Z"),
            "pingMs" : 0
        },
        {
            "_id" : 1,
            "name" : "krogon:27002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 29,
            "optime" : {
                "t" : 1369313180,
                "i" : 2
            },
            "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
            "lastHeartbeat" : ISODate("2013-05-23T12:48:06Z"),
            "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
            "pingMs" : 0,
            "lastHeartbeatMessage" : "syncing to: krogon:27001",
            "syncingTo" : "krogon:27001"
        },
        {
            "_id" : 2,
            "name" : "krogon:27003",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 31,
            "optime" : {
                "t" : 1369313180,
                "i" : 2
            },
            "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
            "errmsg" : "syncing to: krogon:27001",
            "self" : true
        }
    ],
    "ok" : 1
}
When there is a network connectivity problem with the primary node (#1), an election takes place and a new primary (#3) is elected. Inserting new data then advances the optime on #2 and #3. The old primary is still alive but cannot communicate with the other nodes.
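For example (the collection name here is hypothetical; any write will do), an insert issued against the new primary #3 advances its optime past the last entry the partitioned old primary knows about:

// Run in the mongo shell connected to the new primary krogon:27003.
db.test.insert({ x: 1 })
// The optime of member _id 2 (self) moves to { "t" : 1369313742, "i" : 1 },
// while the unreachable old primary stays at { "t" : 1369313180, "i" : 2 }.
rs.status().members[2].optime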
> rs.status()
{
    "set" : "set0",
    "date" : ISODate("2013-05-23T12:55:44Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "krogon:27001",
            "health" : 0,
            "state" : 8,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : {
                "t" : 1369313180,
                "i" : 2
            },
            "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
            "lastHeartbeat" : ISODate("2013-05-23T12:55:13Z"),
            "lastHeartbeatRecv" : ISODate("2013-05-23T12:55:43Z"),
            "pingMs" : 0
        },
        {
            "_id" : 1,
            "name" : "krogon:27002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 486,
            "optime" : {
                "t" : 1369313742,
                "i" : 1
            },
            "optimeDate" : ISODate("2013-05-23T12:55:42Z"),
            "lastHeartbeat" : ISODate("2013-05-23T12:55:43Z"),
            "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
            "pingMs" : 0,
            "syncingTo" : "krogon:27003"
        },
        {
            "_id" : 2,
            "name" : "krogon:27003",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 488,
            "optime" : {
                "t" : 1369313742,
                "i" : 1
            },
            "optimeDate" : ISODate("2013-05-23T12:55:42Z"),
            "self" : true
        }
    ],
    "ok" : 1
}
Log from #1:
[rsHealthPoll] replset info krogon:27002 thinks that we are down
[rsHealthPoll] replset info krogon:27003 thinks that we are down
Log from #2:
[rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
[rsHealthPoll] replSet info krogon:27001 is down (or slow to respond):
[rsHealthPoll] replSet member krogon:27001 is now in state DOWN
[rsMgr] not electing self, krogon:27003 would veto with 'krogon:27002 is trying to elect itself but krogon:27001 is already primary and more up-to-date'
[rsMgr] not electing self, krogon:27003 would veto with 'krogon:27002 is trying to elect itself but krogon:27001 is already primary and more up-to-date'
[rsMgr] not electing self, krogon:27003 would veto with 'krogon:27002 is trying to elect itself but krogon:27001 is already primary and more up-to-date'
[rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
[rsMgr] not electing self, krogon:27003 would veto with 'krogon:27002 is trying to elect itself but krogon:27001 is already primary and more up-to-date'
[rsBackgroundSync] Socket recv() timeout 127.0.1.1:27001
[rsBackgroundSync] SocketException: remote: 127.0.1.1:27001 error: 9001 socket exception [3] server [127.0.1.1:27001]
[rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: krogon:27001
[conn9] replSet info voting yea for krogon:27003 (2)
[rsMgr] replSet not trying to elect self as responded yea to someone else recently
[rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
[rsMgr] not electing self, krogon:27003 would veto with 'I am already primary, krogon:27002 can try again once I've stepped down'
[rsHealthPoll] replSet member krogon:27003 is now in state PRIMARY
[rsBackgroundSync] replSet syncing to: krogon:27003
[rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
[rsSyncNotifier] replset setting oplog notifier to krogon:27003
[rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
[rsHealthPoll] couldn't connect to krogon:27001: couldn't connect to server krogon:27001
Logs from #3:
[rsHealthPoll] couldn't connect to krogon:27001: couldn't connect to server krogon:27001
[rsHealthPoll] replSet info krogon:27001 is down (or slow to respond):
[rsHealthPoll] replSet member krogon:27001 is now in state DOWN
[rsBackgroundSync] Socket recv() timeout 127.0.1.1:27001
[rsBackgroundSync] SocketException: remote: 127.0.1.1:27001 error: 9001 socket exception [3] server [127.0.1.1:27001]
[rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: krogon:27001
[rsMgr] replSet info electSelf 2
[rsMgr] replSet PRIMARY
[rsHealthPoll] couldn't connect to krogon:27001: couldn't connect to server krogon:27001
Now connectivity is back and another election takes place. The old primary (#1) should roll back the data that was never replicated and become a secondary. After becoming a secondary it should sync the data from the new primary (#3), which has the higher optime.
However, during this election the old primary (#1) becomes primary again. The new primary (#3), after switching to secondary, syncs from the old primary (#1), which has the older optime, and this causes data loss.
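To make the optime comparison concrete, a sketch of a check that can be run in the mongo shell on each node at the moment connectivity returns (the exact output rendering is version-dependent; the values below come from the rs.status() dumps above):

// Newest entry in the local oplog of the node this shell is connected to.
// On #2 and #3 the last op is at 1369313742:1; on the old primary #1 it is
// still 1369313180:2, so #1 is the node that should step down and roll back.
db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1)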
> rs.status()
{
    "set" : "set0",
    "date" : ISODate("2013-05-23T12:59:37Z"),
    "myState" : 2,
    "syncingTo" : "krogon:27001",
    "members" : [
        {
            "_id" : 0,
            "name" : "krogon:27001",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 59,
            "optime" : {
                "t" : 1369313180,
                "i" : 2
            },
            "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
            "lastHeartbeat" : ISODate("2013-05-23T12:59:36Z"),
            "lastHeartbeatRecv" : ISODate("2013-05-23T12:59:37Z"),
            "pingMs" : 0
        },
        {
            "_id" : 1,
            "name" : "krogon:27002",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 719,
            "optime" : {
                "t" : 1369313180,
                "i" : 2
            },
            "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
            "lastHeartbeat" : ISODate("2013-05-23T12:59:37Z"),
            "lastHeartbeatRecv" : ISODate("1970-01-01T00:00:00Z"),
            "pingMs" : 0,
            "syncingTo" : "krogon:27001"
        },
        {
            "_id" : 2,
            "name" : "krogon:27003",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 721,
            "optime" : {
                "t" : 1369313180,
                "i" : 2
            },
            "optimeDate" : ISODate("2013-05-23T12:46:20Z"),
            "errmsg" : "syncing to: krogon:27001",
            "self" : true
        }
    ],
    "ok" : 1
}
Logs from #3:
[rsHealthPoll] couldn't connect to krogon:27001: couldn't connect to server krogon:27001
[rsHealthPoll] replSet member krogon:27001 is up
[rsHealthPoll] replSet member krogon:27001 is now in state PRIMARY
[rsMgr] replSet relinquishing primary state
[rsMgr] replSet SECONDARY
[rsMgr] replSet closing client sockets after relinquishing primary
[rsBackgroundSync] replSet syncing to: krogon:27001
[rsHealthPoll] replset info krogon:27001 heartbeat failed, retrying
[conn14] SocketException handling request, closing client connection: 9001 socket exception [2] server [127.0.0.1:53226]
[rsBackgroundSync] replSet we are ahead of the sync source, will try to roll back
[rsBackgroundSync] replSet rollback 0
[rsBackgroundSync] replSet ROLLBACK
[rsBackgroundSync] replSet rollback 1
[rsBackgroundSync] replSet rollback 2 FindCommonPoint
[rsBackgroundSync] replSet info rollback our last optime: May 23 14:56:36:1
[rsBackgroundSync] replSet info rollback their last optime: May 23 14:46:20:2
[rsBackgroundSync] replSet info rollback diff in end of log times: 616 seconds
[rsBackgroundSync] replSet rollback found matching events at May 23 14:46:20:2
[rsBackgroundSync] replSet rollback findcommonpoint scanned : 142
[rsBackgroundSync] replSet replSet rollback 3 fixup
[rsBackgroundSync] replSet rollback 3.5
[rsBackgroundSync] replSet rollback 4 n:100
[rsBackgroundSync] replSet minvalid=May 23 14:46:20 519e0f9c:2
[rsBackgroundSync] replSet rollback 4.6
[rsBackgroundSync] replSet rollback 4.7
[rsBackgroundSync] replSet rollback 5 d:0 u:100
[rsBackgroundSync] replSet rollback 6
[rsBackgroundSync] replSet rollback 7
[rsBackgroundSync] replSet rollback done
[rsBackgroundSync] replSet RECOVERING
[rsBackgroundSync] replSet syncing to: krogon:27001
[rsSync] replSet SECONDARY
[rsBackgroundSync] replSet syncing to: krogon:27001
Issue Links
is duplicated by:
- SERVER-10768 add proper support for SIGSTOP and SIGCONT (currently, on replica set primary can cause data loss) (Closed)
- SERVER-8145 Two primaries for the same replica set (Closed)
- SERVER-10575 Two Primaries if Replica Set Heartbeats Fail in One Direction: "Old" Primary doesn't relinquish. (Closed)
related to:
- SERVER-9283 lastHeartbeatRecv flaps between invalid and valid dates (Closed)