Loading...

Type: Bug
Resolution: Incomplete
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.4.3
Component/s: Replication
Labels:
None
Environment:
centos 6.2

Operating System:
ALL
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

With a 3 node replicaset we had a presumed network partition cause an odd issue.

given n1, n2, n3 with n1 being the PRIMARY we saw the following in our logs:

Tue May 21 9:53:49.372 node-two [rsHealthPoll] replSet info node-one:27017 is down (or slow to respond):
Tue May 21 9:53:49.372 node-two [rsHealthPoll] replSet member node-one:27017 is now in state DOWN
Tue May 21 9:53:49.372 node-two [rsMgr] not electing self, node-three:27017 would veto with 'node-two:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 9:53:50.425 node-two [rsBackgroundSync] replSet syncing to: node-three:27017
Tue May 21 9:53:50.442 node-two [rsSyncNotifier] replset setting oplog notifier to node-three:27017
Tue May 21 9:53:50.460 node-three [rsGhostSync] handshake between 1 and node-one:27017
Tue May 21 9:53:55.378 node-two [rsMgr] not electing self, node-three:27017 would veto with 'node-two:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 9:54:01.390 node-two [rsMgr] not electing self, node-three:27017 would veto with 'node-two:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 9:54:07.394 node-two [rsMgr] not electing self, node-three:27017 would veto with 'node-two:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 9:54:13.400 node-two [rsMgr] not electing self, node-three:27017 would veto with 'node-two:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 9:54:19.407 node-two [rsMgr] not electing self, node-three:27017 would veto with 'node-two:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 9:54:25.416 node-two [rsMgr] not electing self, node-three:27017 would veto with 'node-two:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 9:54:31.421 node-two [rsMgr] not electing self, node-three:27017 would veto with 'node-two:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 9:54:37.427 node-two [rsMgr] not electing self, we are not freshest
Tue May 21 9:54:38.027 node-three [rsHealthPoll] replSet info node-one:27017 is down (or slow to respond):
Tue May 21 9:54:38.027 node-three [rsHealthPoll] replSet member node-one:27017 is now in state DOWN
Tue May 21 9:54:38.219 node-three [rsMgr] replSet info electSelf 2
Tue May 21 9:54:39.186 node-three [rsMgr] replSet PRIMARY
Tue May 21 9:54:39.326 node-one [rsHealthPoll] replSet member node-three:27017 is now in state PRIMARY
Tue May 21 9:54:39.718 node-two [rsHealthPoll] replSet member node-three:27017 is now in state PRIMARY
...
(these happened during the entire time every few seconds)
Tue May 21 14:43:04.286 node-one [rsHealthPoll] DBClientCursor::init call() failed
Tue May 21 14:43:04.286 node-one [rsHealthPoll] replset info node-one:27017 heartbeat failed, retrying
Tue May 21 14:43:04.286 node-two [rsHealthPoll] DBClientCursor::init call() failed
Tue May 21 14:43:04.286 node-two [rsHealthPoll] replset info node-one:27017 heartbeat failed, retrying
Tue May 21 14:44:57.996 node-one [rsHealthPoll] replset info node-three:27017 thinks that we are down
Tue May 21 14:44:59.919 node-one [rsHealthPoll] replset info node-two:27017 thinks that we are down
...
Tue May 21 11:46:24.143 node-two [rsHealthPoll] replSet member node-one:27017 is now in state PRIMARY
Tue May 21 11:46:24.143 node-two [rsHealthPoll] replSet member node-one:27017 is up
Tue May 21 11:46:24.143 node-two [rsMgr] replSet info two primaries (transiently)
Tue May 21 11:46:24.528 node-three [rsHealthPoll] replSet member node-one:27017 is now in state PRIMARY
Tue May 21 11:46:24.528 node-three [rsHealthPoll] replSet member node-one:27017 is up
Tue May 21 11:46:24.754 node-three [rsMgr] replSet closing client sockets after relinquishing primary
Tue May 21 11:46:24.754 node-three [rsMgr] replSet relinquishing primary state
Tue May 21 11:46:24.754 node-three [rsMgr] replSet SECONDARY
Tue May 21 11:46:24.759 node-two [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: node-three:27017
Tue May 21 11:46:24.782 node-two [rsSyncNotifier] replset tracking exception: exception: 10278 dbclient error communicating with server: node-three:27017
Tue May 21 11:46:25.726 node-three [rsBackgroundSync] replSet syncing to: node-one:27017
Tue May 21 11:46:25.729 node-three [rsBackgroundSync] replSet our last op time fetched: May 21 14:46:24:a
Tue May 21 11:46:25.729 node-three [rsBackgroundSync] replset source's GTE: May 21 14:46:25:1
Tue May 21 11:46:25.744 node-three [rsBackgroundSync] replSet ROLLBACK
Tue May 21 11:46:25.744 node-three [rsBackgroundSync] replSet rollback 0
Tue May 21 11:46:25.744 node-three [rsBackgroundSync] replSet rollback 1
Tue May 21 11:46:25.744 node-three [rsBackgroundSync] replSet rollback 2 FindCommonPoint
Tue May 21 11:46:25.745 node-three [rsBackgroundSync] replSet info rollback diff in end of log times: -1 seconds
Tue May 21 11:46:25.745 node-three [rsBackgroundSync] replSet info rollback our last optime: May 21 14:46:24:a
Tue May 21 11:46:25.745 node-three [rsBackgroundSync] replSet info rollback their last optime: May 21 14:46:25:2
Tue May 21 11:46:26.135 node-one [rsHealthPoll] replSet member node-three:27017 is now in state ROLLBACK
Tue May 21 11:46:26.171 node-two [rsHealthPoll] replSet member node-three:27017 is now in state ROLLBACK
Tue May 21 11:46:28.545 node-three [rsBackgroundSync] replSet replSet rollback 3 fixup
Tue May 21 11:46:28.545 node-three [rsBackgroundSync] replSet rollback findcommonpoint scanned : 775735
Tue May 21 11:46:28.545 node-three [rsBackgroundSync] replSet rollback found matching events at May 21 12:54:38:2
Tue May 21 11:46:34.759 node-two [rsBackgroundSync] replSet syncing to: node-one:27017
Tue May 21 11:46:34.766 node-two [rsBackgroundSync] replSet our last op time fetched: May 21 14:46:24:a
Tue May 21 11:46:34.766 node-two [rsBackgroundSync] replset source's GTE: May 21 14:46:25:1
Tue May 21 11:46:34.767 node-two [rsBackgroundSync] replSet ROLLBACK
Tue May 21 11:46:34.767 node-two [rsBackgroundSync] replSet rollback 0
Tue May 21 11:46:34.767 node-two [rsBackgroundSync] replSet rollback 1
Tue May 21 11:46:34.767 node-two [rsBackgroundSync] replSet rollback 2 FindCommonPoint
Tue May 21 11:46:34.770 node-two [rsBackgroundSync] replSet info rollback our last optime: May 21 14:46:24:a
Tue May 21 11:46:34.771 node-two [rsBackgroundSync] replSet info rollback diff in end of log times: -10 seconds
Tue May 21 11:46:34.771 node-two [rsBackgroundSync] replSet info rollback their last optime: May 21 14:46:34:4
Tue May 21 11:46:37.401 node-two [rsBackgroundSync] replSet replSet rollback 3 fixup
Tue May 21 11:46:37.401 node-two [rsBackgroundSync] replSet rollback findcommonpoint scanned : 775974
Tue May 21 11:46:37.401 node-two [rsBackgroundSync] replSet rollback found matching events at May 21 12:54:38:2
Tue May 21 11:46:45.952 node-one [rsHealthPoll] replset info node-two:27017 heartbeat failed, retrying
Tue May 21 11:46:46.953 node-one [rsHealthPoll] replSet info node-two:27017 is down (or slow to respond):
Tue May 21 11:46:46.954 node-one [rsHealthPoll] replSet member node-two:27017 is now in state DOWN
Tue May 21 11:46:47.177 node-three [rsHealthPoll] replSet info node-two:27017 is down (or slow to respond):
Tue May 21 11:46:47.177 node-three [rsHealthPoll] replSet member node-two:27017 is now in state DOWN
Tue May 21 11:48:10.964 node-one [rsHealthPoll] replset info node-two:27017 heartbeat failed, retrying
Tue May 21 11:49:27.780 node-three [rsBackgroundSync] replSet rollback 3.5
Tue May 21 11:49:27.787 node-three [rsBackgroundSync] replSet rollback 4 n:372470
Tue May 21 11:49:27.801 node-three [rsBackgroundSync] replSet minvalid=May 21 14:49:27 519bc1b7:10
Tue May 21 11:49:27.802 node-three [rsBackgroundSync] replSet rollback 4.1 coll resync lucid_prod.UserActivity
Tue May 21 11:49:27.894 node-three [rsBackgroundSync] build index lucid_prod.UserActivity

{ _id: 1 }

Tue May 21 11:49:27.895 node-three [rsBackgroundSync] build index done. scanned 0 total records. 0 secs
Tue May 21 11:49:27.908 node-three [rsBackgroundSync] replSet minvalid=May 21 14:49:27 519bc1b7:12
Tue May 21 11:49:27.908 node-three [rsBackgroundSync] replSet rollback 4.2
Tue May 21 11:49:27.911 node-three [rsBackgroundSync] replSet rollback 4.3
Tue May 21 11:49:27.911 node-three [rsBackgroundSync] replSet rollback 4.6
Tue May 21 11:49:27.911 node-three [rsBackgroundSync] replSet rollback 4.7
Tue May 21 11:49:35.355 node-two [rsBackgroundSync] replSet rollback 3.5
Tue May 21 11:49:35.363 node-two [rsBackgroundSync] replSet minvalid=May 21 14:49:35 519bc1bf:9
Tue May 21 11:49:35.363 node-two [rsBackgroundSync] replSet rollback 4 n:372470
Tue May 21 11:49:35.363 node-two [rsBackgroundSync] replSet rollback 4.1 coll resync lucid_prod.UserActivity
Tue May 21 11:49:35.367 node-two [rsSync] replSet still syncing, not yet to minValid optime 519bc1bf:9
Tue May 21 11:49:35.459 node-one [rsHealthPoll] replSet member node-two:27017 is now in state ROLLBACK
Tue May 21 11:49:35.459 node-one [rsHealthPoll] replSet member node-two:27017 is up
Tue May 21 11:49:35.459 node-three [rsHealthPoll] replSet member node-two:27017 is up
Tue May 21 11:49:35.467 node-two [rsBackgroundSync] build index lucid_prod.UserActivity

{ _id: 1 }

Tue May 21 11:49:35.468 node-two [rsBackgroundSync] build index done. scanned 0 total records. 0.001 secs
Tue May 21 11:49:35.477 node-two [rsBackgroundSync] replSet minvalid=May 21 14:49:35 519bc1bf:c
Tue May 21 11:49:35.477 node-two [rsBackgroundSync] replSet rollback 4.2
Tue May 21 11:49:35.479 node-two [rsBackgroundSync] replSet rollback 4.3
Tue May 21 11:49:35.479 node-two [rsBackgroundSync] replSet rollback 4.6
Tue May 21 11:49:35.479 node-two [rsBackgroundSync] replSet rollback 4.7
Tue May 21 11:49:38.134 node-three [rsBackgroundSync] old journal file will be removed: /mnt/apps/mongo/db/journal/j._9
Tue May 21 11:49:40.222 node-one [rsHealthPoll] replset info node-three:27017 heartbeat failed, retrying
Tue May 21 11:49:41.222 node-one [rsHealthPoll] replSet info node-three:27017 is down (or slow to respond):
Tue May 21 11:49:41.222 node-one [rsHealthPoll] replSet member node-three:27017 is now in state DOWN
Tue May 21 11:49:47.460 node-one [rsHealthPoll] replSet info node-two:27017 is down (or slow to respond):
Tue May 21 11:49:47.460 node-one [rsHealthPoll] replSet member node-two:27017 is now in state DOWN
Tue May 21 11:49:47.460 node-one [rsMgr] can't see a majority of the set, relinquishing primary
Tue May 21 11:49:47.460 node-one [rsMgr] replSet closing client sockets after relinquishing primary
Tue May 21 11:49:47.460 node-one [rsMgr] replSet relinquishing primary state
Tue May 21 11:49:47.460 node-one [rsMgr] replSet SECONDARY
Tue May 21 11:49:48.372 node-two [rsHealthPoll] replSet info node-three:27017 is down (or slow to respond):
Tue May 21 11:49:53.224 node-one [rsMgr] replSet can't see a majority, will not try to elect self
Tue May 21 11:49:54.504 node-three [rsBackgroundSync] Socket recv() errno:104 Connection reset by peer 10.10.1.234:27017
Tue May 21 11:49:54.504 node-three [rsBackgroundSync] SocketException: remote: 10.10.1.234:27017 error: 9001 socket exception [1] server [10.10.1.234:27017]
Tue May 21 11:49:54.530 node-three [rsBackgroundSync] replSet warning rollback error querying for existence of lucid_prod.UserMessageQueue at the primary, ignoring
Tue May 21 11:49:54.558 node-three [rsBackgroundSync] replSet warning rollback error querying for existence of lucid_prod.VerificationToken at the primary, ignoring
Tue May 21 11:49:54.644 node-three [rsBackgroundSync] replSet rollback 5 d:596790 u:70830
Tue May 21 11:49:59.576 node-two [rsBackgroundSync] replSet warning rollback error querying for existence of lucid_prod.UserMessageQueue at the primary, ignoring
Tue May 21 11:49:59.576 node-two [rsBackgroundSync] Socket recv() errno:104 Connection reset by peer 10.10.1.234:27017
Tue May 21 11:49:59.576 node-two [rsBackgroundSync] SocketException: remote: 10.10.1.234:27017 error: 9001 socket exception [1] server [10.10.1.234:27017]
Tue May 21 11:49:59.584 node-two [rsBackgroundSync] replSet warning rollback error querying for existence of lucid_prod.VerificationToken at the primary, ignoring
Tue May 21 11:49:59.586 node-two [rsBackgroundSync] replSet rollback 5 d:596790 u:70830
Tue May 21 11:50:29.230 node-one [rsHealthPoll] replset info node-three:27017 heartbeat failed, retrying
Tue May 21 11:50:48.995 node-three [rsBackgroundSync] replSet rollback 6
Tue May 21 11:50:51.204 node-two [rsBackgroundSync] replSet rollback 6
Tue May 21 11:50:52.301 node-three [rsBackgroundSync] replSet rollback 7
Tue May 21 11:50:54.340 node-two [rsBackgroundSync] replSet rollback 7
Tue May 21 11:51:01.111 node-three [rsBackgroundSync] replSet rollback done
Tue May 21 11:51:01.157 node-three [rsBackgroundSync] replSet RECOVERING
Tue May 21 11:51:01.330 node-three [rsSync] replSet still syncing, not yet to minValid optime 519bc1b7:12
Tue May 21 11:51:01.338 node-three [rsBackgroundSync] replSet syncing to: node-one:27017
Tue May 21 11:51:01.339 node-three [rsHealthPoll] replSet member node-two:27017 is now in state ROLLBACK
Tue May 21 11:51:01.434 node-three [rsSync] replSet still syncing, not yet to minValid optime 519bc1b7:12
Tue May 21 11:51:01.437 node-one [rsHealthPoll] replSet member node-three:27017 is now in state ROLLBACK
Tue May 21 11:51:01.437 node-one [rsHealthPoll] replSet member node-three:27017 is up
Tue May 21 11:51:01.437 node-one [rsMgr] not electing self, node-three:27017 would veto with 'node-one:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 11:51:01.508 node-three [rsSyncNotifier] replset tracking exception: exception: 9001 socket exception [SEND_ERROR] for 10.10.1.234:27017
Tue May 21 11:51:01.508 node-three [rsSyncNotifier] Socket say send() errno:9 Bad file descriptor 10.10.1.234:27017
Tue May 21 11:51:01.529 node-three [rsSync] replSet SECONDARY
Tue May 21 11:51:01.895 node-two [rsBackgroundSync] replSet rollback done
Tue May 21 11:51:01.940 node-two [rsBackgroundSync] replSet RECOVERING
Tue May 21 11:51:02.002 node-two [rsSync] replSet still syncing, not yet to minValid optime 519bc1bf:c
Tue May 21 11:51:02.003 node-two [rsHealthPoll] replSet member node-three:27017 is now in state DOWN
Tue May 21 11:51:02.017 node-two [rsBackgroundSync] replSet syncing to: node-one:27017
Tue May 21 11:51:02.033 node-one [rsHealthPoll] replSet member node-two:27017 is now in state ROLLBACK
Tue May 21 11:51:02.033 node-one [rsHealthPoll] replSet member node-two:27017 is up
Tue May 21 11:51:02.035 node-one [rsMgr] not electing self, node-three:27017 would veto with 'node-one:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 11:51:02.327 node-two [rsSync] replSet still syncing, not yet to minValid optime 519bc1bf:c
Tue May 21 11:51:02.335 node-two [rsSyncNotifier] replset setting oplog notifier to node-one:27017
Tue May 21 11:51:02.508 node-three [rsSyncNotifier] caught exception (socket exception [SEND_ERROR] for 10.10.1.234:27017) in destructor (~PiggyBackData)
Tue May 21 11:51:02.508 node-three [rsSyncNotifier] replset setting oplog notifier to node-one:27017
Tue May 21 11:51:02.508 node-three [rsSyncNotifier] Socket flush send() errno:9 Bad file descriptor 10.10.1.234:27017
Tue May 21 11:51:03.334 node-three [rsHealthPoll] replSet member node-one:27017 is now in state SECONDARY
Tue May 21 11:51:03.340 node-three [rsHealthPoll] replSet member node-two:27017 is now in state RECOVERING
Tue May 21 11:51:03.437 node-one [rsHealthPoll] replSet member node-three:27017 is now in state SECONDARY
Tue May 21 11:51:03.438 node-one [rsMgr] not electing self, node-two:27017 would veto with 'node-one:27017 is trying to elect itself but node-one:27017 is already primary and more up-to-date'
Tue May 21 11:51:04.003 node-two [rsHealthPoll] replSet member node-one:27017 is now in state SECONDARY
Tue May 21 11:51:04.017 node-two [rsHealthPoll] replSet member node-three:27017 is now in state SECONDARY
Tue May 21 11:51:04.017 node-two [rsHealthPoll] replSet member node-three:27017 is up
Tue May 21 11:51:04.033 node-one [rsHealthPoll] replSet member node-two:27017 is now in state RECOVERING
Tue May 21 11:51:04.034 node-one [rsMgr] replSet info electSelf 0
Tue May 21 11:51:04.317 node-one [rsMgr] replSet PRIMARY
Tue May 21 11:51:05.334 node-three [rsHealthPoll] replSet member node-one:27017 is now in state PRIMARY
Tue May 21 11:51:06.004 node-two [rsHealthPoll] replSet member node-one:27017 is now in state PRIMARY
Tue May 21 11:51:11.141 node-two [rsSync] replSet SECONDARY
Tue May 21 11:51:11.342 node-three [rsHealthPoll] replSet member node-two:27017 is now in state SECONDARY
Tue May 21 11:51:12.035 node-one [rsHealthPoll] replSet member node-two:27017 is now in state SECONDARY

1. Why would it roll back the majority (N2 and N3) instead of N1?
2. Why wouldn't N1 step down if it knew the other 2 thought it was down?
3. The bson files that were output only had newly created documents. What happened to any updates that happened in the rolled-back period? Are they just gone forever?

related to

SERVER-9339 rollback files missing after rollback

Closed

SERVER-9730 Asymmetrical network partition can cause the election of two PRIMARY nodes

Closed

SERVER-9283 lastHeartbeatRecv flaps between invalid and valid dates

Closed

Details

Description

Attachments

Issue Links

Activity

People

Dates