Details
Improvement
Status: Closed
Major - P3
Resolution: Duplicate
2.0.2
None
None
Linux 2.6.32-220.4.1.el6.i686 #1 SMP Mon Jan 23 17:25:22 CST 2012 i686 i686 i386 GNU/Linux
Description
We are trying to simulate a network split (partition) on a MongoDB 2.0.2
replica set consisting of three nodes. Basically, we DROP all packets
between the PRIMARY and the two SECONDARIES.
PRIMARY = lk-mm1
SECONDARY1 = lk-mm2
SECONDARY2 = lk-mm4
On primary:
iptables -A INPUT --src lk-mm4 -j DROP
iptables -A OUTPUT --dst lk-mm4 -j DROP
iptables -A INPUT --src lk-mm2 -j DROP
iptables -A OUTPUT --dst lk-mm2 -j DROP
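For anyone reproducing this, the four rules above can be generated per peer, together with the matching delete commands to heal the partition afterwards. This is only a sketch: the helper name partition_rules is made up for illustration, and it prints the commands instead of running them, so they can be reviewed before being executed as root on the primary.

```shell
#!/bin/sh
# Hypothetical helper (not part of MongoDB or iptables): print the iptables
# commands that cut this host off from each listed peer.
# First argument is the iptables action letter: A appends the DROP rules,
# D deletes the same rules again to heal the partition.
partition_rules() {
    action="$1"
    shift
    for peer in "$@"; do
        echo "iptables -$action INPUT --src $peer -j DROP"
        echo "iptables -$action OUTPUT --dst $peer -j DROP"
    done
}

# Start the partition (same rules as in the report), then heal it.
partition_rules A lk-mm4 lk-mm2
partition_rules D lk-mm4 lk-mm2
```

Piping the output to sh as root would apply the rules; printing first keeps the sketch safe to run on any machine.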
The primary server correctly steps down, and one of the secondaries
becomes primary.
The problem is that the other secondary still tries to read the oplog from
the former PRIMARY. The timeout kicks in only after roughly 15 minutes. Since we are using writeConcern=2, the replica set does not accept writes for that whole period.
Thu Feb 9 14:07:34 [rsSync] replSet syncing to: lk-mm1:27017
Thu Feb 9 14:08:12 [rsHealthPoll] DBClientCursor::init call() failed
Thu Feb 9 14:08:12 [rsHealthPoll] replSet info lk-mm1:27017 is down (or slow to respond): DBClientBase::findN: transport error: lk-mm1:27017 query: { replSetHeartbeat: "gdc", v: 3, pv: 1, checkEmpty: false, from: "lk-mm2:27017" }
Thu Feb 9 14:08:12 [rsHealthPoll] replSet member lk-mm1:27017 is now in state DOWN
Thu Feb 9 14:08:12 [rsMgr] not electing self, lk-mm4:27017 would veto
Thu Feb 9 14:08:12 [conn597] replSet info voting yea for lk-mm4:27017 (2)
Thu Feb 9 14:08:13 [rsHealthPoll] replSet member lk-mm4:27017 is now in state PRIMARY
Thu Feb 9 14:08:24 [rsHealthPoll] couldn't connect to lk-mm1: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:11:34 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:14:44 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:17:54 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:21:04 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:24:14 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:24:15 [rsSync] Socket recv() errno:110 Connection timed out 10.244.123.13:27017
Thu Feb 9 14:24:15 [rsSync] SocketException: remote: 10.244.123.13:27017 error: 9001 socket exception [1] server [10.244.123.13:27017]
Thu Feb 9 14:24:15 [rsSync] Socket flush send() errno:32 Broken pipe 10.244.123.13:27017
Thu Feb 9 14:24:15 [rsSync] caught exception (socket exception) in destructor (~PiggyBackData)
Thu Feb 9 14:24:15 [rsSync] replSet syncThread: 10278 dbclient error communicating with server: lk-mm1:27017
Thu Feb 9 14:24:26 [rsSync] replSet syncing to: lk-mm4:27017
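To put a number on the delay, the gap between the heartbeat marking lk-mm1 DOWN (14:08:12) and the rsSync socket finally timing out (14:24:15) can be computed from the log timestamps. A small sketch follows; the helper name to_seconds is made up here, and it assumes both timestamps fall on the same day.

```shell
#!/bin/sh
# Hypothetical helper: convert an HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
    old_ifs=$IFS
    IFS=:
    set -- $1            # split HH:MM:SS into $1 $2 $3
    IFS=$old_ifs
    # Strip one leading zero so values like 08 are not parsed as octal.
    echo $(( ${1#0} * 3600 + ${2#0} * 60 + ${3#0} ))
}

down_at=$(to_seconds 14:08:12)      # lk-mm1 marked DOWN by rsHealthPoll
timeout_at=$(to_seconds 14:24:15)   # rsSync recv() fails with errno 110
elapsed=$(( timeout_at - down_at ))
echo "$(( elapsed / 60 ))m $(( elapsed % 60 ))s"   # prints "16m 3s"
```

So the sync thread stayed attached to the unreachable former primary for about 16 minutes after the rest of the set had already elected lk-mm4.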
See the user group thread for more details: https://groups.google.com/group/mongodb-user/browse_thread/thread/935bdbd868d8ff1d
Attachments
Issue Links
- duplicates SERVER-4758 OplogReader has no socket timeout (Closed)