- Type: Improvement
- Resolution: Duplicate
- Priority: Major - P3
- Affects Version/s: 2.0.2
- Component/s: Networking
- Environment: Linux 2.6.32-220.4.1.el6.i686 #1 SMP Mon Jan 23 17:25:22 CST 2012 i686 i686 i386 GNU/Linux
We are trying to simulate a network split (partition) on a Mongo 2.0.2
replica set consisting of three nodes. Basically, we DROP all packets
between the PRIMARY and the two SECONDARIES.
PRIMARY = lk-mm1, SECONDARY1 = lk-mm2, SECONDARY2 = lk-mm4
On the primary:
iptables -A INPUT --src lk-mm4 -j DROP
iptables -A OUTPUT --dst lk-mm4 -j DROP
iptables -A INPUT --src lk-mm2 -j DROP
iptables -A OUTPUT --dst lk-mm2 -j DROP
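For anyone repeating this test, here is a small sketch that generates the rules above for a list of peers, plus the matching `-D` (delete) rules to heal the partition afterwards. It only builds the command strings and does not run iptables. The helper name `partition_rules` is made up for illustration and is not part of any tool.

```python
def partition_rules(peers, action="-A"):
    """Build iptables commands that DROP all traffic to and from each peer.

    action="-A" appends the rules (start the partition);
    action="-D" deletes the same rules (heal the partition).
    This is a hypothetical helper: it returns strings and executes nothing.
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    commands = []
    for peer in peers:
        commands.append(f"iptables {action} INPUT --src {peer} -j DROP")
        commands.append(f"iptables {action} OUTPUT --dst {peer} -j DROP")
    return commands


if __name__ == "__main__":
    secondaries = ["lk-mm4", "lk-mm2"]
    print("\n".join(partition_rules(secondaries, "-A")))  # split
    print("\n".join(partition_rules(secondaries, "-D")))  # heal
```

The commands would still have to be run as root on the primary (lk-mm1 here).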
The primary server correctly steps down, and one of the secondaries
becomes the new primary.
The problem is that the other secondary (lk-mm2) keeps trying to read the oplog from
the former PRIMARY. The timeout kicks in only after about 15 minutes. Since we are using writeConcern=2, the replica set effectively does not accept writes for that whole time.
Thu Feb 9 14:07:34 [rsSync] replSet syncing to: lk-mm1:27017
Thu Feb 9 14:08:12 [rsHealthPoll] DBClientCursor::init call() failed
Thu Feb 9 14:08:12 [rsHealthPoll] replSet info lk-mm1:27017 is down (or slow to respond): DBClientBase::findN: transport error: lk-mm1:27017 query: { replSetHeartbeat: "gdc", v: 3, pv: 1, checkEmpty: false, from: "lk-mm2:27017" }
Thu Feb 9 14:08:12 [rsHealthPoll] replSet member lk-mm1:27017 is now in state DOWN
Thu Feb 9 14:08:12 [rsMgr] not electing self, lk-mm4:27017 would veto
Thu Feb 9 14:08:12 [conn597] replSet info voting yea for lk-mm4:27017 (2)
Thu Feb 9 14:08:13 [rsHealthPoll] replSet member lk-mm4:27017 is now in state PRIMARY
Thu Feb 9 14:08:24 [rsHealthPoll] couldn't connect to lk-mm1: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:11:34 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:14:44 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:17:54 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:21:04 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:24:14 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb 9 14:24:15 [rsSync] Socket recv() errno:110 Connection timed out 10.244.123.13:27017
Thu Feb 9 14:24:15 [rsSync] SocketException: remote: 10.244.123.13:27017 error: 9001 socket exception [1] server [10.244.123.13:27017]
Thu Feb 9 14:24:15 [rsSync] Socket flush send() errno:32 Broken pipe 10.244.123.13:27017
Thu Feb 9 14:24:15 [rsSync] caught exception (socket exception) in destructor (~PiggyBackData)
Thu Feb 9 14:24:15 [rsSync] replSet syncThread: 10278 dbclient error communicating with server: lk-mm1:27017
Thu Feb 9 14:24:26 [rsSync] replSet syncing to: lk-mm4:27017
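In the log, rsSync only gives up on lk-mm1 when recv() fails with errno 110, which looks like the kernel-level TCP timeout rather than anything set by the application. Below is a minimal sketch of the general idea of an application-level socket timeout on a blocking read. It is plain Python against a local dummy server, not MongoDB code. The names `start_silent_server` and `read_with_timeout` are made up for this illustration, and the silent server only approximates a peer whose packets are dropped.

```python
import socket
import threading
import time


def start_silent_server():
    """Accept one connection on localhost and then never send any data."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)
    held = []  # keeps the accepted connection open for the server's lifetime

    def accept_and_hold():
        conn, _addr = srv.accept()
        held.append(conn)

    threading.Thread(target=accept_and_hold, daemon=True).start()
    return srv, held


def read_with_timeout(port, timeout_s):
    """Try one recv() with an explicit timeout; return (outcome, seconds)."""
    start = time.monotonic()
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.settimeout(timeout_s)  # without this, recv() could block indefinitely
        try:
            sock.recv(4096)
            outcome = "recv returned"
        except socket.timeout:
            outcome = "timed out"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    srv, _held = start_silent_server()
    outcome, elapsed = read_with_timeout(srv.getsockname()[1], 0.5)
    print(outcome, round(elapsed, 2))
    srv.close()
```

With a timeout like this on the sync connection, the reader would notice the dead peer after the configured interval and could pick another sync source, instead of waiting for the OS to give up.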
See the user group thread for more details: https://groups.google.com/group/mongodb-user/browse_thread/thread/935bdbd868d8ff1d
- duplicates
- SERVER-4758 OplogReader has no socket timeout
- Closed