Type: Improvement
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.0.2
Component/s: Networking
Labels: None
Environment: Linux 2.6.32-220.4.1.el6.i686 #1 SMP Mon Jan 23 17:25:22 CST 2012 i686 i686 i386 GNU/Linux

Confidence Status: None
Work Order: 3
CAR Domain/s: None

Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

We are trying to simulate a network split (partition) on a MongoDB 2.0.2
replica set consisting of three nodes. We DROP all packets
between the PRIMARY and both SECONDARIES.

PRIMARY = lk-mm1
SECONDARY1 = lk-mm2
SECONDARY2 = lk-mm4

On primary:

iptables -A INPUT --src lk-mm4 -j DROP
iptables -A OUTPUT --dst lk-mm4 -j DROP
iptables -A INPUT --src lk-mm2 -j DROP
iptables -A OUTPUT --dst lk-mm2 -j DROP
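
To heal the partition after the test, the same rules can be deleted with -D instead of -A. The following is a minimal sketch, not part of the original report: the partition_rules helper is a hypothetical name, and it only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Hypothetical helper (not from the original report): print the iptables
# commands that add (-A) or delete (-D) the DROP rules for each peer.
partition_rules() {
    action="$1"; shift            # -A to create the partition, -D to heal it
    for peer in "$@"; do
        echo "iptables $action INPUT --src $peer -j DROP"
        echo "iptables $action OUTPUT --dst $peer -j DROP"
    done
}

# Print the commands that remove the rules added on the primary above.
partition_rules -D lk-mm4 lk-mm2
```

Piping the output to sh as root (for example, `partition_rules -D lk-mm4 lk-mm2 | sh`) would apply the commands; printing first keeps the sketch safe to run on any machine.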

The primary correctly steps down, and one of the secondaries (lk-mm4)
becomes primary.

The problem is that the other secondary (lk-mm2) keeps trying to read the oplog from
the former PRIMARY. Its sync connection times out only after roughly 15 minutes. Since we are using writeConcern=2, the replica set effectively does not accept writes for that whole period.

Thu Feb  9 14:07:34 [rsSync] replSet syncing to: lk-mm1:27017
Thu Feb  9 14:08:12 [rsHealthPoll] DBClientCursor::init call() failed
Thu Feb  9 14:08:12 [rsHealthPoll] replSet info lk-mm1:27017 is down (or slow to respond): DBClientBase::findN: transport error: lk-mm1:27017 query: { replSetHeartbeat: "gdc", v: 3, pv: 1, checkEmpty: false, from: "lk-mm2:27017" }
Thu Feb  9 14:08:12 [rsHealthPoll] replSet member lk-mm1:27017 is now in state DOWN
Thu Feb  9 14:08:12 [rsMgr] not electing self, lk-mm4:27017 would veto
Thu Feb  9 14:08:12 [conn597] replSet info voting yea for lk-mm4:27017 (2)
Thu Feb  9 14:08:13 [rsHealthPoll] replSet member lk-mm4:27017 is now in state PRIMARY
Thu Feb  9 14:08:24 [rsHealthPoll] couldn't connect to lk-mm1: couldn't connect to server lk-mm1:27017
Thu Feb  9 14:11:34 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017 
Thu Feb  9 14:14:44 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb  9 14:17:54 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb  9 14:21:04 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb  9 14:24:14 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb  9 14:24:15 [rsSync] Socket recv() errno:110 Connection timed out 10.244.123.13:27017
Thu Feb  9 14:24:15 [rsSync] SocketException: remote: 10.244.123.13:27017 error: 9001 socket exception [1] server [10.244.123.13:27017]
Thu Feb  9 14:24:15 [rsSync] Socket flush send() errno:32 Broken pipe 10.244.123.13:27017
Thu Feb  9 14:24:15 [rsSync]   caught exception (socket exception) in destructor (~PiggyBackData)
Thu Feb  9 14:24:15 [rsSync] replSet syncThread: 10278 dbclient error communicating with server: lk-mm1:27017
Thu Feb  9 14:24:26 [rsSync] replSet syncing to: lk-mm4:27017
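
The delay between the partition (14:08:12) and the rsSync socket error (14:24:15) is about 16 minutes. One plausible explanation, stated here as an assumption rather than a confirmed diagnosis, is that the sync connection has no application-level timeout and only fails once the kernel gives up retransmitting unacknowledged data. The Linux kernel documentation says the default net.ipv4.tcp_retries2 value of 15 yields a hypothetical timeout of 924.6 seconds, which is a lower bound for the effective timeout. The sketch below reproduces that figure, assuming a 200 ms initial retransmission timeout that doubles up to a 120 s cap; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: estimate how long Linux keeps retransmitting before reporting
# ETIMEDOUT (errno 110), for a given tcp_retries2 value.
# Assumptions: initial RTO of 0.2 s, doubling each retry, capped at 120 s.
estimate_tcp_timeout() {
    retries="${1:-15}"
    awk -v n="$retries" 'BEGIN {
        rto = 0.2; max = 120; total = 0
        # One wait for the original send plus one per retransmission.
        for (i = 0; i <= n; i++) {
            total += rto
            rto = rto * 2
            if (rto > max) rto = max
        }
        printf "%.1f\n", total
    }'
}

estimate_tcp_timeout 15   # prints 924.6 (seconds, roughly 15.4 minutes)
```

The value in effect on a given host can be read with `sysctl net.ipv4.tcp_retries2`. Lowering it affects every TCP connection on the machine, so a socket timeout on the sync connection itself is the more targeted fix.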

See the user group thread for more details: https://groups.google.com/group/mongodb-user/browse_thread/thread/935bdbd868d8ff1d


timeout-patch.patch
0.9 kB
Feb 10 2012 11:48:46 AM UTC

duplicates

SERVER-4758 OplogReader has no socket timeout

Closed

Assignee: Kristina Chodorow (Inactive)
Reporter: Lukas Krecan
Participants: Kristina Chodorow, Lukas Krecan
Votes: 0
Watchers: 3

Created: Feb 09 2012 02:20:56 PM UTC
Updated: Feb 29 2012 03:54:02 AM UTC
Resolved: Feb 10 2012 02:43:56 PM UTC
