[SERVER-4918] Lower replica set reader timeout (or make it configurable) Created: 09/Feb/12  Updated: 29/Feb/12  Resolved: 10/Feb/12

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Lukas Krecan Assignee: Kristina Chodorow (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux 2.6.32-220.4.1.el6.i686 #1 SMP Mon Jan 23 17:25:22 CST 2012 i686 i686 i386 GNU/Linux


Attachments: timeout-patch.patch
Issue Links:
Duplicate
duplicates SERVER-4758 OplogReader has no socket timeout Closed
Participants:

 Description   

We are trying to simulate a network split (partition) on a MongoDB 2.0.2
replica set consisting of three nodes. Basically, we DROP all packets
between the PRIMARY and the SECONDARIES.

PRIMARY = lk-mm1
SECONDARY1 = lk-mm2
SECONDARY2 = lk-mm4

On primary:

iptables -A INPUT --src lk-mm4 -j DROP
iptables -A OUTPUT --dst lk-mm4 -j DROP
iptables -A INPUT --src lk-mm2 -j DROP
iptables -A OUTPUT --dst lk-mm2 -j DROP

The primary server correctly steps down, and one of the secondaries
becomes the new PRIMARY.

The problem is that the other secondary still tries to read the oplog from
the former PRIMARY. The timeout only kicks in after about 15 minutes. Since we are using writeConcern=2, the replica set does not accept writes until then.

Thu Feb  9 14:07:34 [rsSync] replSet syncing to: lk-mm1:27017
Thu Feb  9 14:08:12 [rsHealthPoll] DBClientCursor::init call() failed
Thu Feb  9 14:08:12 [rsHealthPoll] replSet info lk-mm1:27017 is down (or slow to respond): DBClientBase::findN: transport error: lk-mm1:27017 query: { replSetHeartbeat: "gdc", v: 3, pv: 1, checkEmpty: false, from: "lk-mm2:27017" }
Thu Feb  9 14:08:12 [rsHealthPoll] replSet member lk-mm1:27017 is now in state DOWN
Thu Feb  9 14:08:12 [rsMgr] not electing self, lk-mm4:27017 would veto
Thu Feb  9 14:08:12 [conn597] replSet info voting yea for lk-mm4:27017 (2)
Thu Feb  9 14:08:13 [rsHealthPoll] replSet member lk-mm4:27017 is now in state PRIMARY
Thu Feb  9 14:08:24 [rsHealthPoll] couldn't connect to lk-mm1: couldn't connect to server lk-mm1:27017
Thu Feb  9 14:11:34 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017 
Thu Feb  9 14:14:44 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb  9 14:17:54 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb  9 14:21:04 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb  9 14:24:14 [rsHealthPoll] couldn't connect to lk-mm1:27017: couldn't connect to server lk-mm1:27017
Thu Feb  9 14:24:15 [rsSync] Socket recv() errno:110 Connection timed out 10.244.123.13:27017
Thu Feb  9 14:24:15 [rsSync] SocketException: remote: 10.244.123.13:27017 error: 9001 socket exception [1] server [10.244.123.13:27017]
Thu Feb  9 14:24:15 [rsSync] Socket flush send() errno:32 Broken pipe 10.244.123.13:27017
Thu Feb  9 14:24:15 [rsSync]   caught exception (socket exception) in destructor (~PiggyBackData)
Thu Feb  9 14:24:15 [rsSync] replSet syncThread: 10278 dbclient error communicating with server: lk-mm1:27017
Thu Feb  9 14:24:26 [rsSync] replSet syncing to: lk-mm4:27017 
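
In the log above, rsSync starts syncing from lk-mm1 at 14:07:34 and only gives up at 14:24:15, when recv() fails with errno 110. The remedy requested in this ticket is a read timeout on the reader's socket. The following is a minimal sketch of that idea in Python, not the server's C++ and not the contents of the attached patch. The helper name read_with_timeout and the 0.2 second value are hypothetical and chosen only for illustration.

```python
import socket


def read_with_timeout(sock, nbytes, timeout_seconds):
    """Try to read from sock; return None if the peer stays silent too long.

    Hypothetical helper for illustration only; it is not MongoDB code.
    """
    sock.settimeout(timeout_seconds)  # bound every blocking call on this socket
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        # Without a timeout, recv() on a connection whose packets are silently
        # dropped can block until the operating system gives up on its own.
        return None


def demo():
    # A local listener that accepts the connection but never sends anything,
    # standing in for a peer whose packets are being dropped.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server_side, _ = listener.accept()
    try:
        return read_with_timeout(client, 1024, 0.2)
    finally:
        client.close()
        server_side.close()
        listener.close()
```

Calling demo() returns None after the timeout instead of blocking indefinitely. A caller could treat that result as a signal to drop the connection and look for another sync source.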

See the user group for more details: https://groups.google.com/group/mongodb-user/browse_thread/thread/935bdbd868d8ff1d



 Comments   
Comment by Lukas Krecan [ 10/Feb/12 ]

The attached patch (timeout-patch.patch) helps.

Comment by Lukas Krecan [ 09/Feb/12 ]

A better solution might be to stop reading from a server that is down and start reading from another one.
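
One way to picture this suggestion is a sync-source check that consults the heartbeat state before continuing to read. The Python sketch below is illustrative only: Member, choose_sync_source, and the use of plain state strings are assumptions modelled on the log output in the description, not MongoDB's actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Member:
    host: str
    state: str  # e.g. "PRIMARY", "SECONDARY", "DOWN", as reported by heartbeats


def choose_sync_source(current: Optional[str], members: List[Member]) -> Optional[str]:
    """Keep the current source while it is healthy, otherwise pick another member.

    Hypothetical selection rule for illustration; members excludes the local node.
    """
    healthy = [m for m in members if m.state in ("PRIMARY", "SECONDARY")]
    if any(m.host == current for m in healthy):
        return current  # current source is still reachable, keep it
    # Prefer the primary if one is visible, else fall back to any healthy member.
    for m in healthy:
        if m.state == "PRIMARY":
            return m.host
    return healthy[0].host if healthy else None
```

With the states seen by lk-mm2 after the election (lk-mm1 DOWN, lk-mm4 PRIMARY), choose_sync_source("lk-mm1:27017", ...) would return "lk-mm4:27017" as soon as the heartbeat marks lk-mm1 as down, rather than waiting for the socket error.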
