Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.0.6
Component/s: Replication
Labels:
None

Operating System:
ALL
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

I have been running some replication tests, and found something quite puzzling regarding failovers.

Looking at packet capture off the bridge interface (attached) and log files (snippets attached): After failure on network connection of Node 1 (which was master, last successful communications around packet 2414), Node 2 became master (2566 seems to be the first heartbeat with Node 2 as master). The client eventually timed out as expected. When the client attempted to perform the next update on Node 2 (packets 3271 and 3273), the write timed out. It appears that Node 2 thinks it is master, but instead of Node 3 replicating from Node 2, Node 2 is instead asking for replication updated from Node 3 (e.g. 3288). I think this is a bug.

The replication setup is a standard 3-node replica set. The nodes run in separate VirtualBox VMs with bridged networking. The network interfaces are connected to TAP interfaces on the host and connected together using a bridge device. In the test in question, a single client running on the host updates numbers in collection 'test' on the cluster one at a time. A failure is introduced by killing network connection of one of the VMs by setting packet loss on its TAP interface to 100% with netem.

Snippet from mongodb.log of Node 2 when it became master:
Wed Jul 4 16:44:37 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jul 4 16:44:37 [rsHealthPoll] replSet info test1:27017 is down (or slow to respond): DBClientBase::findN: transport error: test1:27017 query:

{ replSetHeartbeat: "test", v: 1, pv: 1, checkEmpty: false, from: "test2:27017" }

Wed Jul 4 16:44:37 [rsHealthPoll] replSet member test1:27017 is now in state DOWN
Wed Jul 4 16:44:37 [rsMgr] replSet info electSelf 2
Wed Jul 4 16:44:37 [rsMgr] replSet PRIMARY

Snippet from mongodb.log of Node 3 when it formed new cluster with Node 2:
Wed Jul 4 16:44:37 [conn190] replSet info voting yea for test2:27017 (2)
Wed Jul 4 16:44:37 [rsHealthPoll] couldn't connect to test1:27017: couldn't connect to server test1:27017
Wed Jul 4 16:44:38 [rsHealthPoll] replSet member test2:27017 is now in state PRIMARY
Wed Jul 4 16:44:38 [rsMgr] replSet info two primaries (transiently)
Wed Jul 4 16:44:44 [rsMgr] replSet info two primaries (transiently)
Wed Jul 4 16:44:47 [rsHealthPoll] replSet info test1:27017 is down (or slow to respond): socket exception
Wed Jul 4 16:44:47 [rsHealthPoll] replSet member test1:27017 is now in state DOWN

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

br0.pcap
Jul 04 2012 03:13:42 PM UTC
675 kB
Tuure Laurinolli

related to

SERVER-6733 Make oplog timeout shorter

Closed

Assignee:: Kristina Chodorow (Inactive)
Reporter:: Tuure Laurinolli
Participants:: Kristina Chodorow, Tuure Laurinolli
Votes:: 0 Vote for this issue
Watchers:: 2 Start watching this issue

Created:: Jul 04 2012 03:13:42 PM UTC
Updated:: Jul 11 2016 06:34:15 PM UTC
Resolved:: Aug 29 2012 07:13:47 PM UTC

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

PagerDuty