[SERVER-6300] Node 2 is master but attempts to replicate from Node 3 Created: 04/Jul/12  Updated: 11/Jul/16  Resolved: 29/Aug/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tuure Laurinolli Assignee: Kristina Chodorow (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File br0.pcap    
Issue Links:
Related
related to SERVER-6733 Make oplog timeout shorter Closed
Operating System: ALL
Participants:

 Description   

I have been running some replication tests, and found something quite puzzling regarding failovers.

Looking at the packet capture from the bridge interface (attached) and the log files (snippets below): after a failure of the network connection of Node 1 (which was master; last successful communication around packet 2414), Node 2 became master (packet 2566 seems to be the first heartbeat with Node 2 as master). The client eventually timed out as expected. When the client attempted to perform the next update on Node 2 (packets 3271 and 3273), the write timed out. It appears that Node 2 thinks it is master, but instead of Node 3 replicating from Node 2, Node 2 is asking for replication updates from Node 3 (e.g. packet 3288). I think this is a bug.

The replication setup is a standard 3-node replica set. The nodes run in separate VirtualBox VMs with bridged networking. The VMs' network interfaces are connected to TAP interfaces on the host, which are joined together by a bridge device. In the test in question, a single client running on the host updates numbers in the collection 'test' on the cluster, one at a time. A failure is introduced by cutting the network connection of one of the VMs: packet loss on its TAP interface is set to 100% with netem.
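For reference, a minimal sketch of the fault-injection step (the interface name and the use of Python's subprocess module are illustrative, not part of the original test harness):

import subprocess

TAP_IFACE = "tap1"  # hypothetical name of the victim VM's TAP interface on the host

def cut_network(iface):
    # Set 100% packet loss on the interface with netem.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", iface, "root", "netem", "loss", "100%"],
        check=True,
    )

def restore_network(iface):
    # Remove the netem qdisc to restore connectivity.
    subprocess.run(["tc", "qdisc", "del", "dev", iface, "root"], check=True)

cut_network(TAP_IFACE)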

Snippet from mongodb.log of Node 2 when it became master:
Wed Jul 4 16:44:37 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jul 4 16:44:37 [rsHealthPoll] replSet info test1:27017 is down (or slow to respond): DBClientBase::findN: transport error: test1:27017 query:

{ replSetHeartbeat: "test", v: 1, pv: 1, checkEmpty: false, from: "test2:27017" }

Wed Jul 4 16:44:37 [rsHealthPoll] replSet member test1:27017 is now in state DOWN
Wed Jul 4 16:44:37 [rsMgr] replSet info electSelf 2
Wed Jul 4 16:44:37 [rsMgr] replSet PRIMARY

Snippet from mongodb.log of Node 3 when it formed new cluster with Node 2:
Wed Jul 4 16:44:37 [conn190] replSet info voting yea for test2:27017 (2)
Wed Jul 4 16:44:37 [rsHealthPoll] couldn't connect to test1:27017: couldn't connect to server test1:27017
Wed Jul 4 16:44:38 [rsHealthPoll] replSet member test2:27017 is now in state PRIMARY
Wed Jul 4 16:44:38 [rsMgr] replSet info two primaries (transiently)
Wed Jul 4 16:44:44 [rsMgr] replSet info two primaries (transiently)
Wed Jul 4 16:44:47 [rsHealthPoll] replSet info test1:27017 is down (or slow to respond): socket exception
Wed Jul 4 16:44:47 [rsHealthPoll] replSet member test1:27017 is now in state DOWN



 Comments   
Comment by Kristina Chodorow (Inactive) [ 29/Aug/12 ]

Please leave a comment if you have other concerns, otherwise I'll assume discussion will move to SERVER-6733.

Comment by Kristina Chodorow (Inactive) [ 08/Aug/12 ]

The timeout was raised because it was causing spurious timeouts in tests. It was a "magic number" choice, though.

I see what you mean about SERVER-5208. Lowering the timeout might be worth revisiting post-2.2, as there have been substantial changes in the replication code that should make shorter timeouts work better. I created a ticket for this: SERVER-6733.

Comment by Tuure Laurinolli [ 01/Aug/12 ]

Revisiting this now that I have gotten back to MongoDB in my testing: the default timeout of 10 minutes seems rather high. Contrast this with the timeout of 20 seconds set as the solution to SERVER-4918. Is there actually some rationale behind having the timeout be on the order of minutes instead of seconds?

The commits in SERVER-4758 mention raising it from 60 seconds to 10 minutes so presumably there are some bad experiences behind the change. As it is, having the cluster be unwritable (with w:2) for 10 minutes after a network partition seems rather silly. I'm not sure how SERVER-5208 would help either, unless it would somehow unblock the replication connection.

Comment by Kristina Chodorow (Inactive) [ 16/Jul/12 ]

You could file a feature request to make it tunable, but it's unlikely to be accepted because MongoDB's general philosophy on the subject is "no knobs" (so it's difficult to get a new one added).

It seems reasonable to me that changes in the set would cause a re-evaluation of the replication target; I added a comment about it to SERVER-5208, which seems like the relevant issue.

Comment by Tuure Laurinolli [ 16/Jul/12 ]

Could the timeout be made tunable? Or better yet, could a change in the cluster configuration interrupt the blocking replication reads, causing them to take the updated cluster configuration into account?

Comment by Kristina Chodorow (Inactive) [ 13/Jul/12 ]

Unfortunately no, unless you recompile :-/

Comment by Tuure Laurinolli [ 13/Jul/12 ]

At 16:47 I brought Node 1 back up. Is the timeout tunable? I would like something on the order of seconds.

Comment by Kristina Chodorow (Inactive) [ 13/Jul/12 ]

Hmm, looks like we added a timeout to the replication thread in 2.0.6 (SERVER-4758) that is set for 10 minutes. Did it figure out who to sync from after that?

Comment by Tuure Laurinolli [ 11/Jul/12 ]

It appears that Node 3 never started syncing from Node 2:

Wed Jul 4 16:44:37 [conn190] replSet info voting yea for test2:27017 (2)
Wed Jul 4 16:44:37 [rsHealthPoll] couldn't connect to test1:27017: couldn't connect to server test1:27017
Wed Jul 4 16:44:38 [rsHealthPoll] replSet member test2:27017 is now in state PRIMARY
Wed Jul 4 16:44:38 [rsMgr] replSet info two primaries (transiently)
Wed Jul 4 16:44:44 [rsMgr] replSet info two primaries (transiently)
Wed Jul 4 16:44:47 [rsHealthPoll] replSet info test1:27017 is down (or slow to respond): socket exception
Wed Jul 4 16:44:47 [rsHealthPoll] replSet member test1:27017 is now in state DOWN
Wed Jul 4 16:44:50 [conn191] end connection 192.168.1.1:42628
Wed Jul 4 16:44:52 [conn190] end connection 192.168.1.2:44767
Wed Jul 4 16:44:52 [initandlisten] connection accepted from 192.168.1.2:44771 #192
Wed Jul 4 16:45:12 [initandlisten] connection accepted from 192.168.1.100:35964 #193
Wed Jul 4 16:45:22 [conn192] end connection 192.168.1.2:44771
Wed Jul 4 16:45:22 [initandlisten] connection accepted from 192.168.1.2:44774 #194
Wed Jul 4 16:45:52 [conn194] end connection 192.168.1.2:44774
Wed Jul 4 16:45:52 [initandlisten] connection accepted from 192.168.1.2:44778 #195
Wed Jul 4 16:46:22 [conn195] end connection 192.168.1.2:44778
Wed Jul 4 16:46:22 [initandlisten] connection accepted from 192.168.1.2:44781 #196
Wed Jul 4 16:46:52 [conn196] end connection 192.168.1.2:44781
Wed Jul 4 16:46:52 [initandlisten] connection accepted from 192.168.1.2:44785 #197
Wed Jul 4 16:47:00 [rsHealthPoll] replSet member test1:27017 is up
Wed Jul 4 16:47:00 [rsHealthPoll] replSet member test1:27017 is now in state SECONDARY

This is also clear from the packet capture. For example, at 16:45:01 (packet 2894) Node 2 is still trying to replicate from Node 3. The packet capture also shows TCP retransmissions of Get More requests from Node 3 to Node 1 tens of seconds after Node 1 had stopped responding (e.g. packet 3139, 16:45:22). Perhaps the replication connections do not time out for some reason?

Comment by Kristina Chodorow (Inactive) [ 11/Jul/12 ]

> How long time should it take before the replication thread notices that it is primary?

Generally less than a second. The other members will usually take longer to notice (usually less than 2 seconds but up to 20 seconds in the worst case).

> how long before the replication thread on Node 3 notices that it should be replicating
> from Node 2 (as it doesn't seem to be, if you look at the packet capture)?

Once it realizes Node 1 is down and Node 2 is primary. You should be able to see when Node 3 starts syncing by looking at its log. It'll have the DOWN message you pasted above and then (probably a few lines below?) should have a "syncing to: host:port" line when it starts syncing from Node 2.

During various failure scenarios, you may see w:2 writes time out, depending on how long your write timeout is.

Comment by Tuure Laurinolli [ 11/Jul/12 ]

How long should it take before the replication thread notices that it is primary? And more importantly, how long before the replication thread on Node 3 notices that it should be replicating from Node 2 (which it doesn't seem to be doing, judging by the packet capture)?

The issue is that writes with w:2 time out because replication from Node 2 to Node 3 does not happen.
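For concreteness, a minimal sketch of the kind of write the test client performs (pymongo; the host name, database/collection names, and the 5-second timeout are illustrative, not taken from the original test):

from pymongo import MongoClient, WriteConcern
from pymongo.errors import WTimeoutError

client = MongoClient("test2:27017", replicaSet="test")
coll = client.test.get_collection(
    "test", write_concern=WriteConcern(w=2, wtimeout=5000)  # wtimeout in ms
)

try:
    # The write must be acknowledged by two members (w=2); if replication
    # from the primary to a secondary has stalled, it fails with a timeout.
    coll.update_one({"_id": 1}, {"$inc": {"n": 1}}, upsert=True)
except WTimeoutError:
    print("w:2 write timed out")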

Comment by Kristina Chodorow (Inactive) [ 11/Jul/12 ]

This is by design. Replication runs on a separate thread from heartbeats, so there may be some delay before the replication thread notices that it is primary. However, the replication thread is careful to check that it is not primary before actually applying ops; see https://github.com/mongodb/mongo/blob/v2.0/db/repl/rs_sync.cpp#L427. (syncApply() is the function that actually applies the operation to the member.)
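A simplified illustration of that guard (plain Python, not the actual rs_sync.cpp code): the sync thread re-checks the member's state before applying each replicated op, so a member that has just been elected primary stops applying ops pulled from its old sync source.

class Member:
    def __init__(self):
        self.state = "SECONDARY"
        self.applied = []

    def sync_apply(self, op):
        # Stand-in for syncApply(): apply one oplog entry locally.
        self.applied.append(op)

    def apply_batch(self, ops):
        for op in ops:
            if self.state == "PRIMARY":
                # Became primary mid-batch: stop applying replicated ops.
                return
            self.sync_apply(op)

m = Member()
m.apply_batch([{"op": "i", "o": {"_id": 1}}])
m.state = "PRIMARY"   # election happens here
m.apply_batch([{"op": "i", "o": {"_id": 2}}])
assert m.applied == [{"op": "i", "o": {"_id": 1}}]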

Are you having an issue that you thought was caused by this or were you just concerned about the network traffic you observed?
