[SERVER-6300] Node 2 is master but attempts to replicate from Node 3 Created: 04/Jul/12 Updated: 11/Jul/16 Resolved: 29/Aug/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tuure Laurinolli | Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Issue Links: |
|
| Operating System: | ALL |
| Participants: | |
| Description |
|
I have been running some replication tests and found something quite puzzling regarding failovers. Looking at the packet capture off the bridge interface (attached) and log files (snippets attached): after a failure on the network connection of Node 1 (which was master; last successful communication around packet 2414), Node 2 became master (packet 2566 seems to be the first heartbeat with Node 2 as master). The client eventually timed out as expected. When the client attempted to perform the next update on Node 2 (packets 3271 and 3273), the write timed out. It appears that Node 2 thinks it is master, but instead of Node 3 replicating from Node 2, Node 2 is asking for replication updates from Node 3 (e.g. packet 3288). I think this is a bug.

The replication setup is a standard 3-node replica set. The nodes run in separate VirtualBox VMs with bridged networking. The network interfaces are connected to TAP interfaces on the host and joined together with a bridge device. In the test in question, a single client running on the host updates numbers in the collection 'test' on the cluster one at a time. A failure is introduced by killing the network connection of one of the VMs by setting packet loss on its TAP interface to 100% with netem.

Snippet from mongodb.log of Node 2 when it became master:

Wed Jul 4 16:44:37 [rsHealthPoll] replSet member test1:27017 is now in state DOWN

Snippet from mongodb.log of Node 3 when it formed new cluster with Node 2: |
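For reference, the kind of test client described above can be sketched as follows. This is a minimal sketch, assuming pymongo; the host names (test1/test2/test3), replica set name, collection name, and timeout values are illustrative and not taken from the original test harness.

```python
# Minimal sketch of the test client described above, using pymongo.
# Host names, replica set name, collection name, and timeouts are
# illustrative assumptions; the original harness is not attached here.
import time

from pymongo import MongoClient
from pymongo.errors import PyMongoError
from pymongo.write_concern import WriteConcern

client = MongoClient(
    "mongodb://test1:27017,test2:27017,test3:27017/?replicaSet=rs0",
    serverSelectionTimeoutMS=5000,
)

# w=2 requires acknowledgement from the primary plus one secondary, so the
# write blocks (until wtimeout) when replication to a secondary stalls.
coll = client.test.get_collection(
    "test", write_concern=WriteConcern(w=2, wtimeout=10000)
)

n = 0
while True:
    n += 1
    try:
        coll.update_one({"_id": 1}, {"$set": {"n": n}}, upsert=True)
        print("wrote", n)
    except PyMongoError as exc:
        # With the behaviour described in this ticket, writes start failing
        # here after failover, because the new primary never gets a
        # secondary to replicate from it.
        print("write", n, "failed:", exc)
    time.sleep(1)
```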
| Comments |
| Comment by Kristina Chodorow (Inactive) [ 29/Aug/12 ] |
|
Please leave a comment if you have other concerns; otherwise I'll assume discussion will move to |
| Comment by Kristina Chodorow (Inactive) [ 08/Aug/12 ] |
|
The timeout was raised because the previous, lower value was causing spurious timeouts in tests. It was a "magic number" choice, though. I see what you mean about |
| Comment by Tuure Laurinolli [ 01/Aug/12 ] |
|
Revisiting this now that I have gotten back to MongoDB in my testing: the default timeout of 10 minutes seems rather high. Contrast with the timeout of 20 seconds set as the solution to
The commits in |
| Comment by Kristina Chodorow (Inactive) [ 16/Jul/12 ] |
|
You could file a feature request to make it tunable, but it's unlikely to be accepted because MongoDB's general philosophy on the subject is "no knobs" (so it's difficult to get a new one). It seems reasonable to me that changes in the set would cause a re-evaluation of the replication target; I added a comment about it to |
| Comment by Tuure Laurinolli [ 16/Jul/12 ] |
|
Could the timeout be made tunable? Or better yet, could a change in cluster configuration interrupt the blocking replication reads, causing them to take the updated cluster configuration into account? |
| Comment by Kristina Chodorow (Inactive) [ 13/Jul/12 ] |
|
Unfortunately no, unless you recompile :-/ |
| Comment by Tuure Laurinolli [ 13/Jul/12 ] |
|
At 16:47 I brought Node 1 back up. Is the timeout tunable? I would like something on the order of seconds. |
| Comment by Kristina Chodorow (Inactive) [ 13/Jul/12 ] |
|
Hmm, looks like we added a timeout to the replication thread in 2.0.6 ( |
| Comment by Tuure Laurinolli [ 11/Jul/12 ] |
|
It appears that Node 3 never started syncing from Node 2:

Wed Jul 4 16:44:37 [conn190] replSet info voting yea for test2:27017 (2)

This is also clear from the packet capture. For example, at 16:45:01 (packet 2894) Node 2 is still trying to replicate from Node 3. The packet capture also shows TCP retransmissions of Get More requests from Node 3 to Node 1 tens of seconds after Node 1 has stopped responding (e.g. packet 3139, 16:45:22). Perhaps the replication connections do not time out for some reason? |
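To illustrate the behaviour being speculated about here (this is not MongoDB's actual networking code): a blocking read on a TCP socket with no read timeout keeps waiting while the kernel retransmits to a black-holed peer, whereas an explicit read timeout surfaces the stall within seconds.

```python
# Illustrative only: how a blocking recv() behaves with and without a read
# timeout when the peer silently drops all packets (as with 100% loss via
# netem). This is not MongoDB's networking code.
import socket

def fetch_with_timeout(host, port, timeout_s):
    sock = socket.create_connection((host, port), timeout=10)
    # timeout_s=None means "block forever"; the OS keeps retransmitting
    # for minutes before the connection is declared dead.
    sock.settimeout(timeout_s)
    sock.sendall(b"get more\n")      # stand-in for a getMore request
    try:
        return sock.recv(4096)       # hangs here if the peer is black-holed
    except socket.timeout:
        return None                  # with a timeout, the stall is detected
    finally:
        sock.close()
```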
| Comment by Kristina Chodorow (Inactive) [ 11/Jul/12 ] |
|
> How long should it take before the replication thread notices that it is primary?

Generally less than a second. The other members will usually take longer to notice (usually less than 2 seconds, but up to 20 seconds in the worst case).

> how long before the replication thread on Node 3 notices that it should be replicating

Once it realizes Node 1 is down and Node 2 is primary. You should be able to see when Node 3 starts syncing by looking at its log: it will have the DOWN message you pasted above and then (probably a few lines below) a "syncing to: host:port" line when it starts syncing from Node 2. During various failure scenarios you may have w:2 timeouts, depending on how long your timeout is. |
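A quick way to do this check is to scan the secondary's log for the DOWN and "syncing to:" lines. A minimal sketch, assuming the 2.0-era plain-text log format shown in the snippets above; the default log path here is an assumption.

```python
# Rough sketch: scan a mongod log (2.0-era plain-text format, as in the
# snippets in this ticket) and print the DOWN and "syncing to:" lines, so
# you can see how long Node 3 took to start syncing from the new primary.
# The log path and exact line formats are assumptions.
import re
import sys

DOWN_RE = re.compile(r"replSet member (\S+) is now in state DOWN")
SYNC_RE = re.compile(r"syncing to: (\S+)")

def scan(path):
    with open(path) as log:
        for line in log:
            if DOWN_RE.search(line) or SYNC_RE.search(line):
                print(line.rstrip())

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else "/var/log/mongodb/mongodb.log")
```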
| Comment by Tuure Laurinolli [ 11/Jul/12 ] |
|
How long should it take before the replication thread notices that it is primary? And more importantly, how long before the replication thread on Node 3 notices that it should be replicating from Node 2 (as it doesn't seem to be, if you look at the packet capture)? The issue is that writes with w:2 time out because replication from Node 2 to Node 3 does not happen. |
| Comment by Kristina Chodorow (Inactive) [ 11/Jul/12 ] |
|
This is by design. Replication is on a separate thread from heartbeats, so there may be some delay before the replication thread notices that it is primary. However, the replication thread is careful to check that it is not primary before actually applying ops; see https://github.com/mongodb/mongo/blob/v2.0/db/repl/rs_sync.cpp#L427. (syncApply() is the function that actually applies the operation to the member.) Are you having an issue that you thought was caused by this, or were you just concerned about the network traffic you observed? |
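Conceptually, the guard described here looks like the sketch below. This is not the actual code in rs_sync.cpp; the names (fetch_batch, is_primary, apply_op) are invented for illustration only.

```python
# Conceptual sketch of the guard described above. NOT the actual
# rs_sync.cpp code; all names here are invented for illustration.
def replication_apply_loop(node):
    while True:
        batch = node.fetch_batch()      # blocking read from the sync source
        for op in batch:
            # Re-check the role before every apply: even if the sync thread
            # lags behind the heartbeat thread's view of the world, a
            # primary must never apply ops fetched from another member.
            if node.is_primary():
                return                  # stop applying; drop the batch
            node.apply_op(op)           # analogous to syncApply()
```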