[SERVER-9682] Fatal assertion crash on replica during initial sync Created: 14/May/13 Updated: 10/Dec/14 Resolved: 06/Dec/13
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | George P. Stathis | Assignee: | Daniel Pasette (Inactive) |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | ec2 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Primary: Ubuntu 10.04 |
| Attachments: |
| Operating System: | Linux |
| Participants: | George P. Stathis, Daniel Pasette |
| Description |
We have been trying to upgrade our QA environment from 2.2.2 to 2.4.3. Historically, our upgrades have gone very smoothly, but this one is really giving us some trouble. We first disabled replication on QA and upgraded just the master to 2.4.3 to test it out. When satisfied, we upgraded the two replicas and let them catch up, but they kept failing very similarly to
We have even simplified the setup to one replica trying to resync from master so that we reduce the load on the network. Same issue. I'll be attaching the full logs from both master and the crashing secondary in a bit. Following the suggestions in the other tickets, I also ran tcpdump during the tests to capture any possible network issues. Those logs are huge so I'm trying to truncate them to the time of the crash so that I can upload them.
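For reference, a minimal sketch of the kind of capture described above, filtered to the mongod port so the files stay manageable; the interface name and output path are assumptions:

    # Capture only traffic on the default mongod port; -s 0 keeps full packets,
    # -w writes a pcap file for later analysis. eth0 and the path are placeholders.
    sudo tcpdump -i eth0 -s 0 -w /tmp/secondary-27017.pcap port 27017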
| Comments |
| Comment by George P. Stathis [ 26/Jun/13 ] |
Just reading this now. Thanks Dan, I'll give it a shot and post back here with our findings.
| Comment by Daniel Pasette (Inactive) [ 10/Jun/13 ] |
Hi George, I looked a little further and I think I see what's happening, and can explain the difference between 2.2 and 2.4. The short answer is that a third node (either an arbiter or a data node) should actually help keep the replica set stable during initial sync.

Longer answer: the initial sync appears to be putting enough load on the secondary that the primary thinks the secondary is down. Once one of the two nodes is considered down, a two-node replica set has no possible majority, so the primary steps down; that automatically severs the initial sync connection, which forces the initial sync to restart from scratch. You can see all the times it restarted; sometimes it got 2 hours in, sometimes it only made it 3 minutes:
There were some changes in 2.4 which should help improve the stability of the set as a whole and should prevent flapping (what you're seeing here): heartbeat connections are retried, and replica set state is checked symmetrically; that is, each node checks not only that its own heartbeats are being answered, but also whether it has received a heartbeat from the other node, before marking that node as down. There were also some changes in 2.4 to make replica set failover faster. This is a good thing, but in a two-node replica set it can make the set overly sensitive to network blips. There is a lot more work to be done on making initial sync tolerant of this kind of failure, but I'd be interested to see if you could retry this with an arbiter or a third set member.
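For reference, a minimal sketch of adding an arbiter from the primary's side; the arbiter hostname, data path, and set name are placeholders, and the arbiter's mongod must be started with the same --replSet name as the rest of the set:

    # On the arbiter host: start a mongod for the arbiter (placeholders throughout).
    mongod --replSet rs0 --port 27017 --dbpath /data/arbiter --fork --logpath /var/log/mongodb/arbiter.log

    # From a mongo shell connected to the current primary: add it as an arbiter,
    # then confirm all members report a healthy state.
    mongo --eval 'rs.addArb("arbiter1.example.com:27017")'
    mongo --eval 'printjson(rs.status())'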
| Comment by George P. Stathis [ 28/May/13 ] |
Hi Dan, Regarding the read-ahead, we have since corrected this and set it to 16K as recommended in http://docs.mongodb.org/manual/administration/production-notes/#readahead and http://www.10gen.com/blog/post/provisioned-iops-aws-marketplace-significantly-boosts-mongodb-performance-ease-use .

Regarding the arbiter, we were originally running this RS with three members: one master, one secondary and one non-eligible box for backups. We were seeing the same issue with all three nodes up, so we decided to remove the non-eligible box in an attempt to alleviate any assumed network load. This didn't help either. Given this, do you still think we should try an arbiter with no data? Interestingly, we have reverted all RS members back to version 2.2.2 and performed a full resync without a problem: this issue does not occur on 2.2.2 for us. Same boxes, same network, same RS config. This issue only appeared when we updated to 2.4.3.
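For reference, a minimal sketch of how a 16 KB readahead is typically checked and applied on Linux; the device name is an assumption, and blockdev expresses the value in 512-byte sectors, so 32 sectors = 16 KB:

    # Show the current readahead for the data volume (value is in 512-byte sectors).
    sudo blockdev --getra /dev/xvdf

    # Set it to 32 sectors (32 * 512 B = 16 KB). This does not persist across
    # reboots unless re-applied at boot (e.g. from an init script).
    sudo blockdev --setra 32 /dev/xvdf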
| Comment by Daniel Pasette (Inactive) [ 28/May/13 ] |
Hi George, The first thing I notice is a warning telling you to turn your readahead setting down:
This is probably not your current issue, but I wanted to mention it. Next, I see lots of network errors throughout the sync process:
Finally, it tried to connect 10 times and eventually gave up, causing it to fassert and abort the replication attempt (see the "0 attempts remaining" log line):
On the primary, at the time of the failure, it looks like you are building an index in the foreground at about the same time as the final failure to connect. This shouldn't cause an initial sync failure, but it definitely adds load to your primary. Have you tried adding an arbiter to your cluster to cut down on the flapping you're seeing?
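As an illustration, a minimal sketch of building an index in the background instead of the foreground, so the primary stays more responsive during the build; the database, collection, and field names are placeholders, and note that in 2.4 a background build on the primary is still replicated as a foreground build on secondaries:

    # Build the index with the background option from the mongo shell.
    mongo mydb --eval 'db.mycoll.ensureIndex({ someField: 1 }, { background: true })'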
| Comment by George P. Stathis [ 28/May/13 ] |
Hey guys, has anyone had a chance to take a look at this? Still no luck on our side here.
| Comment by George P. Stathis [ 14/May/13 ] |
Also:
which we understand is the recommended setting for EC2: http://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
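For reference, one quick way to review the readahead configured on each block device (a sketch; the RA column is reported in 512-byte sectors):

    # List all block devices with their readahead (RA) settings.
    sudo blockdev --report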
| Comment by George P. Stathis [ 14/May/13 ] |
Adding the last 9 minutes or so of the tcpdump activity on port 27017 on the secondary before the crash occurs.
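By way of example, a minimal sketch of one way to trim a large capture down to the window just before the crash, assuming the Wireshark editcap utility is available; the timestamps and file names are placeholders:

    # Keep only packets between the two timestamps (local time, YYYY-MM-DD HH:MM:SS).
    editcap -A "2013-05-14 10:51:00" -B "2013-05-14 11:00:00" full-capture.pcap last-9-minutes.pcap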