[SERVER-9682] Fatal assertion crash on replica during initial sync Created: 14/May/13  Updated: 10/Dec/14  Resolved: 06/Dec/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: George P. Stathis Assignee: Daniel Pasette (Inactive)
Resolution: Incomplete Votes: 0
Labels: ec2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Primary ubuntu 10.04
Secondary ubuntu 12.04
AWS


Attachments: File mongodb.logs.tar.gz     File tcpdump.secondary.tar.gz    
Operating System: Linux
Participants:

 Description   

We have been trying to upgrade our QA environment from 2.2.2 to 2.4.3. Historically, our upgrades have gone very smoothly, but this one is really giving us trouble. We first disabled replication on QA and upgraded just the master to 2.4.3 to test it out. When we were satisfied, we upgraded the two replicas and let them catch up, but they kept failing very similarly to SERVER-6975. So we simplified things and attempted a full resync instead. What we observe now is the replicas losing connectivity with the master several times during the resync process and eventually crashing in the same manner as reported in SERVER-9057 and SERVER-9199:

 
Mon May 13 19:07:45.614 [rsHealthPoll] replSet member ip-10-71-26-90.ec2.internal:27017 is now in state SECONDARY
Mon May 13 19:07:45.700 [initandlisten] connection accepted from 10.71.26.90:50732 #2175 (28 connections now open)
Mon May 13 19:07:45.703 [conn2175]  authenticate db: local { authenticate: 1, nonce: "779ae9aba7a706e4", user: "__system", key: "e7f4453059498c244523f2794a04c216" }
Mon May 13 19:07:45.705 [conn2175] replSet info voting yea for ip-10-71-26-90.ec2.internal:27017 (0)
Mon May 13 19:07:47.617 [rsHealthPoll] replSet member ip-10-71-26-90.ec2.internal:27017 is now in state PRIMARY
Mon May 13 19:07:59.673 [initandlisten] connection accepted from 10.71.26.90:50738 #2176 (29 connections now open)
Mon May 13 19:07:59.677 [conn2176]  authenticate db: admin { authenticate: 1, user: "mongo", nonce: "8f94b9003dc3c980", key: "ad59ed669b28d961e6db34033c785980" }
Mon May 13 19:07:59.691 [conn2176] end connection 10.71.26.90:50738 (28 connections now open)
Mon May 13 19:08:13.751 [conn2175] end connection 10.71.26.90:50732 (27 connections now open)
Mon May 13 19:08:13.753 [initandlisten] connection accepted from 10.71.26.90:50749 #2177 (28 connections now open)
Mon May 13 19:08:13.755 [conn2177]  authenticate db: local { authenticate: 1, nonce: "f4fb648338aebbc9", user: "__system", key: "77d24c9ffe7f0e3fe2749d1c042562d0" }
Mon May 13 19:08:15.606 [rsSync]   Fatal Assertion 16233
0xdcf361 0xd8f0d3 0xc03b0f 0xc21811 0xc218ad 0xc21b7c 0xe17cb9 0x7f313696ce9a 0x7f3135c7fcbd
 /home/mongo/mongodb/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8f0d3]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc03b0f]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc21811]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc218ad]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc21b7c]
 /home/mongo/mongodb/bin/mongod() [0xe17cb9]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f313696ce9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3135c7fcbd]
Mon May 13 19:08:15.667 [rsSync]
 
***aborting after fassert() failure
 
 
Mon May 13 19:08:15.667 Got signal: 6 (Aborted).
 
Mon May 13 19:08:15.671 Backtrace:
0xdcf361 0x6cf729 0x7f3135bc24a0 0x7f3135bc2425 0x7f3135bc5b8b 0xd8f10e 0xc03b0f 0xc21811 0xc218ad 0xc21b7c 0xe17cb9 0x7f313696ce9a 0x7f3135c7fcbd
 /home/mongo/mongodb/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6cf729]
 /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f3135bc24a0]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f3135bc2425]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f3135bc5b8b]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd8f10e]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc03b0f]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc21811]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc218ad]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc21b7c]
 /home/mongo/mongodb/bin/mongod() [0xe17cb9]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f313696ce9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3135c7fcbd]

We have even simplified the setup to a single replica resyncing from the master in order to reduce the load on the network. Same issue. I'll be attaching the full logs from both the master and the crashing secondary shortly. Following the suggestions in the other tickets, I also ran tcpdump during the tests to capture any possible network issues. Those captures are huge, so I'm trying to truncate them to the time of the crash so that I can upload them.
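
For reference, the full resync here is the standard wipe-and-restart procedure on the secondary; a rough sketch (the service name is illustrative, and the dbpath is assumed to live under /vol/mongo as the readahead warning below suggests):

$ sudo service mongodb stop        # or however mongod is managed on this box
$ sudo rm -rf /vol/mongo/*         # clear the dbpath so the node initial-syncs from scratch
$ sudo service mongodb start       # the node rejoins the set and begins initial sync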



 Comments   
Comment by George P. Stathis [ 26/Jun/13 ]

Just reading this now. Thanks Dan, I'll give it a shot and post back here with our findings.

Comment by Daniel Pasette (Inactive) [ 10/Jun/13 ]

Hi George,

I looked a little further and I think I see what's happening, and I can explain the difference between 2.2 and 2.4. The short answer is that a third node (either an arbiter or a data-bearing node) should help keep the replica set stable during initial sync.

Longer answer: the initial sync appears to be putting enough load on the secondary that the primary concludes the secondary is down. With one member apparently down in a two-node replica set, there is no possible majority, so the primary steps down. Stepping down automatically severs the initial sync connection, which forces the initial sync to restart from scratch.

You can see all the times it restarted. Sometimes it got 2 hours in, sometimes it only made it 3 minutes:

$ grep 'replSet initial sync exception' mongodb.secondary.log
Mon May 13 11:49:11.227 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 9 attempts remaining
Mon May 13 12:27:00.427 [rsSync] replSet initial sync exception: 13386 socket error for mapping query 8 attempts remaining
Mon May 13 14:03:00.380 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 7 attempts remaining
Mon May 13 15:23:40.588 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 6 attempts remaining
Mon May 13 17:23:06.253 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 5 attempts remaining
Mon May 13 17:50:18.909 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 4 attempts remaining
Mon May 13 18:07:25.952 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 3 attempts remaining
Mon May 13 18:10:10.334 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 2 attempts remaining
Mon May 13 18:41:05.615 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 1 attempts remaining
Mon May 13 19:07:45.606 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 0 attempts remaining

There were some changes in 2.4 that should improve the stability of the set as a whole and prevent the flapping you're seeing here. Heartbeat connections are retried, and replica set state is checked symmetrically; that is, before marking another node as down, each node checks not only whether its own heartbeats are being answered but also whether it has stopped receiving heartbeats from that node.
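
One way to keep an eye on that flapping from either box is to poll the member states and heartbeat timestamps reported by rs.status(); a rough sketch (assumes you can authenticate from the local mongo shell):

$ mongo --quiet --eval 'printjson(rs.status())' | grep -E '"name"|"stateStr"|"lastHeartbeat"'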

There were also some changes to make replica set failover faster in 2.4. This is a good thing, but in a 2 node replica set, it can make the set overly sensitive to network blips.

There is a lot more work to be done on making initial sync tolerant of this kind of failure, but I'd be interested if you could retry this with an arbiter or 3rd set member.
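
For reference, adding an arbiter to the existing two-node set would look roughly like this; the replica set name, arbiter host name, and paths are placeholders:

$ mongod --replSet <rsname> --port 27017 --dbpath /data/arb --keyFile /etc/mongodb.key --fork --logpath /var/log/mongodb/arbiter.log

then, from a mongo shell connected to the current primary:

> rs.addArb("arbiter.example.internal:27017")   # placeholder host name
> rs.conf()                                     # verify the arbiterOnly member was added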

Comment by George P. Stathis [ 28/May/13 ]

Hi Dan,

Regarding the readahead, we have since corrected this and set it to 16KB, as recommended in http://docs.mongodb.org/manual/administration/production-notes/#readahead and http://www.10gen.com/blog/post/provisioned-iops-aws-marketplace-significantly-boosts-mongodb-performance-ease-use.
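
For reference, that change boils down to something like the following on the device backing /vol/mongo (the device name is a placeholder; blockdev takes the value in 512-byte sectors, so 32 sectors = 16KB):

$ sudo blockdev --getra /dev/xvdf     # show current readahead, in 512-byte sectors
$ sudo blockdev --setra 32 /dev/xvdf  # 32 sectors x 512 bytes = 16KB

Note that blockdev --setra does not persist across reboots on its own.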

Regarding the arbiter, we were originally running this replica set with three members: one master, one secondary, and one non-eligible box for backups. We were seeing the same issue with all three nodes up, so we removed the non-eligible box in an attempt to reduce any assumed network load. That didn't help either. Given that, do you still think we should try an arbiter with no data?

Interestingly, we have since reverted all replica set members back to version 2.2.2 and performed a full resync without a problem: this issue does not occur on 2.2.2 for us. Same boxes, same network, same replica set config. It only appeared when we upgraded to 2.4.3.

Comment by Daniel Pasette (Inactive) [ 28/May/13 ]

Hi George,
Sorry to hear of your troubles syncing. This ticket was accidentally filed as "Debugging with Submitter" but without an assignee, which has caused it to languish.

The first thing I notice is the warning telling you to turn your readahead setting down:

Mon May 13 09:44:55.390 [initandlisten] ** WARNING: Readahead for /vol/mongo is set to 4096KB
Mon May 13 09:44:55.390 [initandlisten] **          We suggest setting it to 256KB (512 sectors) or less
Mon May 13 09:44:55.390 [initandlisten] **          http://dochub.mongodb.org/core/readahead

This is probably not your current issue, but I wanted to mention it.

Next, I see lots of network errors throughout the sync process:

Mon May 13 12:27:00.426 [rsSync] Socket recv() errno:104 Connection reset by peer 10.71.26.90:27017
Mon May 13 12:27:00.426 [rsSync] SocketException: remote: 10.71.26.90:27017 error: 9001 socket exception [1] server [10.71.26.90:27017]
Mon May 13 12:27:00.427 [rsSync] DBClientCursor::init call() failed
Mon May 13 12:27:00.427 [rsSync] replSet initial sync exception: 13386 socket error for mapping query 8 attempts remaining

Then finally, it exhausted its 10 retry attempts and gave up, causing it to fassert and abort the initial sync (see the "0 attempts remaining" log line):

Mon May 13 19:07:45.606 [rsSync] Socket flush send() errno:9 Bad file descriptor 10.71.26.90:27017
Mon May 13 19:07:45.606 [rsSync]   caught exception (socket exception [SEND_ERROR] for 10.71.26.90:27017) in destructor (~PiggyBackData)
Mon May 13 19:07:45.606 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 0 attempts remaining
Mon May 13 19:07:45.614 [rsHealthPoll] replSet member ip-10-71-26-90.ec2.internal:27017 is up
Mon May 13 19:07:45.614 [rsHealthPoll] replSet member ip-10-71-26-90.ec2.internal:27017 is now in state SECONDARY
Mon May 13 19:07:45.700 [initandlisten] connection accepted from 10.71.26.90:50732 #2175 (28 connections now open)
Mon May 13 19:07:45.703 [conn2175]  authenticate db: local { authenticate: 1, nonce: "779ae9aba7a706e4", user: "__system", key: "e7f4453059498c244523f2794a04c216" }
Mon May 13 19:07:45.705 [conn2175] replSet info voting yea for ip-10-71-26-90.ec2.internal:27017 (0)
Mon May 13 19:07:47.617 [rsHealthPoll] replSet member ip-10-71-26-90.ec2.internal:27017 is now in state PRIMARY
Mon May 13 19:07:59.673 [initandlisten] connection accepted from 10.71.26.90:50738 #2176 (29 connections now open)
Mon May 13 19:07:59.677 [conn2176]  authenticate db: admin { authenticate: 1, user: "mongo", nonce: "8f94b9003dc3c980", key: "ad59ed669b28d961e6db34033c785980" }
Mon May 13 19:07:59.691 [conn2176] end connection 10.71.26.90:50738 (28 connections now open)
Mon May 13 19:08:13.751 [conn2175] end connection 10.71.26.90:50732 (27 connections now open)
Mon May 13 19:08:13.753 [initandlisten] connection accepted from 10.71.26.90:50749 #2177 (28 connections now open)
Mon May 13 19:08:13.755 [conn2177]  authenticate db: local { authenticate: 1, nonce: "f4fb648338aebbc9", user: "__system", key: "77d24c9ffe7f0e3fe2749d1c042562d0" }
Mon May 13 19:08:15.606 [rsSync]   Fatal Assertion 16233
0xdcf361 0xd8f0d3 0xc03b0f 0xc21811 0xc218ad 0xc21b7c 0xe17cb9 0x7f313696ce9a 0x7f3135c7fcbd
 /home/mongo/mongodb/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8f0d3]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc03b0f]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc21811]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc218ad]
 /home/mongo/mongodb/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc21b7c]
 /home/mongo/mongodb/bin/mongod() [0xe17cb9]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f313696ce9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f3135c7fcbd]
Mon May 13 19:08:15.667 [rsSync]
 
***aborting after fassert() failure

On the primary, it looks like you are building an index in the foreground at about the same time as the final connection failure. This shouldn't cause an initial sync failure on its own, but it definitely adds load to your primary.
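
If that index does not need to be built in the foreground, one option worth trying is a background build so it does not block other operations on the primary; a sketch with placeholder collection and field names:

> db.mycollection.ensureIndex({ myField: 1 }, { background: true })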

Have you tried adding an arbiter to your cluster to try and cut down the flapping you're seeing?

Comment by George P. Stathis [ 28/May/13 ]

Hey guys, has anyone had a chance to take a look at this? Still no luck on our side here.

Comment by George P. Stathis [ 14/May/13 ]

Also:

$ cat /proc/sys/net/ipv4/tcp_keepalive_time
300

which we understand is the recommended setting for EC2: http://docs.mongodb.org/ecosystem/platforms/amazon-ec2/
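
For reference, setting that value and persisting it across reboots is roughly:

$ sudo sysctl -w net.ipv4.tcp_keepalive_time=300
$ echo 'net.ipv4.tcp_keepalive_time = 300' | sudo tee -a /etc/sysctl.conf   # persist across reboots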

Comment by George P. Stathis [ 14/May/13 ]

Adding the last 9 minutes or so of tcpdump activity on port 27017 on the secondary before the crash occurs.
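
One way to do this kind of capture and trimming; the interface, file names, and exact time window are illustrative:

$ sudo tcpdump -i eth0 -w secondary-27017.pcap 'port 27017'
$ editcap -A '2013-05-13 18:59:00' -B '2013-05-13 19:09:00' secondary-27017.pcap tcpdump.secondary.pcap   # editcap ships with wireshark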
