[SERVER-9057] Replica set crashes during sync Created: 21/Mar/13  Updated: 10/Dec/14  Resolved: 09/Apr/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Stan Hu Assignee: J Rassi
Resolution: Cannot Reproduce Votes: 0
Labels: initialSync, replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive mongodb-crash.zip    
Operating System: Linux
Participants:

 Description   

I tried spinning up a new replica set from a previously unreplicated database. I upgraded to Mongo 2.4.0 on both machines. I removed the local.* files on both machines, restarted them with the replSet option, and then ran rs.initiate() and rs.add() on the second machine.
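For reference, the sequence was roughly the following; the service name, dbpath, replica set name, and hostname in this sketch are illustrative rather than my exact values:

  # on both machines: stop mongod, remove the old local.* files, restart with --replSet
  sudo service mongodb stop
  rm /var/lib/mongodb/local.*
  mongod --dbpath /var/lib/mongodb --replSet rs0 --fork --logpath /var/log/mongodb/mongod.log

  # then initiate the set and add the other member by hostname
  mongo --eval 'rs.initiate()'
  mongo --eval 'rs.add("other-host:27017")'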

Sometime during the syncing phase, both servers crashed. The secondary reported:

Wed Mar 20 22:07:27.102 [rsSync] clone sensordb.gas_readings_by_hour 121215
Wed Mar 20 22:07:42.825 [conn440] end connection 10.10.0.2:36829 (0 connections now open)
Wed Mar 20 22:07:42.826 [initandlisten] connection accepted from 10.10.0.2:37534 #441 (2 connections now open)
Wed Mar 20 22:07:53.884 [rsHealthPoll] replset info owl:27017 thinks that we are down
Wed Mar 20 22:08:12.296 [initandlisten] connection accepted from 10.10.0.2:40000 #442 (2 connections now open)
Wed Mar 20 22:08:12.296 [initandlisten] connection accepted from 10.10.0.2:54924 #443 (3 connections now open)
Wed Mar 20 22:07:54.209 [rsSync] Socket flush send() errno:9 Bad file descriptor 10.10.0.2:27017
Wed Mar 20 22:08:12.296 [rsHealthPoll] replSet member owl:27017 is now in state SECONDARY
Wed Mar 20 22:08:12.296 [rsSync]   caught exception (socket exception [SEND_ERROR] for 10.10.0.2:27017) in destructor (~PiggyBackData)
Wed Mar 20 22:08:12.296 [conn441] end connection 10.10.0.2:37534 (2 connections now open)
Wed Mar 20 22:08:12.296 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 0 attempts remaining
Wed Mar 20 22:08:12.296 [conn442] end connection 10.10.0.2:40000 (1 connection now open)
Wed Mar 20 22:08:15.201 [DataFileSync] flushing mmaps took 36971ms  for 67 files
Wed Mar 20 22:08:18.305 [conn443] replSet info voting yea for owl:27017 (0)
Wed Mar 20 22:08:20.305 [rsHealthPoll] replSet member owl:27017 is now in state PRIMARY
Wed Mar 20 22:08:38.313 [conn443] end connection 10.10.0.2:54924 (0 connections now open)
Wed Mar 20 22:08:38.313 [initandlisten] connection accepted from 10.10.0.2:53202 #444 (1 connection now open)
Wed Mar 20 22:08:42.296 [rsSync]   Fatal Assertion 16233
0xdcae01 0xd8ab83 0xc0230f 0xc1df91 0xc1edad 0xc1f07c 0xe13709 0x7f7f5caa8e9a 0x7f7f5bdbbcbd
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcae01]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8ab83]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc0230f]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc1df91]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc1edad]
 /usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc1f07c]
 /usr/bin/mongod() [0xe13709]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f7f5caa8e9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f7f5bdbbcbd]
Wed Mar 20 22:08:42.300 [rsSync]
 
***aborting after fassert() failure
 
 
Wed Mar 20 22:08:42.300 Got signal: 6 (Aborted).
 
Wed Mar 20 22:08:42.304 Backtrace:
0xdcae01 0x6ce879 0x7f7f5bcfe4a0 0x7f7f5bcfe425 0x7f7f5bd01b8b 0xd8abbe 0xc0230f 0xc1df91 0xc1edad 0xc1f07c 0xe13709 0x7f7f5caa8e9a 0x7f7f5bdbbcbd
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcae01]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6ce879]
 /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f7f5bcfe4a0]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f7f5bcfe425]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f7f5bd01b8b]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd8abbe]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xc0230f]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc1df91]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc1edad]
 /usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc1f07c]
 /usr/bin/mongod() [0xe13709]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f7f5caa8e9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f7f5bdbbcbd]



 Comments   
Comment by Jason McCay [ 17/Aug/13 ]

Would any of the contributors to this ticket who are seeing this issue be willing to share the AWS region they are running in when they see this error?

For example, we have a customer seeing similar behavior, but it seems to be in AWS US West 2 (Oregon). We don't seem to run into this issue in other AWS data centers.

Comment by J Rassi [ 08/Aug/13 ]

hassek, bogdan.patrascoiu: could you both open separate tickets (and, when creating them, include full mongod logs for at least the primary and failing secondary)?

Comment by Xiang Jun Wu [ 08/Aug/13 ]

I also ran into the same issue, and my OS is Ubuntu 12.04 as well.
See https://jira.mongodb.org/browse/SERVER-10463

Comment by Bogdan Patrascoiu [ 08/Aug/13 ]

I've been running into this for the last couple of days.

After the primary->secondary replication finishes, the secondary stops being seen by the primary and the arbiter ("primary/arbiter thinks that we are down"):

-cut here-
Wed Aug 7 19:48:25.021 [rsSyncNotifier] Socket recv() timeout 10.64.134.7:PORT
Wed Aug 7 19:48:25.021 [rsSyncNotifier] SocketException: remote: 10.64.134.7:PORT error: 9001 socket exception [3] server [10.64.134.7:27018]
Wed Aug 7 19:48:25.021 [rsSyncNotifier] DBClientCursor::init call() failed
Wed Aug 7 19:48:25.943 [initandlisten] connection accepted from 127.0.0.1:44228 #10795 (71 connections now open)
Wed Aug 7 19:48:26.167 [rsSyncNotifier] replset tracking exception: exception: 9001 socket exception [FAILED_STATE] for ubvu-mongodb3:27018
Wed Aug 7 19:49:24.022 [rsSync] replSet initialSyncOplogApplication applied 334915 operations, synced to Aug 7 14:50:07:52
Wed Aug 7 19:49:24.262 [rsBackgroundSync] replSet syncing to: PRIMARY:PORT
Wed Aug 7 19:49:24.538 [conn10829] end connection 127.0.0.1:44328 (70 connections now open)
Wed Aug 7 19:49:25.588 [initandlisten] connection accepted from 127.0.0.1:44330 #10830 (71 connections now open)
Wed Aug 7 19:49:26.597 [conn10830] end connection 127.0.0.1:44330 (70 connections now open)
Wed Aug 7 19:49:27.646 [initandlisten] connection accepted from 127.0.0.1:44332 #10831 (71 connections now open)
Wed Aug 7 19:49:28.654 [conn10831] end connection 127.0.0.1:44332 (70 connections now open)
Wed Aug 7 19:49:29.702 [initandlisten] connection accepted from 127.0.0.1:44333 #10832 (71 connections now open)
Wed Aug 7 19:49:30.711 [conn10832] end connection 127.0.0.1:44333 (70 connections now open)
Wed Aug 7 19:49:31.049 [rsSyncNotifier] Socket recv() timeout 10.64.134.7(PRIMARY-IP):PORT
Wed Aug 7 19:49:31.049 [rsSyncNotifier] SocketException: remote: 10.64.134.7(PRIMARY-IP):PORT error: 9001 socket exception [3] server [10.64.134.7:27018]
Wed Aug 7 19:49:31.049 [rsSyncNotifier] DBClientCursor::init call() failed
Wed Aug 7 19:49:31.760 [initandlisten] connection accepted from 127.0.0.1:44334 #10833 (71 connections now open)
Wed Aug 7 19:49:34.542 [initandlisten] connection accepted from 10.28.221.31:34964 #10834 (72 connections now open)
Wed Aug 7 19:49:35.948 [rsHealthPoll] replset info PRIMARY-HOST:PORT thinks that we are down
Wed Aug 7 19:49:36.011 [rsHealthPoll] replset info ARBTER:PORT thinks that we are down
-cut here-

On stop/start of the secondary mongod process, all data is dropped and the sync starts again.

My data set is around 500 GB, with an additional 100 GB for the oplog.
The primary (EC2 m2.2xlarge, Ubuntu 12.04) -> secondary (EC2 m2.xlarge, Ubuntu 12.04) sync was done with the primary taking around ~30 inserts/second.
I've reproduced it on two different shards (same config as above) using 2.4.3 and 2.4.5.

I couldn't find anything wrong with connectivity or the OS.

I'll next try to stop all writes and:

  • sync the primary->secondary for one shard;
  • rsync the data onto the secondary for the other shard (roughly as sketched below).
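
For the rsync path, I'm thinking of something along these lines, with writes stopped on the primary and mongod stopped on the secondary (paths and hostname are placeholders):

  # copy the primary's data files directly onto the secondary
  rsync -av --delete /var/lib/mongodb/ secondary-host:/var/lib/mongodb/
  # then start mongod on the secondary with the same --replSet name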

Are there any other updates or related info for this issue?

Comment by Tomas Henriquez [ 15/Jul/13 ]

I am having the exact same problem. I created the replica set from scratch and I keep receiving "replset info MYHOST:10001 thinks that we are down". After a while the primary recognizes it:
"replSet member HOSTNAME:10001 is up"
"replSet member HOSTNAME:10001 is now in state STARTUP2"

and then:

"DBClientCursor::init call() failed"
"replSet info HOSTNAME:10001 is down (or slow to respond)"
"replSet member HOSTNAME:10001 is now in state DOWN"

Then, if I restart mongod, it reconnects, deletes everything, and starts copying the complete data set again.

Using:
MongoDB 2.4.5
Ubuntu 10.04.4 LTS on Amazon Web Services

Comment by J Rassi [ 09/Apr/13 ]

I'll close this ticket, then; please re-open if you later encounter the same issue.

Comment by Stan Hu [ 09/Apr/13 ]

Yes, after I upgraded to 2.4.1, the initial sync worked fine.

Comment by J Rassi [ 05/Apr/13 ]

Stan, did the most recent initial sync complete?

Comment by Marat Shakirov [ 01/Apr/13 ]

Jason, Eliot: https://jira.mongodb.org/browse/SERVER-9199

Comment by J Rassi [ 01/Apr/13 ]

Marat: could you open a separate ticket for your issue? When creating it, please upload the full mongod logs for both members.

Comment by Eliot Horowitz (Inactive) [ 01/Apr/13 ]

In the log snippet sent, there was a socket exception.
What is the network topology between the nodes?

Can you send the secondary log from the last attempt?

Comment by Marat Shakirov [ 01/Apr/13 ]

An update: my MongoDB crashed on version 2.4.1. Both the primary and the replica.

Comment by Stan Hu [ 01/Apr/13 ]

I upgraded to Mongo 2.4.1 and restarted a new replica set sync. We'll see what happens.

Comment by Marat Shakirov [ 31/Mar/13 ]

I can confirm this bug. A geo-distributed new replica in the set failed its initial sync:

Ubuntu 12.04, MongoDB 2.4

Mon Apr  1 00:04:46.934 [rsSync] 	 fastBuildIndex dupsToDrop:0
Mon Apr  1 00:04:47.099 [rsSync] build index done.  scanned 39325008 total records. 685.137 secs
Mon Apr  1 00:04:48.525 [rsSync] Socket say send() errno:104 Connection reset by peer 176.9.128.155:27017
Mon Apr  1 00:04:48.536 [rsSync] replSet initial sync exception: 9001 socket exception [2] server [176.9.128.155:27017]  0 attempts remaining
Mon Apr  1 00:04:59.297 [conn1739] end connection 88.198.197.220:40416 (1 connection now open)
Mon Apr  1 00:04:59.330 [initandlisten] connection accepted from 88.198.197.220:40441 #1741 (2 connections now open)
Mon Apr  1 00:05:00.757 [conn1740] end connection 176.9.128.155:49908 (1 connection now open)
Mon Apr  1 00:05:00.791 [initandlisten] connection accepted from 176.9.128.155:49911 #1742 (2 connections now open)
Mon Apr  1 00:05:18.539 [rsSync]   Fatal Assertion 16233
0xdc7f71 0xd87cf3 0xbff47f 0xc1b101 0xc1bf1d 0xc1c1ec 0xe10879 0x7f330ccc7e9a 0x7f330bfdacbd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdc7f71]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd87cf3]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xbff47f]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc1b101]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc1bf1d]
 /usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc1c1ec]
 /usr/bin/mongod() [0xe10879]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f330ccc7e9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f330bfdacbd]
Mon Apr  1 00:05:18.555 [rsSync] 
 
***aborting after fassert() failure
 
 
Mon Apr  1 00:05:18.555 Got signal: 6 (Aborted).
 
Mon Apr  1 00:05:18.562 Backtrace:
0xdc7f71 0x6ce459 0x7f330bf1d4a0 0x7f330bf1d425 0x7f330bf20b8b 0xd87d2e 0xbff47f 0xc1b101 0xc1bf1d 0xc1c1ec 0xe10879 0x7f330ccc7e9a 0x7f330bfdacbd 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdc7f71]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6ce459]
 /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f330bf1d4a0]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f330bf1d425]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f330bf20b8b]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd87d2e]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x6f) [0xbff47f]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x71) [0xc1b101]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x2d) [0xc1bf1d]
 /usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x6c) [0xc1c1ec]
 /usr/bin/mongod() [0xe10879]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f330ccc7e9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f330bfdacbd]

At the same time, on the primary:

Mon Apr  1 00:05:18.739 [conn13240] end connection 168.63.87.58:1245 (32 connections now open)
Mon Apr  1 00:05:19.078 [rsHealthPoll] DBClientCursor::init call() failed
Mon Apr  1 00:05:19.120 [rsHealthPoll] replset info repl-10:27017 heartbeat failed, retrying
Mon Apr  1 00:05:19.162 [rsHealthPoll] replSet info repl-10:27017 is down (or slow to respond): 
Mon Apr  1 00:05:19.162 [rsHealthPoll] replSet member repl-10:27017 is now in state DOWN
Mon Apr  1 00:05:21.196 [rsHealthPoll] replset info repl-10:27017 heartbeat failed, retrying
Mon Apr  1 00:05:21.745 [conn332] command songs.$cmd command: { count: "tracks_replication_log", query: { timestamp: { $gte: new Date(1364760261531) } } } ntoreturn:1 keyUpdates:0 numYields: 200 locks(micros) r:341585 reslen:48 212ms
Mon Apr  1 00:05:23.231 [rsHealthPoll] replset info repl-10:27017 heartbeat failed, retrying
Mon Apr  1 00:05:25.265 [rsHealthPoll] replset info repl-10:27017 heartbeat failed, retrying
Mon Apr  1 00:05:27.299 [rsHealthPoll] replset info repl-10:27017 heartbeat failed, retrying
Mon Apr  1 00:05:29.334 [rsHealthPoll] replset info repl-10:27017 heartbeat failed, retrying
Mon Apr  1 00:05:31.369 [rsHealthPoll] couldn't connect to repl-10:27017: couldn't connect to server repl-10:27017
Mon Apr  1 00:05:31.404 [rsHealthPoll] couldn't connect to repl-10:27017: couldn't connect to server repl-10:27017
Mon Apr  1 00:05:31.438 [rsHealthPoll] replset info repl-10:27017 heartbeat failed, retrying
Mon Apr  1 00:05:31.472 [rsHealthPoll] couldn't connect to repl-10:27017: couldn't connect to server repl-10:27017

 

Comment by J Rassi [ 25/Mar/13 ]

Checking back in again – could you upload the corresponding primary log for the same time period? Are you still able to reproduce this issue?

Comment by J Rassi [ 23/Mar/13 ]

"I think the TCP connection was closed on the primary machine because it deemed the secondary too slow or down."

The message "is down (or slow to respond)" indicates that a member has missed heartbeats from another member. This wouldn't cause the primary to close the TCP connection, however; the server doesn't initiate a close on sockets for active clients, even those members with missed heartbeats. Thus my suspicion that it was OS-initiated.

A packet capture would indicate which side initiated the connection teardown. Could you leave tcpdump running during your next attempt to reproduce this?
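Something along these lines should be sufficient, for example; the interface and output path are placeholders, and the peer address is taken from your log:

  # capture the replication traffic between the two members on port 27017
  sudo tcpdump -i eth0 -w /tmp/repl-sync.pcap host 10.10.0.2 and port 27017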

If you could upload the log for the primary, that would help provide more context.

"What I could believe is that flakiness in DNS resolving could be causing this problem. Does Mongo do hostname lookups periodically and/or re-initiate TCP connections on a frequent basis?"

Nope – but what makes you suspect this?

Note also that SERVER-9087 was discovered today, which affects members that performed an initial sync using 2.4.0 while the primary was under write load. I didn't see any signs of this in your log, however, so I believe it is not the cause of the issue here. Can you confirm that your primary wasn't performing any writes at all during this time?

I'll have to pick this issue back up on Monday.

Comment by Stan Hu [ 22/Mar/13 ]

I see no evidence of network issues between the two machines. dmesg and /var/log/syslog look quiet with respect to network issues on both machines.

I think the TCP connection was closed on the primary machine because it deemed the secondary too slow or down. This caused the secondary to report the following log message before the one you quoted:

Wed Mar 20 22:07:53.884 [rsHealthPoll] replset info owl:27017 thinks that we are down

What I could believe is that flakiness in DNS resolving could be causing this problem. Does Mongo do hostname lookups periodically and/or re-initiate TCP connections on a frequent basis?

Comment by J Rassi [ 21/Mar/13 ]

I believe the problem is networking-related. This could be an OS-level configuration issue on one of the hosts involved, or an issue on the network.

See the following log excerpt from the snippet in your description:

Wed Mar 20 22:07:54.209 [rsSync] Socket flush send() errno:9 Bad file descriptor 10.10.0.2:27017

My interpretation here is that the operating system is closing the TCP connection carrying replication traffic between the two machines. Searching the log you posted shows this repeating every 5-30 minutes starting at Wed Mar 20 11:56:20.422 until the fatal assertion a few hours later, and then similarly after the server restart, starting at Wed Mar 20 19:13:35.048.

Have you seen any other evidence of networking-related issues on these machines? Is it possible that a misconfigured script is bringing the network interface down/up at these times?

For both machines, could you attach to the ticket:

  • the output of running the dmesg command
  • the contents of /var/log/syslog (a quick way to collect both is sketched below)
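
A minimal way to collect both on each machine would be something like the following (output filenames are arbitrary):

  dmesg > dmesg-$(hostname).txt
  sudo cp /var/log/syslog syslog-$(hostname).txt
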
Comment by Stan Hu [ 21/Mar/13 ]

I'll upload the primary logs later -- they are a bit big and slow to upload over my connection here -- but I just noticed the secondary's clock was off by 254 seconds, which accounts for the difference in the timestamps.
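For what it's worth, the offset is easy to check on each host with a one-off NTP query (the server here is just an example):

  ntpdate -q pool.ntp.org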

Comment by Stan Hu [ 21/Mar/13 ]

I see in the primary log:

Wed Mar 20 22:12:03.964 [rsHealthPoll] DBClientCursor::init call() failed
Wed Mar 20 22:12:03.986 [rsHealthPoll] replSet info hawk.corp.aclimalabs.com:27017 is down (or slow to respond):
Wed Mar 20 22:12:03.986 [rsHealthPoll] replSet member hawk.corp.aclimalabs.com:27017 is now in state DOWN
Wed Mar 20 22:12:03.986 [rsMgr] can't see a majority of the set, relinquishing primary
Wed Mar 20 22:12:03.986 [rsMgr] replSet relinquishing primary state
Wed Mar 20 22:12:03.986 [rsMgr] replSet SECONDARY
Wed Mar 20 22:12:03.986 [rsMgr] replSet closing client sockets after relinquishing primary

I see similar patterns throughout the log.

It looks to me like the primary stepped down to secondary, and when they reconnected, the secondary's assertion failed as a result.

As a stopgap, that suggests we should bump up the primary's priority.
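Something like the following reconfig, run against the primary, should do it, I think; the member index and priority value are just assumptions about our config:

  mongo --eval 'var cfg = rs.conf(); cfg.members[0].priority = 2; rs.reconfig(cfg)'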

Comment by J Rassi [ 21/Mar/13 ]

Could you post the log for the primary as well? Looks like the primary is unexpectedly closing the connection carrying the sync data – the primary's log should shed light on this.

Comment by Stan Hu [ 21/Mar/13 ]

Attached are the full logs of the secondary node.

Comment by Scott Hernandez (Inactive) [ 21/Mar/13 ]

Can you post the full logs for that node please?
