[SERVER-9057] Replica set crashes during sync Created: 21/Mar/13 Updated: 10/Dec/14 Resolved: 09/Apr/13
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Stan Hu | Assignee: | J Rassi |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | initialSync, replication |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Operating System: | Linux |
| Participants: |
| Description |
I tried spinning up a new replica set from a previously unreplicated database. I upgraded to MongoDB 2.4.0 on both machines, deleted the local.* files on both, restarted them with the replSet option, and then ran rs.initiate() and rs.add() on the second machine. Sometime during the syncing phase, both servers crashed. The secondary reported:
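For context, a minimal sketch of the setup steps just described, assuming default data paths and ports; the service name, file locations, hostnames, and the set name "rs0" are placeholders, not values from this ticket:

```bash
# Hypothetical reconstruction of the steps above -- paths, hostnames,
# and the set name "rs0" are placeholders.

# On both machines: stop mongod and remove the old replication metadata.
sudo service mongodb stop
rm /var/lib/mongodb/local.*

# Restart both members with a replica set name.
mongod --dbpath /var/lib/mongodb --replSet rs0 \
       --logpath /var/log/mongodb/mongod.log --fork

# On one member: initiate the set, then add the other machine.
mongo --eval 'rs.initiate()'
mongo --eval 'rs.add("host2.example.com:27017")'
```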
| Comments |
| Comment by Jason McCay [ 17/Aug/13 ] |

Would any of the contributors to this ticket who are seeing this issue be willing to share the AWS region they are running in when the error occurs? For example, we have a customer seeing similar behavior, but it seems confined to AWS US West 2 (Oregon); we don't seem to run into this issue in other AWS data centers.
| Comment by J Rassi [ 08/Aug/13 ] |

hassek, bogdan.patrascoiu: could you both open separate tickets (and, when creating them, include full mongod logs for at least the primary and failing secondary)?
| Comment by Xiang Jun Wu [ 08/Aug/13 ] |

I also ran into the same issue; my OS is Ubuntu 12.04 as well.
| Comment by Bogdan Patrascoiu [ 08/Aug/13 ] |

I've been running into this for the last couple of days. After the primary->secondary replication finishes, the secondary stops being seen by the primary and the arbiter ("primary/arbiter thinks that we are down"):

- On stop/start of the secondary mongod process, all data is dropped and the sync starts again.
- My data set is around 500G, with an additional 100G going to the oplog.
- I couldn't find anything wrong with connectivity or the OS.

I'll next try to stop all writes and:

Are there any other updates/related info for this issue?
| Comment by Tomas Henriquez [ 15/Jul/13 ] |

I am having the exact same problem. I created the replica set from scratch and I keep receiving "replset info MYHOST:10001 thinks that we are down". After a while the primary recognizes it, and then: "DBClientCursor::init call() failed". If I then restart mongod, it reconnects, deletes everything, and starts copying the complete data set again. Using:
| Comment by J Rassi [ 09/Apr/13 ] |

I'll close this ticket, then; please re-open if you later encounter the same issue.
| Comment by Stan Hu [ 09/Apr/13 ] |

Yes, after I upgraded to 2.4.1, the initial sync worked fine.
| Comment by J Rassi [ 05/Apr/13 ] |

Stan, did the most recent initial sync complete?
| Comment by Marat Shakirov [ 01/Apr/13 ] |

Jason, Eliot: https://jira.mongodb.org/browse/SERVER-9199
| Comment by J Rassi [ 01/Apr/13 ] |

Marat: could you open a separate ticket for your issue? When creating it, please upload the full mongod logs for both members.
| Comment by Eliot Horowitz (Inactive) [ 01/Apr/13 ] |

In the log snippet sent, there was a socket exception. Can you send the secondary log from the last attempt?
| Comment by Marat Shakirov [ 01/Apr/13 ] |

An update: my MongoDB crashed on version 2.4.1, both primary and replica.
| Comment by Stan Hu [ 01/Apr/13 ] |

I upgraded to Mongo 2.4.1 and restarted a new replica set sync. We'll see what happens.
| Comment by Marat Shakirov [ 31/Mar/13 ] |

Confirming this bug: a new geo-distributed replica in the set failed initial sync (Ubuntu 12.04, MongoDB 2.4):

At the same time, on the primary:
| Comment by J Rassi [ 25/Mar/13 ] |

Checking back in again: could you upload the corresponding primary log for the same time period? Are you still able to reproduce this issue?
| Comment by J Rassi [ 23/Mar/13 ] |

The message "is down (or slow to respond)" indicates that a member has missed heartbeats from another member. This wouldn't cause the primary to close the TCP connection, however; the server doesn't initiate a close on sockets for active clients, even those members with missed heartbeats. Hence my suspicion that the close was OS-initiated. A packet capture would indicate which side initiated the connection teardown, so could you leave tcpdump running during your next attempt to reproduce this (a possible invocation is sketched below)? If you could upload the log for the primary, that would also help provide more context.

As for whether Mongo periodically re-resolves hostnames or re-initiates TCP connections: nope, but what makes you suspect this? Note also that I'll have to pick this issue back up on Monday.
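A capture along those lines might look like the following; the interface name, output path, and port are assumptions (27017 is the default mongod port; substitute the ports actually in use):

```bash
# Record replication traffic between the two members during the repro.
sudo tcpdump -i eth0 -w repl-teardown.pcap 'tcp port 27017'

# After the failure reproduces, list FIN/RST packets to see which side
# started the connection teardown.
tcpdump -r repl-teardown.pcap 'tcp[tcpflags] & (tcp-fin|tcp-rst) != 0'
```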
| Comment by Stan Hu [ 22/Mar/13 ] |

I see no evidence of network issues between the two machines; dmesg and /var/log/syslog look quiet with respect to networking on both. I think the TCP connection was closed on the primary machine because it deemed the secondary too slow or down, which caused the secondary to report the following log message before the one you quoted:

What I could believe is that flakiness in DNS resolution is causing this problem. Does Mongo do hostname lookups periodically and/or re-initiate TCP connections on a frequent basis?
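One quick way to test that DNS hypothesis would be to resolve the peer's hostname repeatedly and watch for failures or changing answers; the hostname and interval here are placeholders:

```bash
# Resolve the other member's hostname repeatedly; failing or changing
# answers would point at flaky DNS.
for i in $(seq 1 60); do
  getent hosts host2.example.com || echo "lookup failed at $(date)"
  sleep 5
done
```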
| Comment by J Rassi [ 21/Mar/13 ] |

I believe the problem is networking-related. This could be an OS-level configuration issue on one of the hosts involved, or an issue on the network itself. See the following log excerpt from the snippet in your description:

My interpretation is that the operating system is closing the TCP connection carrying replication traffic between the two machines. Searching the log you posted shows this repeating every 5-30 minutes, starting at Wed Mar 20 11:56:20.422 and continuing until the fatal assertion a few hours later, and then similarly after the server restart, starting at Wed Mar 20 19:13:35.048. Have you seen any other evidence of networking-related issues on these machines? Is it possible that a misconfigured script is bringing the network interface down/up at these times? For both machines, could you attach to the ticket:
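To check the interface-flap theory above, something along these lines could be run on both hosts; the interface name and log path are assumptions:

```bash
# Search for link-state changes near the failure timestamps.
grep -iE 'eth0.*(link|down|up)' /var/log/syslog
dmesg | grep -i eth0
```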
| Comment by Stan Hu [ 21/Mar/13 ] |

I'll upload the primary logs later (they are a bit big and slow for my connection here), but I just noticed the secondary's clock was off by 254 seconds, which accounts for the difference in the timestamps.
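As an aside, one way to measure and correct skew like that, assuming ntpdate is installed (the NTP pool server is an example):

```bash
# Query-only: report this host's clock offset without changing anything.
ntpdate -q pool.ntp.org

# If the offset is large (e.g. the 254 s seen here), step the clock once.
sudo ntpdate -u pool.ntp.org
```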
| Comment by Stan Hu [ 21/Mar/13 ] |

I see in the primary log:

I see similar patterns throughout the log. It looks to me like the primary demoted itself to secondary, and when the two reconnected, the secondary's assertion failed as a result. As a stopgap, that suggests we should bump the primary's priority up.
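A sketch of that stopgap via the shell; the member index and priority value are assumptions:

```bash
# Run against the current primary; members[0] is assumed to be the
# intended primary. Priorities default to 1, so 2 makes it preferred
# in elections.
mongo --eval '
  var cfg = rs.conf();
  cfg.members[0].priority = 2;
  rs.reconfig(cfg);
'
```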
| Comment by J Rassi [ 21/Mar/13 ] |

Could you post the log for the primary as well? Looks like the primary is unexpectedly closing the connection carrying the sync data; the primary's log should shed light on this.
| Comment by Stan Hu [ 21/Mar/13 ] |

Attached are the full logs of the secondary node.
| Comment by Scott Hernandez (Inactive) [ 21/Mar/13 ] |

Can you post the full logs for that node, please?