[SERVER-20711] Fatal assertion in secondary when sync failed due to network problems Created: 01/Oct/15  Updated: 10/Feb/16  Resolved: 01/Oct/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: samuel charron Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-22563 Establishing remote cursor during rol... Closed
Operating System: ALL
Participants:

 Description   

We have a three-machine MongoDB setup: one primary and two secondaries.
One of the secondaries crashed without explanation.

Setup:
primary: IP 1.2.3.4
secondary-1: the machine that crashed
secondary-2: IP 5.6.7.8

On all machines: mongod version v3.0.6-rc0

Log:

2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] Socket recv() timeout  1.2.3.4:27017
2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017] 
2015-09-29T15:56:42.683+0200 I NETWORK  [ReplExecNetThread-3242] DBClientCursor::init call() failed
2015-09-29T15:56:42.683+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to primary:27017; Location10276 DBClientBase::findN: transport error: primary:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D6F686177504275334D5231345A474F554351326350474267525A4632507A5454) }
2015-09-29T15:56:42.684+0200 I REPL     [ReplicationExecutor] Standing for election
2015-09-29T15:56:42.739+0200 I REPL     [ReplicationExecutor] not electing self, secondary-2:27017 would veto with 'secondary-1:27017 is trying to elect itself but primary:27017 is already primary and more up-to-date'
2015-09-29T15:56:42.739+0200 I REPL     [ReplicationExecutor] not electing self, we are not freshest
2015-09-29T15:56:43.709+0200 I NETWORK  [conn62818] end connection 5.6.7.8:57459 (10 connections now open) 
2015-09-29T15:56:43.721+0200 I NETWORK  [initandlisten] connection accepted from 5.6.7.8:57462 #62820 (11 connections now open) 
2015-09-29T15:56:43.774+0200 I ACCESS   [conn62820] Successfully authenticated as principal __system on local
2015-09-29T15:56:44.246+0200 I REPL     [ReplicationExecutor] replSetElect voting yea for secondary-2:27017 (5)
2015-09-29T15:56:44.635+0200 I REPL     [ReplicationExecutor] Member secondary-2:27017 is now in state PRIMARY
2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] Socket recv() timeout  1.2.3.4:27017
2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017] 
2015-09-29T15:56:54.687+0200 I NETWORK  [ReplExecNetThread-3258] DBClientCursor::init call() failed
2015-09-29T15:56:54.687+0200 I REPL     [ReplicationExecutor] Error in heartbeat request to primary:27017; Location10276 DBClientBase::findN: transport error: primary:27017 ns: local.$cmd query: { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: BinData(0, 6E2C2C6E3D5F5F73797374656D2C723D664D4C49456F584D7050524A3231325A665A6A32557A4B674977307041537978) }
2015-09-29T15:56:55.079+0200 E REPL     [rsBackgroundSync] sync producer problem: 10278 dbclient error communicating with server: primary:27017
2015-09-29T15:56:55.079+0200 I -        [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for primary:27017 (1.2.3.4) failed) in destructor (kill)
2015-09-29T15:56:55.080+0200 I REPL     [ReplicationExecutor] syncing from: secondary-2:27017
2015-09-29T15:56:55.159+0200 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to secondary-2:27017
2015-09-29T15:56:55.169+0200 I REPL     [rsBackgroundSync] replSet our last op time fetched: Sep 29 15:56:49:2
2015-09-29T15:56:55.169+0200 I REPL     [rsBackgroundSync] replset source's GTE: Sep 29 15:56:52:1
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] beginning rollback
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 0
2015-09-29T15:56:55.175+0200 I REPL     [ReplicationExecutor] transition to ROLLBACK
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 1
2015-09-29T15:56:55.175+0200 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2015-09-29T15:56:55.199+0200 I NETWORK  [rsBackgroundSync] DBClientCursor::init call() failed
2015-09-29T15:56:55.208+0200 E REPL     [rsBackgroundSync] remote oplog empty or unreadable
2015-09-29T15:56:55.208+0200 I NETWORK  [SyncSourceFeedback] DBClientCursor::init call() failed
2015-09-29T15:56:55.208+0200 I -        [rsBackgroundSync] Fatal Assertion 18752
2015-09-29T15:56:55.208+0200 I -        [rsBackgroundSync] 
 
***aborting after fassert() failure
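
For context on the sequence above: the node's last fetched optime (Sep 29 15:56:49:2) is older than the earliest entry its new sync source returned (Sep 29 15:56:52:1), so it cannot simply resume tailing the oplog and instead begins a rollback to find a common point. A minimal sketch of that decision, using hypothetical names rather than the actual mongod code:

// Hypothetical sketch (not the actual mongod source) of why the node
// entered rollback after switching sync sources.
#include <cstdint>
#include <iostream>

// An oplog optime: seconds since epoch plus an increment within that
// second, matching the "Sep 29 15:56:49:2" style in the log above.
struct OpTime {
    int64_t seconds;
    int32_t increment;
};

bool operator<(const OpTime& a, const OpTime& b) {
    return a.seconds < b.seconds ||
           (a.seconds == b.seconds && a.increment < b.increment);
}

// The sync source answers a query { ts: { $gte: ourLastFetched } }. If the
// first entry it returns is newer than our last fetched op, the source's
// oplog does not contain that op: this node has writes the source never
// saw, so it must roll them back to reach a common point.
bool mustRollBack(const OpTime& ourLastFetched, const OpTime& sourceFirstGte) {
    return ourLastFetched < sourceFirstGte;
}

int main() {
    OpTime ours{1443535009, 2};       // "our last op time fetched: Sep 29 15:56:49:2"
    OpTime sourceGte{1443535012, 1};  // "replset source's GTE: Sep 29 15:56:52:1"
    if (mustRollBack(ours, sourceGte))
        std::cout << "beginning rollback\n";  // matches the log line above
}

The fatal assertion then fires because the rollback itself could not read the sync source's oplog (the DBClientCursor::init failure), not because of the rollback decision as such.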



 Comments   
Comment by samuel charron [ 02/Oct/15 ]

@ramon.fernandez, will you reopen this ticket?

Comment by samuel charron [ 01/Oct/15 ]

So mongod does not tolerate temporary network problems? #thisIsNotHowItShouldWork

Comment by Ramon Fernandez Marina [ 01/Oct/15 ]

The server did not crash: it deliberately shut down (that's what the fatal assertion means) when, due to external factors (networking problems), it could not complete the operations (synchronizing from another node) required to remain part of this replica set. I'll adjust the title of the ticket to avoid confusion.
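
For illustration, a rough sketch of what a fatal assertion amounts to (the general pattern only, not mongod's actual implementation):

// Sketch of fatal-assertion semantics: log the assertion id, then abort
// the process on purpose so it cannot continue in an inconsistent state.
#include <cstdlib>
#include <iostream>

[[noreturn]] void fatalAssertionFailed(int msgid) {
    std::cerr << "Fatal Assertion " << msgid << std::endl;
    std::cerr << "***aborting after fassert() failure" << std::endl;
    std::abort();  // a deliberate shutdown, not a segfault or crash
}

void fassert(int msgid, bool ok) {
    if (!ok)
        fatalAssertionFailed(msgid);
}

int main() {
    bool remoteOplogReadable = false;   // what the log reports above
    fassert(18752, remoteOplogReadable);  // the assertion id seen in the log
}

The abort is the mechanism, not the malfunction: stopping at a known-bad point is safer than continuing to apply operations from an oplog the node can no longer read.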

Regards,
Ramón.

Comment by samuel charron [ 01/Oct/15 ]

Are you saying that a crash is not a bug?

Comment by Ramon Fernandez Marina [ 01/Oct/15 ]

Thanks for the additional information, samuel.charron, and glad to hear this secondary is back in business. It seems clear that the issues you're running into are due to problems with the network and not a bug in mongod, so I'm going to close this ticket.

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion, please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this, involving more discussion, would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by samuel charron [ 01/Oct/15 ]

After restarting secondary-1, it had trouble resynchronizing, with lots of log lines like these:

2015-10-01T11:04:05.691+0200 I NETWORK [rsBackgroundSync] Socket recv() timeout 1.2.3.4:27017
2015-10-01T11:04:05.691+0200 I NETWORK [rsBackgroundSync] SocketException: remote: 1.2.3.4:27017 error: 9001 socket exception [RECV_TIMEOUT] server [1.2.3.4:27017]
2015-10-01T11:04:05.691+0200 I - [rsBackgroundSync] caught exception (socket exception [FAILED_STATE] for primary:27017 (1.2.3.4) failed) in destructor (kill)
2015-10-01T11:04:05.691+0200 W REPL [rsBackgroundSync] rollback 2 exception 10278 dbclient error communicating with server: primary:27017; sleeping 1 min

But eventually it stopped, and the secondary is now running without problems.
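
The "sleeping 1 min" lines above show the recovery path working: each failed rollback attempt is followed by a pause and a retry rather than an abort, so a transient network error can eventually clear. A minimal sketch of that retry pattern (an assumed illustration, not the actual mongod code; tryRollback is a hypothetical stand-in):

// Assumed illustration of the "rollback 2 exception ...; sleeping 1 min"
// retry pattern seen in the log above.
#include <chrono>
#include <iostream>
#include <thread>

static int attempts = 0;

// Stand-in for a single rollback attempt; pretend the first two attempts
// hit a transient network error.
bool tryRollback() { return ++attempts >= 3; }

int main() {
    while (!tryRollback()) {
        std::cerr << "rollback exception; sleeping 1 min\n";
        std::this_thread::sleep_for(std::chrono::minutes(1));
    }
    std::cout << "rollback succeeded; resuming replication\n";
}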

The full log does not look very interesting. It's mainly lines like these:

2015-09-29T15:56:29.514+0200 I NETWORK [initandlisten] connection accepted from 1.2.3.4:57700 #62819 (11 connections now open)
2015-09-29T15:56:29.539+0200 I ACCESS [conn62819] Successfully authenticated as principal __system on local

every 15 seconds (alternating between primary and secondary-2 connections)

Comment by Ramon Fernandez Marina [ 01/Oct/15 ]

samuel.charron, the following lines in the log:

2015-09-29T15:56:55.199+0200 I NETWORK  [rsBackgroundSync] DBClientCursor::init call() failed
2015-09-29T15:56:55.208+0200 E REPL     [rsBackgroundSync] remote oplog empty or unreadable
2015-09-29T15:56:55.208+0200 I NETWORK  [SyncSourceFeedback] DBClientCursor::init call() failed
2015-09-29T15:56:55.208+0200 I -        [rsBackgroundSync] Fatal Assertion 18752

indicate that this node could not talk to secondary-2 to sync. This could have been caused by a transient network problem, for example.

What happens when you restart secondary-1? Does the problem happen again? Can you send us full logs for secondary-1 so we can take a closer look?

Thanks,
Ramón.
