[SERVER-26780] SyncTail::getMissingDoc() should retry on SocketExceptions Created: 26/Oct/16  Updated: 08/Jan/24  Resolved: 13/Sep/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.12, 3.4.2, 3.5.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Rob Clancy Assignee: Backlog - Replication Team
Resolution: Done Votes: 0
Labels: initialSync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu, mongo 3.2.10


Attachments: Text File mms-mongo-1-106.log     Text File mms-mongo-1-110.log    
Issue Links:
Depends
depends on SERVER-42022 Attempt to remove initial sync missin... Closed
Related
is related to SERVER-27950 Add SocketException to the list of Ne... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

A secondary is failing to perform the initial sync with another secondary to join a replica set.

It fails due to a socket receive timeout when talking to the other secondary during the initial sync.

I have attached the final lines of the log from the secondary trying to join the replica set.

NB: we never see any "network problem detected" lines in our logs, so it seems the fetch is never retried:
https://github.com/mongodb/mongo/blob/r3.2.10/src/mongo/db/repl/sync_tail.cpp#L968-L969

I think the SocketException due to the timeout is being caught earlier:
https://github.com/mongodb/mongo/blob/r3.2.10/src/mongo/util/net/message_port.cpp#L204-L210
which then triggers the assertion exception
https://github.com/mongodb/mongo/blob/r3.2.10/src/mongo/client/dbclient.cpp#L811-L814

I do not believe the fix in https://jira.mongodb.org/browse/SERVER-9528 was correct due to the exception swallowing.
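
To illustrate the suspected path, here is a minimal, self-contained sketch. It is not the actual MongoDB code: every type and function name below (DBException, SocketException, recvMessage, clientCall, fetchMissingDocOnce) is a simplified stand-in invented for this example. It only shows how a retry handler that catches SocketException is bypassed once a lower layer has caught that exception and a different one is thrown in its place.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; not the real MongoDB classes.
struct DBException : std::runtime_error {
    explicit DBException(const std::string& msg) : std::runtime_error(msg) {}
};
struct SocketException : DBException {
    explicit SocketException(const std::string& msg) : DBException(msg) {}
};

// Transport layer: catches the socket error and reports failure as a bool.
bool recvMessage(bool simulateTimeout) {
    try {
        if (simulateTimeout) {
            throw SocketException("socket recv timeout");
        }
        return true;
    } catch (const SocketException&) {
        return false;  // the original exception type is lost here
    }
}

// Client layer: turns the failed call into a generic DBException.
void clientCall(bool simulateTimeout) {
    if (!recvMessage(simulateTimeout)) {
        throw DBException("error communicating with server");
    }
}

enum class Outcome { Ok, RetriedNetworkError, FailedWithoutRetry };

// Caller: the retry branch is only reachable via SocketException.
Outcome fetchMissingDocOnce(bool simulateTimeout) {
    try {
        clientCall(simulateTimeout);
        return Outcome::Ok;
    } catch (const SocketException&) {
        return Outcome::RetriedNetworkError;  // would log and retry
    } catch (const DBException&) {
        return Outcome::FailedWithoutRetry;   // what a timeout actually hits
    }
}
```

In this sketch, fetchMissingDocOnce(true) ends in the generic DBException branch, which matches the absence of any "network problem detected" log lines.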



 Comments   
Comment by A. Jesse Jiryu Davis [ 13/Sep/19 ]

Obviated by SERVER-42022: In 4.3+ we no longer fetch missing documents during initial sync.

Comment by Judah Schvimer [ 12/Sep/19 ]

jesse, this can be closed, correct?

Comment by Judah Schvimer [ 07/Feb/17 ]

Dear rob.clancy@intercom.io,

Thank you for filing this ticket! I think you are correct that we swallow SocketExceptions in the MessagingPort and then convert them to DBExceptions in DBClientInterface. As a result, when initial sync hits this block while fetching missing documents, we do not retry. This code path does not appear to have changed since then, so it is still a problem in 3.2 and 3.4.

I will move this into "Needs Triage" and change the summary to "SyncTail::getMissingDoc() should retry on SocketExceptions".

We handle SocketExceptions separately from DBExceptions here because some DBExceptions will keep recurring no matter how often we retry. One fix would be to simply retry on all exceptions. The cost seems low because a getMissingDoc failure already leads to initial sync restarting. Delaying that restart by a few seconds is a small price to pay for network error resilience, and for errors that retrying cannot fix we should only incur that delay once.

The other fix would be to stop swallowing SocketExceptions. mira.carey@mongodb.com, do you know why we swallow those exceptions there?

Thanks,
Judah

Comment by Rob Clancy [ 26/Oct/16 ]

Logs from the secondary that was being used as the sync source.

Comment by Rob Clancy [ 26/Oct/16 ]

Logs from the node that was trying to join the replica set.

Comment by Ramon Fernandez Marina [ 26/Oct/16 ]

rob.clancy@intercom.io, I don't see the logs attached, can you please upload them again? It would be useful to see the failing node as well as the logs for the sync source being used at the time by that node.

Thanks,
Ramón.
