[SERVER-26780] SyncTail::getMissingDoc() should retry on SocketExceptions Created: 26/Oct/16 Updated: 08/Jan/24 Resolved: 13/Sep/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.12, 3.4.2, 3.5.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Rob Clancy | Assignee: | Backlog - Replication Team |
| Resolution: | Done | Votes: | 0 |
| Labels: | initialSync | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
ubuntu, mongo 3.2.10 |
||
| Assigned Teams: |
Replication
|
||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Description |
|
A secondary is failing to perform the initial sync with another secondary in order to join a replica set. The failure is caused by a socket receive timeout when talking to the other secondary during the initial sync. I have attached the final lines of the log from the secondary trying to join the replica set. NB: we never see any "network problem detected" lines in our logs, so it seems there are never any retries. I think the SocketException caused by the timeout is being caught earlier, and I do not believe the fix in https://jira.mongodb.org/browse/SERVER-9528 was correct, because the exception is swallowed. |
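The suspected failure mode can be illustrated with a minimal Python sketch. It is not the server code: the classes and the functions `low_level_receive`, `client_call`, and `get_missing_doc` are hypothetical stand-ins, and the only assumption taken from the report is that a network error is converted to a generic error before it reaches a handler that retries on network errors only.

```python
class DBException(Exception):
    """Hypothetical stand-in for a generic database error."""


class SocketException(DBException):
    """Hypothetical stand-in for a network error that is worth retrying."""


def low_level_receive():
    # Simulates the socket receive timeout described in the report.
    raise SocketException("socket recv timeout")


def client_call():
    # A lower layer catches the network error and rethrows a generic one,
    # so the caller can no longer tell that it was a network problem.
    try:
        low_level_receive()
    except SocketException as exc:
        raise DBException(f"converted: {exc}") from None


def get_missing_doc(fetch, log, max_attempts=3):
    """Call fetch(), retrying only when a SocketException is raised."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except SocketException:
            log.append("network problem detected")
            if attempt == max_attempts:
                raise
```

In this sketch, `get_missing_doc(client_call, log)` fails on the first attempt and leaves `log` empty, because the converted error is no longer a `SocketException`. That would match the absence of "network problem detected" lines in the logs.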
| Comments |
| Comment by A. Jesse Jiryu Davis [ 13/Sep/19 ] |
|
Obviated by |
| Comment by Judah Schvimer [ 12/Sep/19 ] |
|
jesse, this can be closed, correct? |
| Comment by Judah Schvimer [ 07/Feb/17 ] |
|
Dear rob.clancy@intercom.io,

Thank you for filing this ticket! I think you are correct that we swallow SocketExceptions in the MessagingPort and then convert them to DBExceptions in DBClientInterface. Thus, when we hit this block in initial sync while fetching missing documents, we do not retry. This code path does not appear to have changed since then, so it is still a problem in 3.2 and 3.4. I will move this into "Needs Triage" and change the summary to "SyncTail::getMissingDoc() should retry on SocketExceptions".

We handle SocketExceptions separately from DBExceptions here because some DBExceptions will keep happening no matter how many times we retry. One fix would be to simply retry on all exceptions. The cost seems low, because when getMissingDoc fails, initial sync restarts anyway. Delaying that restart by a few seconds is a small price to pay for resilience to network errors, and for errors that retrying cannot fix we should incur that delay only once.

The other fix would be to stop swallowing SocketExceptions. mira.carey@mongodb.com, do you know why we swallow those exceptions there?

Thanks, |
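The first option, retrying on all exceptions, could look roughly like the Python sketch below. This is an illustration only, not the server implementation: `fetch_with_retry`, its parameters, and the injectable `sleep` argument are hypothetical names chosen for the example, and the attempt count and delay are arbitrary.

```python
import time


def fetch_with_retry(fetch, max_attempts=3, delay_secs=1.0, sleep=time.sleep):
    """Call fetch(), retrying on any exception up to max_attempts times.

    Errors that retrying cannot fix cost a bounded delay of
    (max_attempts - 1) * delay_secs before the last error is re-raised.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # deliberately broad: retry on everything
            last_error = exc
            if attempt < max_attempts:
                sleep(delay_secs)  # brief pause before the next attempt
    raise last_error
```

A transient failure is absorbed by the loop, while a permanent one still surfaces to the caller after a short, fixed delay, which the caller can then treat as the signal to restart initial sync.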
| Comment by Rob Clancy [ 26/Oct/16 ] |
|
Logs from the secondary that was being synced from. |
| Comment by Rob Clancy [ 26/Oct/16 ] |
|
Logs from the node that was trying to join the replica set. |
| Comment by Ramon Fernandez Marina [ 26/Oct/16 ] |
|
rob.clancy@intercom.io, I don't see the logs attached. Can you please upload them again? It would be useful to see the logs from the failing node as well as from the sync source that node was using at the time. Thanks, |