- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- Replication
- Fully Compatible
- ALL
- v8.0, v7.0, v6.0
- Repl 2024-09-16, Repl 2024-09-30, Repl 2024-10-14, Repl 2024-10-28
- 200
If a network issue causes the `collStats` command to fail with a HostUnreachable error, initial sync does not resume and retry on that collection; instead, it restarts from the beginning on the next attempt. `collStats` is called as part of the preStage for the CollectionCloner. In the BaseCloner class, which has preStage, runStage, and postStage, only runStage runs inside a retry loop that retries retryable errors within `initialSyncTransientErrorRetryPeriodSeconds`. When something in preStage fails, it throws an exception; the initial sync status is set to failed here, and the exception is caught here. Because there is no retry loop around preStage, initial sync starts over from the beginning instead of resuming from the collection it had trouble cloning.
preStage and postStage were deliberately designed without retry logic, as seen in this comment, so it might be better to move network activity out of preStage and postStage rather than add a retry loop around them.
Also, it might be good to make log messages like the following clearer:
"id":21183, "ctx":"ReplCoordExtern-5","msg":"Finished cloning data. Beginning oplog replay","attr":{"databaseClonerFinishStatus":"InitialSyncFailure: ###"}}
and
"id":21181, "ctx":"ReplCoordExtern-1","msg":"Finished fetching oplog during initial sync","attr":{"oplogFetcherFinishStatus":"CallbackCanceled: ###"}
These messages make it sound as though initial sync is still proceeding, but an earlier error has already occurred that will cause these callbacks to be canceled.