- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Replication
- Replication
- Fully Compatible
- ALL
- v8.0, v7.0, v6.0
- Repl 2024-09-16, Repl 2024-09-30, Repl 2024-10-14, Repl 2024-10-28
- 200
If a network issue causes the `collStats` command to fail with a HostUnreachable error, initial sync does not resume and retry on that collection; instead, it restarts from the beginning on the next attempt. `collStats` is called as part of the preStage for the CollectionCloner. In the BaseCloner class, which has preStage, runStage, and postStage, only runStage runs inside a retry loop that retries retryable errors within `initialSyncTransientErrorRetryPeriodSeconds`. When something in preStage fails, it throws an exception; the initial sync status is set to failed here, and the exception is caught here. Because there is no retry loop around preStage, initial sync starts over from the beginning instead of resuming from the collection it had trouble cloning.
preStage and postStage were deliberately designed without retry logic, as seen in this comment, so it might be better to move network activity out of preStage and postStage rather than add a retry loop around them.
Also, it might be good to make log messages like the following clearer:
"id":21183, "ctx":"ReplCoordExtern-5","msg":"Finished cloning data. Beginning oplog replay","attr":{"databaseClonerFinishStatus":"InitialSyncFailure: ###"}}
and
"id":21181, "ctx":"ReplCoordExtern-1","msg":"Finished fetching oplog during initial sync","attr":{"oplogFetcherFinishStatus":"CallbackCanceled: ###"}
These messages make it sound as though initial sync is still proceeding, but an earlier error has already occurred that will cause these callbacks to be canceled.