Core Server / SERVER-94259

Move network activity from BaseCloner's preStage/postStage into runStage to make use of retry loop

    • Replication
    • Fully Compatible
    • ALL
    • v8.0, v7.0, v6.0
    • Repl 2024-09-16, Repl 2024-09-30, Repl 2024-10-14, Repl 2024-10-28
    • 200

      If a network issue causes the `collStats` command to fail with a HostUnreachable error, initial sync does not resume and retry on that collection; the next attempt restarts from the beginning. `collStats` is called as part of preStage for collectionCloner. In the BaseCloner class, which has preStage, runStage, and postStage, only runStage is inside a retry loop, which retries retryable errors for up to `initialSyncTransientErrorRetryPeriodSeconds`. When something in preStage fails, it throws an exception, the initial sync failed status is set, and the exception is caught with no retry attempted. Because preStage has no retry loop, initial sync starts over from the beginning instead of resuming from the collection it had trouble cloning.
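To make the control flow described above concrete, here is a minimal Python model of it. This is a sketch under stated assumptions, not the server's C++ code: `ClonerModel`, `TransientNetworkError`, `attempt`, and the failure counters are hypothetical names used only for illustration, and the retry period is a plain constructor argument standing in for `initialSyncTransientErrorRetryPeriodSeconds`.

```python
import time


class TransientNetworkError(Exception):
    """Stands in for a retryable error such as HostUnreachable."""


class ClonerModel:
    """Hypothetical model: only run_stage sits inside the retry loop."""

    def __init__(self, retry_period_secs=5.0):
        self.retry_period_secs = retry_period_secs
        self.pre_failures = 0   # how many times pre_stage should fail
        self.run_failures = 0   # how many times run_stage should fail

    def _maybe_fail(self, counter_name):
        # Simulate a flaky network call that fails a fixed number of times.
        remaining = getattr(self, counter_name)
        if remaining > 0:
            setattr(self, counter_name, remaining - 1)
            raise TransientNetworkError("HostUnreachable")

    def pre_stage(self):
        self._maybe_fail("pre_failures")

    def run_stage(self):
        self._maybe_fail("run_failures")

    def post_stage(self):
        pass

    def run(self):
        self.pre_stage()  # no retry: a transient error escapes immediately
        deadline = time.monotonic() + self.retry_period_secs
        while True:
            try:
                self.run_stage()
                break
            except TransientNetworkError:
                # Retry until the retry period has elapsed, then give up.
                if time.monotonic() >= deadline:
                    raise
        self.post_stage()  # no retry here either


def attempt(cloner):
    """Report what the caller would do after one cloner run."""
    try:
        cloner.run()
        return "completed"
    except TransientNetworkError:
        return "restart initial sync"
```

In this model, three transient failures in `run_stage` are absorbed by the loop, while a single failure in `pre_stage` ends the attempt, which mirrors the behavior reported in this ticket.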

      preStage and postStage were designed to not have retry logic, as noted in a code comment, so it might be better to move network activity out of preStage and postStage than to add a retry loop around them.
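One way to picture the proposed change is the Python sketch below, where the stats fetch becomes an ordinary retried stage and preStage keeps only local work. All names here (`CollectionClonerModel`, `stats_stage`, `clone_stage`, `run_with_retries`, `make_flaky`) are hypothetical and chosen for illustration; the actual stage layout in the server may differ.

```python
import time


class TransientNetworkError(Exception):
    """Stands in for a retryable error such as HostUnreachable."""


def run_with_retries(stage, retry_period_secs):
    """Retry `stage` on transient errors until the retry period elapses."""
    deadline = time.monotonic() + retry_period_secs
    while True:
        try:
            return stage()
        except TransientNetworkError:
            if time.monotonic() >= deadline:
                raise


class CollectionClonerModel:
    """Hypothetical model with the network call moved out of pre_stage."""

    def __init__(self, fetch_stats, retry_period_secs=5.0):
        self._fetch_stats = fetch_stats  # assumed collStats-like network call
        self.retry_period_secs = retry_period_secs
        self.stats = None
        self.log = []

    def pre_stage(self):
        # Local bookkeeping only; no network activity remains here.
        self.log.append("pre")

    def stats_stage(self):
        # Previously part of pre_stage; as a regular stage it is now retried.
        self.stats = self._fetch_stats()
        self.log.append("stats")

    def clone_stage(self):
        self.log.append("clone")

    def post_stage(self):
        self.log.append("post")

    def run(self):
        self.pre_stage()
        for stage in (self.stats_stage, self.clone_stage):
            run_with_retries(stage, self.retry_period_secs)
        self.post_stage()


def make_flaky(failures, result):
    """Build a call that raises `failures` times, then returns `result`."""
    remaining = [failures]

    def call():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise TransientNetworkError("HostUnreachable")
        return result

    return call
```

With this layout, a cloner built as `CollectionClonerModel(make_flaky(2, {"count": 10}))` completes its run despite two transient failures, and preStage and postStage stay free of retry logic, in keeping with their original design.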

       

      It might also help to make log messages such as the following clearer:

      "id":21183, "ctx":"ReplCoordExtern-5","msg":"Finished cloning data. Beginning oplog replay","attr":{"databaseClonerFinishStatus":"InitialSyncFailure: ###"}}

      and 

      "id":21181, "ctx":"ReplCoordExtern-1","msg":"Finished fetching oplog during initial sync","attr":{"oplogFetcherFinishStatus":"CallbackCanceled: ###"}

      These messages make it sound as though initial sync is still progressing, even though an earlier error has already occurred that will cause the callbacks to be canceled.

            Assignee:
            ruchitha.rajaghatta@mongodb.com Ruchitha Rajaghatta
            Reporter:
            evelyn.wu@mongodb.com Evelyn Wu
