Investigate changes in SERVER-121174: Improve detail on initial sync log messages

XMLWordPrintableJSON

    • Type: Investigation
    • Resolution: Won't Fix
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Tools and Replicator

      Original Downstream Change Summary

      Adds new fields to the initial sync serverStatus section and log message.:
      The replSetGetStatus / serverStatus initial sync progress document gains several new fields:

      Field Level Description
      phase top-level Current phase string (e.g. "cloningData")
      totalAttempts Stats Total number of sync attempts so far
      bytesPerSecond Stats Approximate copy throughput (bytes/sec) derived from approxTotalBytesCopied and elapsed time
      Each entry in the initialSyncAttempts array now also includes:

      Field Description
      phase Phase the attempt was in when it ended (or "complete" on success)
      nodeUptimeSecs Node wall-clock uptime at the time the attempt finished
      phaseDurations Array of

      {phase, durationMillis}

      objects covering every phase the attempt went through

      Description of Linked Ticket

      We currently log initial sync metrics via log 21192 at the end of each sync:

      {
      "method": "logical",
      "failedInitialSyncAttempts": 4,
      "maxFailedInitialSyncAttempts": 10,
      "approxTotalBytesCopied": 1555009768,
      "approxTotalDataSize": 1556594951,
      "databases": { "databasesCloned": 3 },
      "appliedOps": 6,
      "totalInitialSyncElapsedMillis": 12862496,
      "initialSyncAttempts": [
      {
      "durationMillis": 0,
      "operationsRetried": 0,
      "status": "InitialSyncOplogSourceMissing: No valid sync source found...",
      "syncSource": ":27017",
      "totalTimeUnreachableMillis": 0
      },
      // ... more attempts ...
      ]
      }
      

      We filter Atlas ingestion on "attr.status": "successful", so we only collect this from syncs that ultimately succeed. We can remove this ingestion filter and start recording failed syncs. To get more value out of these logs we can add the following fields:

      Field Location Type Purpose Priority
      phase initialSyncAttempts[i] string Which phase was active when each attempt ended/failed. Enables funnel chart: where do failures cluster? HIGH
      phase Top-level stats (live + final log) string Current phase visible in getInitialSyncProgress() while sync is running; preserved in the final log entry. HIGH
      totalAttempts Top-level stats int Total attempt count (succeeded + failed). Simplifies Atlas queries; no need to count array length. MEDIUM
      bytesPerSecond Top-level stats double Normalized cloning throughput. Directly answers "how fast is cloning?" without manual computation. MEDIUM
      nodeUptimeSecs Top-level stats int Node uptime at sync completion. Very short uptime relative to sync duration hints at a crash mid-sync, correlating with index rebuild or recovery issues. MEDIUM

      Phase Values (in order)

      startupchoosingSyncSourcetruncatingOplogcheckingRollbackbeginFetchingOpTimegettingFCVcloningDatabasesgettingStopTimestampmultiApplierrollbackCheckerCheckForRollback

      Fields Out of Scope

      Field Reason
      Replication lag during sync Requires periodic sampling of sync source lastApplied vs local lastApplied; no existing hook in InitialSyncer.
      Frequency of restarts during sync Requires a cross-boot persistent counter; out of scope for a single syncer instance.
      Index build memory during sync Lives in IndexBuildsCoordinator; not accessible to InitialSyncer.
      Crashes per unit time Requires durable cross-boot storage; out of scope.

      Example: New Log 21192 Output

      {
      "method": "logical",
      "failedInitialSyncAttempts": 4,
      "maxFailedInitialSyncAttempts": 10,
      "totalAttempts": 5,
      "totalInitialSyncElapsedMillis": 12862496,
      "phase": "rollbackCheckerCheckForRollback",
      "nodeUptimeSecs": 47,
      "approxTotalDataSize": 1556594951,
      "approxTotalBytesCopied": 1555009768,
      "bytesPerSecond": 120942.8,
      "appliedOps": 6,
      "databases": { "databasesCloned": 3 },
      "initialSyncAttempts": [
      {
      "durationMillis": 0,
      "status": "InitialSyncOplogSourceMissing: No valid sync source found...",
      "syncSource": ":27017",
      "operationsRetried": 0,
      "totalTimeUnreachableMillis": 0,
      "phase": "choosingSyncSource"
      },
      {
      "durationMillis": 0,
      "status": "InitialSyncOplogSourceMissing: No valid sync source found...",
      "syncSource": ":27017",
      "operationsRetried": 0,
      "totalTimeUnreachableMillis": 0,
      "phase": "choosingSyncSource"
      },
      {
      "durationMillis": 0,
      "status": "HostUnreachable: ...short read",
      "syncSource": "atlas",
      "operationsRetried": 0,
      "totalTimeUnreachableMillis": 0,
      "phase": "checkingRollback"
      },
      {
      "durationMillis": 12777492,
      "status": "InvalidSyncSource: ...Upstream node rolled back...",
      "syncSource": "atlas",
      "operationsRetried": 23475,
      "rollBackId": 2,
      "totalTimeUnreachableMillis": 12751131,
      "phase": "cloningDatabases"
      },
      {
      "durationMillis": 46922,
      "status": "OK",
      "syncSource": "atlas",
      "operationsRetried": 1,
      "rollBackId": 1,
      "totalTimeUnreachableMillis": 3,
      "phase": "rollbackCheckerCheckForRollback"
      }
      ]
      }
      

      Reading this example:

      • Attempts 0–1 failed at choosingSyncSource — no valid sync source available yet
      • Attempt 2 failed at checkingRollback — network instability reaching the primary
      • Attempt 3 failed at cloningDatabases — sync source rolled back mid-clone (~3.5 hours wasted)
      • Attempt 4 succeeded; nodeUptimeSecs: 47 suggests the node restarted at some point during the process
      • bytesPerSecond: 120942.8 (~121 KB/s) gives the normalized cloning rate

            Assignee:
            Unassigned
            Reporter:
            Backlog - Core Eng Program Management Team
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: