• Type: Sub-task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Replication
    • None
    • Replication
    • Repl 2026-03-02, Repl 2026-03-16
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      We currently log initial sync metrics via log 21192 at the end of each sync:

      {
      "method": "logical",
      "failedInitialSyncAttempts": 4,
      "maxFailedInitialSyncAttempts": 10,
      "approxTotalBytesCopied": 1555009768,
      "approxTotalDataSize": 1556594951,
      "databases": { "databasesCloned": 3 },
      "appliedOps": 6,
      "totalInitialSyncElapsedMillis": 12862496,
      "initialSyncAttempts": [
      {
      "durationMillis": 0,
      "operationsRetried": 0,
      "status": "InitialSyncOplogSourceMissing: No valid sync source found...",
      "syncSource": ":27017",
      "totalTimeUnreachableMillis": 0
      },
      // ... more attempts ...
      ]
      }
      

      We filter Atlas ingestion on "attr.status": "successful", so we only collect this from syncs that ultimately succeed. We can remove this ingestion filter and start recording failed syncs. To get more value out of these logs we can add the following fields:

      Field Location Type Purpose Priority
      phase initialSyncAttempts[i] string Which phase was active when each attempt ended/failed. Enables funnel chart: where do failures cluster? HIGH
      phase Top-level stats (live + final log) string Current phase visible in getInitialSyncProgress() while sync is running; preserved in the final log entry. HIGH
      totalAttempts Top-level stats int Total attempt count (succeeded + failed). Simplifies Atlas queries; no need to count array length. MEDIUM
      bytesPerSecond Top-level stats double Normalized cloning throughput. Directly answers "how fast is cloning?" without manual computation. MEDIUM
      nodeUptimeSecs Top-level stats int Node uptime at sync completion. Very short uptime relative to sync duration hints at a crash mid-sync, correlating with index rebuild or recovery issues. MEDIUM

      Phase Values (in order)

      startupchoosingSyncSourcetruncatingOplogcheckingRollbackbeginFetchingOpTimegettingFCVcloningDatabasesgettingStopTimestampmultiApplierrollbackCheckerCheckForRollback

      Fields Out of Scope

      Field Reason
      Replication lag during sync Requires periodic sampling of sync source lastApplied vs local lastApplied; no existing hook in InitialSyncer.
      Frequency of restarts during sync Requires a cross-boot persistent counter; out of scope for a single syncer instance.
      Index build memory during sync Lives in IndexBuildsCoordinator; not accessible to InitialSyncer.
      Crashes per unit time Requires durable cross-boot storage; out of scope.

      Example: New Log 21192 Output

      {
      "method": "logical",
      "failedInitialSyncAttempts": 4,
      "maxFailedInitialSyncAttempts": 10,
      "totalAttempts": 5,
      "totalInitialSyncElapsedMillis": 12862496,
      "phase": "rollbackCheckerCheckForRollback",
      "nodeUptimeSecs": 47,
      "approxTotalDataSize": 1556594951,
      "approxTotalBytesCopied": 1555009768,
      "bytesPerSecond": 120942.8,
      "appliedOps": 6,
      "databases": { "databasesCloned": 3 },
      "initialSyncAttempts": [
      {
      "durationMillis": 0,
      "status": "InitialSyncOplogSourceMissing: No valid sync source found...",
      "syncSource": ":27017",
      "operationsRetried": 0,
      "totalTimeUnreachableMillis": 0,
      "phase": "choosingSyncSource"
      },
      {
      "durationMillis": 0,
      "status": "InitialSyncOplogSourceMissing: No valid sync source found...",
      "syncSource": ":27017",
      "operationsRetried": 0,
      "totalTimeUnreachableMillis": 0,
      "phase": "choosingSyncSource"
      },
      {
      "durationMillis": 0,
      "status": "HostUnreachable: ...short read",
      "syncSource": "atlas",
      "operationsRetried": 0,
      "totalTimeUnreachableMillis": 0,
      "phase": "checkingRollback"
      },
      {
      "durationMillis": 12777492,
      "status": "InvalidSyncSource: ...Upstream node rolled back...",
      "syncSource": "atlas",
      "operationsRetried": 23475,
      "rollBackId": 2,
      "totalTimeUnreachableMillis": 12751131,
      "phase": "cloningDatabases"
      },
      {
      "durationMillis": 46922,
      "status": "OK",
      "syncSource": "atlas",
      "operationsRetried": 1,
      "rollBackId": 1,
      "totalTimeUnreachableMillis": 3,
      "phase": "rollbackCheckerCheckForRollback"
      }
      ]
      }
      

      Reading this example:

      • Attempts 0–1 failed at choosingSyncSource — no valid sync source available yet
      • Attempt 2 failed at checkingRollback — network instability reaching the primary
      • Attempt 3 failed at cloningDatabases — sync source rolled back mid-clone (~3.5 hours wasted)
      • Attempt 4 succeeded; nodeUptimeSecs: 47 suggests the node restarted at some point during the process
      • bytesPerSecond: 120942.8 (~121 KB/s) gives the normalized cloning rate

            Assignee:
            Denis Trailin
            Reporter:
            Denis Trailin
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated: