-
Type:
Sub-task
-
Resolution: Unresolved
-
Priority:
Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Replication
-
None
-
Replication
-
Repl 2026-03-02, Repl 2026-03-16
-
None
-
None
-
None
-
None
-
None
-
None
-
None
We currently log initial sync metrics via log 21192 at the end of each sync:
{
"method": "logical",
"failedInitialSyncAttempts": 4,
"maxFailedInitialSyncAttempts": 10,
"approxTotalBytesCopied": 1555009768,
"approxTotalDataSize": 1556594951,
"databases": { "databasesCloned": 3 },
"appliedOps": 6,
"totalInitialSyncElapsedMillis": 12862496,
"initialSyncAttempts": [
{
"durationMillis": 0,
"operationsRetried": 0,
"status": "InitialSyncOplogSourceMissing: No valid sync source found...",
"syncSource": ":27017",
"totalTimeUnreachableMillis": 0
},
// ... more attempts ...
]
}
We filter Atlas ingestion on "attr.status": "successful", so we only collect this from syncs that ultimately succeed. We can remove this ingestion filter and start recording failed syncs. To get more value out of these logs we can add the following fields:
| Field | Location | Type | Purpose | Priority |
|---|---|---|---|---|
| phase | initialSyncAttempts[i] | string | Which phase was active when each attempt ended/failed. Enables funnel chart: where do failures cluster? | HIGH |
| phase | Top-level stats (live + final log) | string | Current phase visible in getInitialSyncProgress() while sync is running; preserved in the final log entry. | HIGH |
| totalAttempts | Top-level stats | int | Total attempt count (succeeded + failed). Simplifies Atlas queries; no need to count array length. | MEDIUM |
| bytesPerSecond | Top-level stats | double | Normalized cloning throughput. Directly answers "how fast is cloning?" without manual computation. | MEDIUM |
| nodeUptimeSecs | Top-level stats | int | Node uptime at sync completion. Very short uptime relative to sync duration hints at a crash mid-sync, correlating with index rebuild or recovery issues. | MEDIUM |
Phase Values (in order)
startup → choosingSyncSource → truncatingOplog → checkingRollback → beginFetchingOpTime → gettingFCV → cloningDatabases → gettingStopTimestamp → multiApplier → rollbackCheckerCheckForRollback
Fields Out of Scope
| Field | Reason |
|---|---|
| Replication lag during sync | Requires periodic sampling of sync source lastApplied vs local lastApplied; no existing hook in InitialSyncer. |
| Frequency of restarts during sync | Requires a cross-boot persistent counter; out of scope for a single syncer instance. |
| Index build memory during sync | Lives in IndexBuildsCoordinator; not accessible to InitialSyncer. |
| Crashes per unit time | Requires durable cross-boot storage; out of scope. |
Example: New Log 21192 Output
{
"method": "logical",
"failedInitialSyncAttempts": 4,
"maxFailedInitialSyncAttempts": 10,
"totalAttempts": 5,
"totalInitialSyncElapsedMillis": 12862496,
"phase": "rollbackCheckerCheckForRollback",
"nodeUptimeSecs": 47,
"approxTotalDataSize": 1556594951,
"approxTotalBytesCopied": 1555009768,
"bytesPerSecond": 120942.8,
"appliedOps": 6,
"databases": { "databasesCloned": 3 },
"initialSyncAttempts": [
{
"durationMillis": 0,
"status": "InitialSyncOplogSourceMissing: No valid sync source found...",
"syncSource": ":27017",
"operationsRetried": 0,
"totalTimeUnreachableMillis": 0,
"phase": "choosingSyncSource"
},
{
"durationMillis": 0,
"status": "InitialSyncOplogSourceMissing: No valid sync source found...",
"syncSource": ":27017",
"operationsRetried": 0,
"totalTimeUnreachableMillis": 0,
"phase": "choosingSyncSource"
},
{
"durationMillis": 0,
"status": "HostUnreachable: ...short read",
"syncSource": "atlas",
"operationsRetried": 0,
"totalTimeUnreachableMillis": 0,
"phase": "checkingRollback"
},
{
"durationMillis": 12777492,
"status": "InvalidSyncSource: ...Upstream node rolled back...",
"syncSource": "atlas",
"operationsRetried": 23475,
"rollBackId": 2,
"totalTimeUnreachableMillis": 12751131,
"phase": "cloningDatabases"
},
{
"durationMillis": 46922,
"status": "OK",
"syncSource": "atlas",
"operationsRetried": 1,
"rollBackId": 1,
"totalTimeUnreachableMillis": 3,
"phase": "rollbackCheckerCheckForRollback"
}
]
}
Reading this example:
- Attempts 0–1 failed at choosingSyncSource — no valid sync source available yet
- Attempt 2 failed at checkingRollback — network instability reaching the primary
- Attempt 3 failed at cloningDatabases — sync source rolled back mid-clone (~3.5 hours wasted)
- Attempt 4 succeeded; nodeUptimeSecs: 47 suggests the node restarted at some point during the process
- bytesPerSecond: 120942.8 (~121 KB/s) gives the normalized cloning rate