-
Type:
Investigation
-
Resolution: Won't Fix
-
Priority:
Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
Tools and Replicator
Adds new fields to the initial sync serverStatus section and log message.:
The replSetGetStatus / serverStatus initial sync progress document gains several new fields:
Field Level Description
phase top-level Current phase string (e.g. "cloningData")
totalAttempts Stats Total number of sync attempts so far
bytesPerSecond Stats Approximate copy throughput (bytes/sec) derived from approxTotalBytesCopied and elapsed time
Each entry in the initialSyncAttempts array now also includes:
Field Description
phase Phase the attempt was in when it ended (or "complete" on success)
nodeUptimeSecs Node wall-clock uptime at the time the attempt finished
phaseDurations Array of
objects covering every phase the attempt went through
Description of Linked Ticket
We currently log initial sync metrics via log 21192 at the end of each sync:
{
"method": "logical",
"failedInitialSyncAttempts": 4,
"maxFailedInitialSyncAttempts": 10,
"approxTotalBytesCopied": 1555009768,
"approxTotalDataSize": 1556594951,
"databases": { "databasesCloned": 3 },
"appliedOps": 6,
"totalInitialSyncElapsedMillis": 12862496,
"initialSyncAttempts": [
{
"durationMillis": 0,
"operationsRetried": 0,
"status": "InitialSyncOplogSourceMissing: No valid sync source found...",
"syncSource": ":27017",
"totalTimeUnreachableMillis": 0
},
// ... more attempts ...
]
}
We filter Atlas ingestion on "attr.status": "successful", so we only collect this from syncs that ultimately succeed. We can remove this ingestion filter and start recording failed syncs. To get more value out of these logs we can add the following fields:
| Field | Location | Type | Purpose | Priority |
|---|---|---|---|---|
| phase | initialSyncAttempts[i] | string | Which phase was active when each attempt ended/failed. Enables funnel chart: where do failures cluster? | HIGH |
| phase | Top-level stats (live + final log) | string | Current phase visible in getInitialSyncProgress() while sync is running; preserved in the final log entry. | HIGH |
| totalAttempts | Top-level stats | int | Total attempt count (succeeded + failed). Simplifies Atlas queries; no need to count array length. | MEDIUM |
| bytesPerSecond | Top-level stats | double | Normalized cloning throughput. Directly answers "how fast is cloning?" without manual computation. | MEDIUM |
| nodeUptimeSecs | Top-level stats | int | Node uptime at sync completion. Very short uptime relative to sync duration hints at a crash mid-sync, correlating with index rebuild or recovery issues. | MEDIUM |
Phase Values (in order)
startup → choosingSyncSource → truncatingOplog → checkingRollback → beginFetchingOpTime → gettingFCV → cloningDatabases → gettingStopTimestamp → multiApplier → rollbackCheckerCheckForRollback
Fields Out of Scope
| Field | Reason |
|---|---|
| Replication lag during sync | Requires periodic sampling of sync source lastApplied vs local lastApplied; no existing hook in InitialSyncer. |
| Frequency of restarts during sync | Requires a cross-boot persistent counter; out of scope for a single syncer instance. |
| Index build memory during sync | Lives in IndexBuildsCoordinator; not accessible to InitialSyncer. |
| Crashes per unit time | Requires durable cross-boot storage; out of scope. |
Example: New Log 21192 Output
{
"method": "logical",
"failedInitialSyncAttempts": 4,
"maxFailedInitialSyncAttempts": 10,
"totalAttempts": 5,
"totalInitialSyncElapsedMillis": 12862496,
"phase": "rollbackCheckerCheckForRollback",
"nodeUptimeSecs": 47,
"approxTotalDataSize": 1556594951,
"approxTotalBytesCopied": 1555009768,
"bytesPerSecond": 120942.8,
"appliedOps": 6,
"databases": { "databasesCloned": 3 },
"initialSyncAttempts": [
{
"durationMillis": 0,
"status": "InitialSyncOplogSourceMissing: No valid sync source found...",
"syncSource": ":27017",
"operationsRetried": 0,
"totalTimeUnreachableMillis": 0,
"phase": "choosingSyncSource"
},
{
"durationMillis": 0,
"status": "InitialSyncOplogSourceMissing: No valid sync source found...",
"syncSource": ":27017",
"operationsRetried": 0,
"totalTimeUnreachableMillis": 0,
"phase": "choosingSyncSource"
},
{
"durationMillis": 0,
"status": "HostUnreachable: ...short read",
"syncSource": "atlas",
"operationsRetried": 0,
"totalTimeUnreachableMillis": 0,
"phase": "checkingRollback"
},
{
"durationMillis": 12777492,
"status": "InvalidSyncSource: ...Upstream node rolled back...",
"syncSource": "atlas",
"operationsRetried": 23475,
"rollBackId": 2,
"totalTimeUnreachableMillis": 12751131,
"phase": "cloningDatabases"
},
{
"durationMillis": 46922,
"status": "OK",
"syncSource": "atlas",
"operationsRetried": 1,
"rollBackId": 1,
"totalTimeUnreachableMillis": 3,
"phase": "rollbackCheckerCheckForRollback"
}
]
}
Reading this example:
- Attempts 0–1 failed at choosingSyncSource — no valid sync source available yet
- Attempt 2 failed at checkingRollback — network instability reaching the primary
- Attempt 3 failed at cloningDatabases — sync source rolled back mid-clone (~3.5 hours wasted)
- Attempt 4 succeeded; nodeUptimeSecs: 47 suggests the node restarted at some point during the process
- bytesPerSecond: 120942.8 (~121 KB/s) gives the normalized cloning rate
- depends on
-
SERVER-121174 Improve detail on initial sync log messages
-
- Closed
-