- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
We have a sharded cluster with a single shard, running MongoDB Community 4.4. One node holds 10TB of data. Every time we try to add a node, the initial sync fails with an error. Here are the most relevant lines from the log:
{"t":\{"$date":"2026-03-31T07:10:54.573+03:00"},"s":"I", "c":"STORAGE", "id":20674, "ctx":"initandlisten","msg":"Index builds manager completed successfully","attr":\{"buildUUID":{"uuid":{"$uuid":"dd170a9e-3352-4fb1-ae32-ee931416df8d"}},"namespace":"DOCSTORE_CONTENT.CONTENT_DOC_ADD.chunks","indexSpecsRequested":2,"numIndexesBefore":2,"numIndexesAfter":2}}
{"t":\{"$date":"2026-03-31T07:10:55.007+03:00"},"s":"I", "c":"FTDC", "id":20631, "ctx":"ftdc","msg":"Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost","attr":\{"error":{"code":0,"codeName":"OK"}}}
{"t":\{"$date":"2026-03-31T07:10:54.840+03:00"},"s":"W", "c":"SHARDING", "id":22668, "ctx":"replSetDistLockPinger","msg":"Pinging failed for distributed lock pinger","attr":\{"error":{"code":129,"codeName":"LockStateChangeFailed","errmsg":"findAndModify query predicate didn't match any lock document"}}}
{"t":\{"$date":"2026-03-31T07:10:55.007+03:00"},"s":"I", "c":"FTDC", "id":20631, "ctx":"ftdc","msg":"Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost","attr":\{"error":{"code":0,"codeName":"OK"}}}
{"t":\{"$date":"2026-03-31T07:10:55.391+03:00"},"s":"I", "c":"REPL", "id":4280509, "ctx":"ReplCoord-0","msg":"Local configuration validated for startup"}
{"t":\{"$date":"2026-03-31T07:10:55.391+03:00"},"s":"W", "c":"REPL", "id":21407, "ctx":"ReplCoord-0","msg":"Failed to load timestamp and/or wall clock time of most recently applied operation","attr":\{"error":{"code":113,"codeName":"InitialSyncFailure","errmsg":"In the middle of an initial sync."}}}
{"t":\{"$date":"2026-03-31T07:10:55.391+03:00"},"s":"I", "c":"REPL", "id":6015317, "ctx":"ReplCoord-0","msg":"Setting new configuration state","attr":\{"newState":"ConfigSteady","oldState":"ConfigStartingUp"}}
We also noticed from the logs that the failure occurs after a 2.5TB collection has been copied and its two indexes (both small) have been built. We have no network-connectivity or disk-space issues. We have tried adjusting parameters such as maxIndexBuildMemoryUsageMegabytes and the WiredTiger cache size, and we have also changed the sync source. The oplog is retained for 10 days.
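For reference, the kinds of adjustments described above can be applied as follows. The values shown are illustrative assumptions, not recommendations:

```shell
# Illustrative values only -- tune to your hardware.

# Raise the per-index-build memory limit (runtime-settable):
mongo --eval 'db.adminCommand({setParameter: 1, maxIndexBuildMemoryUsageMegabytes: 1024})'

# When restarting the syncing node: cap the WiredTiger cache and steer
# sync-source selection (initialSyncSourceReadPreference is startup-only in 4.4):
mongod --wiredTigerCacheSizeGB 24 \
       --setParameter initialSyncSourceReadPreference=secondaryPreferred
```

Both parameters exist in MongoDB 4.4; the cache size and memory limit above are placeholders for whatever the hardware allows.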
What can we do to add a node to the cluster successfully? We hit these errors every time, and the initial sync restarts from scratch. We also noticed very high memory consumption: the server has 57GB of RAM, and MongoDB consumes almost all of it, which is not always sufficient.
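Since the oplog window often determines whether an initial sync of this size can succeed at all, a back-of-the-envelope check helps rule it out. The copy rates below are assumptions for illustration, not measurements:

```shell
# Hours needed to copy 10 TB at several assumed sustained rates (MB/s),
# compared against the 10-day (240-hour) oplog retention window.
data_mb=$((10 * 1024 * 1024))   # 10 TB expressed in MB
for rate in 50 100 200; do
  echo "$rate MB/s -> $((data_mb / rate / 3600)) hours"
done
echo "oplog window: $((10 * 24)) hours"
```

Even at a pessimistic 50 MB/s the copy finishes in roughly 58 hours, well inside the 240-hour window, which suggests the repeated restarts stem from the failure itself rather than oplog truncation.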
is related to:
- SERVER-121174 Improve detail on initial sync log messages (Closed)
- SERVER-82037 Memory used by sorter spills can grow without bound (Closed)
- SERVER-111885 Index Build Merging Spills conflicts with replSetGetStatus (Closed)
- SERVER-107806 Capture Detailed Logical Initial Sync Metrics (In Progress)
- SERVER-86591 Enhance InitialSync Error Handling for Transient Network Issues (Closed)