- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
We have a sharded cluster with a single shard, running MongoDB Community 4.4. One node holds 10TB of data. Every time we try to add a node, the initial sync fails with an error. Here are the most relevant lines from the log:
{"t":\{"$date":"2026-03-31T07:10:54.573+03:00"},"s":"I", "c":"STORAGE", "id":20674, "ctx":"initandlisten","msg":"Index builds manager completed successfully","attr":\{"buildUUID":{"uuid":{"$uuid":"dd170a9e-3352-4fb1-ae32-ee931416df8d"}},"namespace":"DOCSTORE_CONTENT.CONTENT_DOC_ADD.chunks","indexSpecsRequested":2,"numIndexesBefore":2,"numIndexesAfter":2}}
{"t":\{"$date":"2026-03-31T07:10:55.007+03:00"},"s":"I", "c":"FTDC", "id":20631, "ctx":"ftdc","msg":"Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost","attr":\{"error":{"code":0,"codeName":"OK"}}}
{"t":\{"$date":"2026-03-31T07:10:54.840+03:00"},"s":"W", "c":"SHARDING", "id":22668, "ctx":"replSetDistLockPinger","msg":"Pinging failed for distributed lock pinger","attr":\{"error":{"code":129,"codeName":"LockStateChangeFailed","errmsg":"findAndModify query predicate didn't match any lock document"}}}
{"t":\{"$date":"2026-03-31T07:10:55.007+03:00"},"s":"I", "c":"FTDC", "id":20631, "ctx":"ftdc","msg":"Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost","attr":\{"error":{"code":0,"codeName":"OK"}}}
{"t":\{"$date":"2026-03-31T07:10:55.391+03:00"},"s":"I", "c":"REPL", "id":4280509, "ctx":"ReplCoord-0","msg":"Local configuration validated for startup"}
{"t":\{"$date":"2026-03-31T07:10:55.391+03:00"},"s":"W", "c":"REPL", "id":21407, "ctx":"ReplCoord-0","msg":"Failed to load timestamp and/or wall clock time of most recently applied operation","attr":\{"error":{"code":113,"codeName":"InitialSyncFailure","errmsg":"In the middle of an initial sync."}}}
{"t":\{"$date":"2026-03-31T07:10:55.391+03:00"},"s":"I", "c":"REPL", "id":6015317, "ctx":"ReplCoord-0","msg":"Setting new configuration state","attr":\{"newState":"ConfigSteady","oldState":"ConfigStartingUp"}}
We also noticed from the logs that the failure occurs after a 2.5TB collection has been copied and its two indexes (both small) have been built. We have no network-connectivity or disk-space issues. We have tried adjusting parameters such as maxIndexBuildMemoryUsageMegabytes and the WiredTiger cache size, and we have also changed the sync source. The oplog is retained for 10 days.
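For reference, the kinds of adjustments described above can be applied as follows. The values shown are illustrative assumptions, not recommendations:

```shell
# Illustrative values only -- tune to your hardware.

# Raise the per-index-build memory limit (runtime-settable):
mongo --eval 'db.adminCommand({setParameter: 1, maxIndexBuildMemoryUsageMegabytes: 1024})'

# When restarting the syncing node: cap the WiredTiger cache and steer
# sync-source selection (initialSyncSourceReadPreference is startup-only in 4.4):
mongod --wiredTigerCacheSizeGB 24 \
       --setParameter initialSyncSourceReadPreference=secondaryPreferred
```

Both parameters exist in MongoDB 4.4; the cache size and memory limit above are placeholders for whatever the hardware allows.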
What can we do to add a node to the cluster successfully? We hit these errors every time, and the initial sync restarts from scratch. We also noticed very high memory consumption: the server has 57GB of RAM, and MongoDB consumes almost all of it, which is not always sufficient.
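Since the oplog window often determines whether an initial sync of this size can succeed at all, a back-of-the-envelope check helps rule it out. The copy rates below are assumptions for illustration, not measurements:

```shell
# Hours needed to copy 10 TB at several assumed sustained rates (MB/s),
# compared against the 10-day (240-hour) oplog retention window.
data_mb=$((10 * 1024 * 1024))   # 10 TB expressed in MB
for rate in 50 100 200; do
  echo "$rate MB/s -> $((data_mb / rate / 3600)) hours"
done
echo "oplog window: $((10 * 24)) hours"
```

Even at a pessimistic 50 MB/s the copy finishes in roughly 58 hours, well inside the 240-hour window, which suggests the repeated restarts stem from the failure itself rather than oplog truncation.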
is related to:
- SERVER-121174 Improve detail on initial sync log messages (Closed)
- SERVER-82037 Memory used by sorter spills can grow without bound (Closed)
- SERVER-111885 Index Build Merging Spills conflicts with replSetGetStatus (Closed)
- SERVER-107806 Capture Detailed Logical Initial Sync Metrics (In Progress)
- SERVER-86591 Enhance InitialSync Error Handling for Transient Network Issues (Closed)