[SERVER-50067] Initial sync attempt failed multiple times and throw MaxTimeMSExpired error Created: 01/Aug/20  Updated: 27/Oct/23  Resolved: 09/Aug/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Zhuoran Dong Assignee: Dmitry Agranat
Resolution: Community Answered Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

From the log below, the initial sync attempt failed multiple times by the same MaxTimeMSExpired error and shut down finally. Could you help me to find the reason for this?

2020-07-27T21:47:45.305+0000 E - [replication-11] Assertion: BSONObjectTooLarge: BSONObj size: 36469414 (0x22C7AA6) is invalid. Size must be between 0 and 16793600(16MB) First element: failedInitialSyncAttempts: 9 src/mongo/bson/bsonobj.cpp 98
2020-07-27T21:47:45.305+0000 I INITSYNC [replication-11] Error creating initial sync progress object: BSONObjectTooLarge: BSONObj size: 36469414 (0x22C7AA6) is invalid. Size must be between 0 and 16793600(16MB) First element: failedInitialSyncAttempts: 9
2020-07-27T21:47:45.305+0000 I INITSYNC [replication-11] Initial Sync Attempt Statistics: { failedInitialSyncAttempts: 9, maxFailedInitialSyncAttempts: 10, initialSyncStart: new Date(1595879408073), initialSyncAttempts: [

Unknown macro: { durationMillis}

, { durationMillis: 1449358, status: "MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: operation exceeded time limit", syncSource: "infra1.sd.tusimple.ai:27017" }, { durationMillis: 82011, status: "NetworkInterfaceExceededTimeLimit: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: Request 48...", syncSource: "infra1.sd.tusimple.ai:27017" }, { durationMillis: 1268589, status: "MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: operation exceeded time limit", syncSource: "infra1.sd.tusimple.ai:27017" }, { durationMillis: 100113, status: "MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: operation exceeded time limit", syncSource: "infra1.sd.tusimple.ai:27017" }, { durationMillis: 109455, status: "MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: operation exceeded time limit", syncSource: "infra1.sd.tusimple.ai:27017" }, { durationMillis: 132139, status: "MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: operation exceeded time limit", syncSource: "infra1.sd.tusimple.ai:27017" }, { durationMillis: 120688, status: "MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: operation exceeded time limit", syncSource: "infra1.sd.tusimple.ai:27017" }, { durationMillis: 81710, status: "MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: operation exceeded time limit", syncSource: "infra1.sd.tusimple.ai:27017" } ], fetchedMissingDocs: 0, appliedOps: 0, initialSyncOplogStart: Timestamp(1595884194, 9), initialSyncOplogFetchingStart: Timestamp(1595884060, 1) }
2020-07-27T21:47:45.305+0000 E INITSYNC [replication-11] Initial sync attempt failed – attempts left: 0 cause: MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: operation exceeded time limit
2020-07-27T21:47:45.305+0000 F INITSYNC [replication-11] The maximum number of retries have been exhausted for initial sync.
2020-07-27T21:47:45.305+0000 I STORAGE [replication-11] Finishing collection drop for local.temp_oplog_buffer (11ccdcf7-894f-4b8e-9323-2f354c26cbc4).
2020-07-27T21:47:45.307+0000 E REPL [replication-11] Initial sync failed, shutting down now. Restart the server to attempt a new initial sync.
2020-07-27T21:47:45.307+0000 F - [replication-11] Fatal assertion 40088 MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: operation exceeded time limit at src/mongo/db/repl/replication_coordinator_impl.cpp 745
2020-07-27T21:47:45.307+0000 F - [replication-11]

***aborting after fassert() failure



 Comments   
Comment by Zhuoran Dong [ 03/Aug/20 ]

Got it. Thank you.

Comment by Dmitry Agranat [ 03/Aug/20 ]

Hi zhuoran.dong@tusimple.ai,

The reason for the failing initial sync is due to inability to replicate data which BSON size is larger than the maximum 16 MB. I am not sure if you have multiple objects like this but some are 45 MB, for example: BSONObjectTooLarge: BSONObj size: 47724603. I recommend fixing the BSONObj size issue in your cluster and retrying initial sync.

Thanks,
Dima

Comment by Zhuoran Dong [ 03/Aug/20 ]

Hi Dmitry, the mongod.log and diagnostic.data directory are uploaded.

Comment by Dmitry Agranat [ 02/Aug/20 ]

Hi zhuoran.dong@tusimple.ai, so that we can investigate these issues further, would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory covering one or more of these incidents (the contents are described here) and upload them to this [support uploader|https://10gen-httpsupload.s3.amazonaws.com/upload_forms/91f42bb8-1a8e-43f8-8e71-2be4d739ba43.html location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima

Generated at Thu Feb 08 05:21:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.