[SERVER-57671] Initial Sync fails when collections are renamed or dropped Created: 13/Jun/21  Updated: 27/Oct/23  Resolved: 30/Jun/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Wernfried Domscheit Assignee: Wenbin Zhu
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB 4.4.3 Community Edition
Running on Red Hat Linux


Issue Links:
Related
related to SERVER-58164 When group insert fails, the error ty... Closed
Operating System: ALL
Sprint: Repl 2021-07-12
Participants:

 Description   

I have a sharded cluster for a rather big application which inserts around 15'000 documents every second.

db.getCollection('sessions').insertOne({ t: ISODate() /* , some more fields ... */ }) // 10'000-20'000 inserts per second!

Once per hour I run a bucketing operation. In principle it looks like this:

db.getCollection('sessions').renameCollection('sessions.temp');
db.getCollection('sessions').createIndexes([{ t: 1 }], {}, 1);
db.getCollection('sessions.temp').aggregate([
   { $group: ... },
   { $out: "sessions.temp.stats" }
]);
db.getCollection('sessions.temp.stats').aggregate([
   // ...
   { $merge: { into: { db: "data", coll: "session.statistics" } } }
]);
db.getCollection('sessions.temp').aggregate([
   { $group: ... },
   { $merge: { into: { db: "data", coll: "sessions.20210613" } } }
]);
db.getCollection('sessions.temp.stats').drop({ writeConcern: { w: 0, wtimeout: 60000 } })
db.getCollection('sessions.temp').drop({ writeConcern: { w: 0, wtimeout: 60000 } })

When I drop a replica set member and restart it, an initial sync starts as expected. The initial sync takes around 8 hours, i.e. the bucketing job above runs while the initial sync is in progress (without any problems).

However, once all databases are cloned (rs.status() states "databases: {databasesToClone: 0, databasesCloned: 9 ..."), I get thousands of errors like

Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedInserts": ...

See attached log file for more details.

I get many thousands of these errors. The disk runs out of space and MongoDB stops working!

If I disable the hourly bucketing job, the initial sync runs without any problem. So I assume the issue is caused by the dropping/renaming, re-use of collection names, etc.
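
For reference, the cloning progress quoted above can be inspected from the mongo shell on the syncing member. A minimal sketch, assuming the initialSyncStatus section is present in the replSetGetStatus/rs.status() output (MongoDB 4.2.1+); the field names below follow the output quoted above and the exact layout may vary by version:

// Run on the initial-syncing member. Hypothetical inspection snippet --
// adjust the field access if your server version lays out the output differently.
var status = rs.status();
if (status.initialSyncStatus) {
    printjson({
        databasesCloned: status.initialSyncStatus.databases.databasesCloned,
        databasesToClone: status.initialSyncStatus.databases.databasesToClone,
        appliedOps: status.initialSyncStatus.appliedOps
    });
}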



 Comments   
Comment by Wernfried Domscheit [ 01/Jul/21 ]

I am using the logrotate utility; however, at this volume of log messages I would have to run it every 10 minutes or so. By default it runs once per day.

Anyway, I managed to deal with this issue.

Kind Regards
Wernfried

Comment by Wenbin Zhu [ 30/Jun/21 ]

Hi wernfried.domscheit@sunrise.net,

Although the spamming error logs do not contain the exact error code, I think I know what happened. The error code is most likely `NamespaceNotFound` due to the renamed collections. But the log line is actually not an error for the `NamespaceNotFound` case during initial sync (here is an example you can refer to for why some errors can be safely ignored during initial sync).

We have filed SERVER-58164 to print the error code for better debuggability and to exclude the `NamespaceNotFound` error from the error logs. This will avoid confusion over a state that is not an error and prevent the log from filling up the disk.

Some more details:

When applying operations, we group the insert operations on the same namespace and apply them in one shot; if that fails, we fall back to inserting each operation individually. So initial sync is still working, it just prints a lot of error logs. When you stopped the mongod service, initial sync also stopped and printed a log line that shows it was still running without any failure:

{"t":{"$date":"2021-06-17T11:49:56.676+02:00"},"s":"I""c":"INITSYNC", "id":21192,   "ctx":"ReplCoordExtern-0","msg":"Initial Sync Attempt Statistics","attr":{"statistics":{"failedInitialSyncAttempts":0,"maxFailedInitialSyncAttempts":10,"initialSyncStart":{"$date":"2021-06-17T06:14:38.370Z"},"totalInitialSyncElapsedMillis":12918306,"initialSyncAttempts":[],"approxTotalDataSize":1001426297691.0,"approxTotalBytesCopied":998077504441.0,"remainingInitialSyncEstimatedMillis":43344,"appliedOps":3227364,"initialSyncOplogStart":{"$timestamp":{"t":1623910479,"i":5000}},"initialSyncOplogEnd":{"$timestamp":{"t":1623923308,"i":6000}},"totalTimeUnreachableMillis":0, ...

The log line above shows that initial sync didn't encounter any failure and had successfully applied 3227364 operations by the time it was shut down.
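
To illustrate the grouping-and-fallback idea in the same shell syntax used elsewhere in this ticket, here is a minimal sketch. It only mirrors the concept (bulk insert first, per-document retry on failure); it is not the server's internal applier code, and applyGroupedInserts, coll and docs are made-up names:

// Conceptual sketch only -- the real logic lives inside the server's oplog applier.
function applyGroupedInserts(coll, docs) {
    try {
        // First attempt: apply the whole group of inserts in one shot.
        coll.insertMany(docs, { ordered: false });
    } catch (bulkErr) {
        // Fall back: retry each document as a lone insert so that one failing
        // document (or a since-renamed namespace) does not block the rest.
        docs.forEach(function (doc) {
            try {
                coll.insertOne(doc);
            } catch (e) {
                print("lone insert failed: " + e);
            }
        });
    }
}
applyGroupedInserts(db.getCollection("sessions"), [{ t: ISODate() }, { t: ISODate() }]);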

To address your disk space issue, you can use the logrotate utility in Linux together with MongoDB's log rotation functionality to achieve automatic log rotation based on log size, which can protect your disk from sudden log growth.
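
As a minimal sketch (the exact setup is an assumption; adapt it to your environment): with systemLog.logRotate set to "rename" or "reopen" in mongod.conf, a logrotate postrotate hook can send SIGUSR1 to the mongod process, or rotation can be triggered from the shell:

// Ask mongod to rotate its log file right now (same effect as sending SIGUSR1 on Linux).
db.adminCommand({ logRotate: 1 })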

Hope this helps, and feel free to reopen if you still have problems with this. Thanks!

Best,

Wenbin

Comment by Wenbin Zhu [ 29/Jun/21 ]

There is a logging issue when a group insert fails: it does not print the error code, so it's hard to know what actually caused the failure. I filed SERVER-58164 to track this issue and just put up a patch for it. I will try to see if there are other ways to see what's actually happening.

Comment by Eric Sedor [ 17/Jun/21 ]

Thanks wernfried.domscheit@sunrise.net, I'll be passing this on to an appropriate team to investigate further. Some initial notes I have are:

Initial sync starts at 2021-06-17T06:14:38.370Z

{"t":{"$date":"2021-06-17T06:14:38.370Z"},"s":"I","c":"INITSYNC","id":21164,"ctx":"ReplCoordExtern-0","msg":"Starting initial sync attempt","attr":{"initialSyncAttempt":1,"initialSyncMaxAttempts":10}}

pmado01mipmed0.sessions#1 is renamed (and also created via rename) several times throughout the process.

The first "Error applying inserts in bulk error occurs for pmado01mipmed0.sessions#1 at 2021-06-17T09:48:28.422Z:

{"t":{"$date":"2021-06-17T09:48:28.422Z"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-9","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedInserts":{"ts":[{"_timestamp":{"t":1623910479,"i":9021}},...

The first timestamp in groupedInserts translates to 2021-06-17T06:14:39.000Z, about the time of the start of the initial sync.
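
For reference, the t field of a BSON timestamp is seconds since the Unix epoch, so the conversion can be checked directly in the mongo shell:

// 1623910479 is the first timestamp's t value from the groupedInserts log entry above.
new Date(1623910479 * 1000)   // ISODate("2021-06-17T06:14:39Z")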

Comment by Wernfried Domscheit [ 17/Jun/21 ]

Hi

I uploaded the log files as requested.

When the secondary had thrown around 50'000 errors, I stopped the mongod service; otherwise my /var/log/ folder would run out of space. I don't know whether mongod would eventually complete the initial sync, provided there was plenty of space left for the log file.

Kind Regards
Wernfried

Comment by Eric Sedor [ 16/Jun/21 ]

Hi wernfried.domscheit@sunrise.net, we'd be happy to take a look.

I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

For the primary node and an initially syncing secondary, for a time period spanning the start of initial sync and a disk space-related failure, can you please archive (tar or zip) and upload to that link:

  • the full mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Gratefully,
Eric

Comment by Wernfried Domscheit [ 14/Jun/21 ]

Here are some more complete logs:

$ cat mongod-INITSYNC-REPL.log |  while read line; do echo "$line" | cut -c-200 ; done
{"t":{"$date":"2021-06-14T08:05:41.262+02:00"},"s":"I","c":"REPL","id":21311,"ctx":"initandlisten","msg":"Did not find local initialized voted for document at startup"}
{"t":{"$date":"2021-06-14T08:05:41.262+02:00"},"s":"I","c":"REPL","id":21312,"ctx":"initandlisten","msg":"Did not find local Rollback ID document at startup. Creating one"}
{"t":{"$date":"2021-06-14T08:05:41.266+02:00"},"s":"I","c":"REPL","id":21531,"ctx":"initandlisten","msg":"Initialized the rollback ID","attr":{"rbid":1}}
{"t":{"$date":"2021-06-14T08:05:41.266+02:00"},"s":"I","c":"REPL","id":21313,"ctx":"initandlisten","msg":"Did not find local replica set configuration document at startup","attr":{"error":{"code":47,"
{"t":{"$date":"2021-06-14T08:05:41.267+02:00"},"s":"I","c":"REPL","id":40440,"ctx":"initandlisten","msg":"Starting the TopologyVersionObserver"}
{"t":{"$date":"2021-06-14T08:05:41.267+02:00"},"s":"I","c":"REPL","id":40445,"ctx":"TopologyVersionObserver","msg":"Started TopologyVersionObserver"}
{"t":{"$date":"2021-06-14T08:05:41.805+02:00"},"s":"I","c":"REPL","id":21400,"ctx":"conn14","msg":"Scheduling heartbeat to fetch a new config since we are not a member of our current config","attr":{"
{"t":{"$date":"2021-06-14T08:05:41.986+02:00"},"s":"I","c":"REPL","id":21392,"ctx":"ReplCoord-0","msg":"New replica set config in use","attr":{"config":{"_id":"shard_04","version":4,"term":3,"protocol
{"t":{"$date":"2021-06-14T08:05:41.986+02:00"},"s":"I","c":"REPL","id":21393,"ctx":"ReplCoord-0","msg":"Found self in config","attr":{"hostAndPort":"d-mipmdb-sh1-04.swi.srse.net:27018"}}
{"t":{"$date":"2021-06-14T08:05:41.986+02:00"},"s":"I","c":"REPL","id":21358,"ctx":"ReplCoord-0","msg":"Replica set state transition","attr":{"newState":"STARTUP2","oldState":"STARTUP"}}
{"t":{"$date":"2021-06-14T08:05:41.986+02:00"},"s":"I","c":"REPL","id":21306,"ctx":"ReplCoord-0","msg":"Starting replication storage threads"}
{"t":{"$date":"2021-06-14T08:05:41.987+02:00"},"s":"I","c":"REPL","id":21215,"ctx":"ReplCoord-3","msg":"Member is in new state","attr":{"hostAndPort":"d-mipmdb-sh2-04.swi.srse.net:27018","newState":"P
{"t":{"$date":"2021-06-14T08:05:41.995+02:00"},"s":"I","c":"INITSYNC","id":21164,"ctx":"ReplCoordExtern-0","msg":"Starting initial sync attempt","attr":{"initialSyncAttempt":1,"initialSyncMaxAttempts"
{"t":{"$date":"2021-06-14T08:05:42.000+02:00"},"s":"I","c":"REPL","id":21783,"ctx":"ReplCoordExtern-0","msg":"Waiting for pings from other members before syncing","attr":{"pingsNeeded":3}}
{"t":{"$date":"2021-06-14T08:05:42.060+02:00"},"s":"I","c":"REPL","id":21215,"ctx":"ReplCoord-2","msg":"Member is in new state","attr":{"hostAndPort":"d-mipmdb-arb-01.swi.srse.net:28004","newState":"A
{"t":{"$date":"2021-06-14T08:05:43.000+02:00"},"s":"I","c":"REPL","id":3873117,"ctx":"ReplCoordExtern-1","msg":"Choosing primary as sync source","attr":{"primary":"d-mipmdb-sh2-04.swi.srse.net:27018"}
{"t":{"$date":"2021-06-14T08:05:43.000+02:00"},"s":"I","c":"INITSYNC","id":21173,"ctx":"ReplCoordExtern-1","msg":"Initial syncer oplog truncation finished","attr":{"durationMillis":0}}
{"t":{"$date":"2021-06-14T08:05:43.000+02:00"},"s":"I","c":"REPL","id":21251,"ctx":"ReplCoordExtern-1","msg":"Creating replication oplog","attr":{"oplogSizeMB":51200}}
{"t":{"$date":"2021-06-14T08:05:43.052+02:00"},"s":"I","c":"REPL","id":21754,"ctx":"ReplCoordExtern-1","msg":"dropReplicatedDatabases - dropping databases","attr":{"numDatabases":1}}
{"t":{"$date":"2021-06-14T08:05:43.052+02:00"},"s":"I","c":"REPL","id":21756,"ctx":"ReplCoordExtern-1","msg":"dropReplicatedDatabases - dropped databases","attr":{"numDatabases":1}}
{"t":{"$date":"2021-06-14T08:05:43.303+02:00"},"s":"I","c":"REPL","id":4855903,"ctx":"ReplCoordExtern-1","msg":"Oplog fetcher retrying because FCV has changed"}
{"t":{"$date":"2021-06-14T11:54:23.518+02:00"},"s":"I","c":"INITSYNC","id":21183,"ctx":"ReplCoordExtern-0","msg":"Finished cloning data. Beginning oplog replay","attr":{"databaseClonerFinishStatus":"O
{"t":{"$date":"2021-06-14T11:54:23.615+02:00"},"s":"I","c":"INITSYNC","id":21195,"ctx":"ReplCoordExtern-3","msg":"Writing to the oplog and applying operations until stopTimestamp before initial sync c
{"t":{"$date":"2021-06-14T11:54:24.018+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-6","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:54:24.019+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-2","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:54:24.019+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-7","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:54:24.019+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-5","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:54:24.019+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-4","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:54:24.019+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-10","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIn
{"t":{"$date":"2021-06-14T11:54:24.020+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-3","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
// many more logs like this
{"t":{"$date":"2021-06-14T11:55:25.604+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-0","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.605+02:00"},"s":"I","c":"REPL","id":40447,"ctx":"TopologyVersionObserver","msg":"Stopped TopologyVersionObserver"}
{"t":{"$date":"2021-06-14T11:55:25.606+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-2","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.606+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-14","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIn
{"t":{"$date":"2021-06-14T11:55:25.610+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-4","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.611+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-8","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.612+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-5","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.613+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-15","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIn
{"t":{"$date":"2021-06-14T11:55:25.613+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-7","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.614+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-1","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.616+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-12","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIn
{"t":{"$date":"2021-06-14T11:55:25.616+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-2","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.617+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-14","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIn
{"t":{"$date":"2021-06-14T11:55:25.619+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-5","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.621+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-2","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.621+02:00"},"s":"E","c":"REPL","id":21204,"ctx":"ReplWriterWorker-6","msg":"Error applying inserts in bulk. Trying first insert as a lone insert","attr":{"groupedIns
{"t":{"$date":"2021-06-14T11:55:25.622+02:00"},"s":"I","c":"REPL","id":4784909,"ctx":"SignalHandler","msg":"Shutting down the ReplicationCoordinator"}
{"t":{"$date":"2021-06-14T11:55:25.622+02:00"},"s":"I","c":"REPL","id":21328,"ctx":"SignalHandler","msg":"Shutting down replication subsystems"}
{"t":{"$date":"2021-06-14T11:55:25.624+02:00"},"s":"E","c":"INITSYNC","id":21199,"ctx":"ReplCoordExtern-3","msg":"Failed to apply batch","attr":{"error":"CallbackCanceled: error applying batch: initia
{"t":{"$date":"2021-06-14T11:55:25.625+02:00"},"s":"I","c":"INITSYNC","id":21181,"ctx":"ReplCoordExtern-1","msg":"Finished fetching oplog during initial sync","attr":{"oplogFetcherFinishStatus":"Callb
{"t":{"$date":"2021-06-14T11:55:25.625+02:00"},"s":"I","c":"INITSYNC","id":21191,"ctx":"ReplCoordExtern-1","msg":"Initial sync attempt finishing up"}
{"t":{"$date":"2021-06-14T11:55:25.625+02:00"},"s":"I","c":"INITSYNC","id":21192,"ctx":"ReplCoordExtern-1","msg":"Initial Sync Attempt Statistics","attr":{"statistics":{"failedInitialSyncAttempts":0,"
{"t":{"$date":"2021-06-14T11:55:25.625+02:00"},"s":"E","c":"INITSYNC","id":21200,"ctx":"ReplCoordExtern-1","msg":"Initial sync attempt failed","attr":{"attemptsLeft":9,"error":"CallbackCanceled: error
{"t":{"$date":"2021-06-14T11:55:25.907+02:00"},"s":"I","c":"REPL","id":21324,"ctx":"ReplCoordExtern-1","msg":"Initial Sync has been cancelled","attr":{"error":{"code":90,"codeName":"CallbackCanceled",
{"t":{"$date":"2021-06-14T11:55:25.910+02:00"},"s":"I","c":"REPL","id":21307,"ctx":"SignalHandler","msg":"Stopping replication storage threads"}
{"t":{"$date":"2021-06-14T11:55:25.916+02:00"},"s":"I","c":"REPL","id":4784911,"ctx":"SignalHandler","msg":"Enqueuing the ReplicationStateTransitionLock for shutdown"}
{"t":{"$date":"2021-06-14T11:55:25.916+02:00"},"s":"I","c":"REPL","id":4784914,"ctx":"SignalHandler","msg":"Acquiring the ReplicationStateTransitionLock for shutdown"}
{"t":{"$date":"2021-06-14T11:55:25.916+02:00"},"s":"I","c":"REPL","id":4784916,"ctx":"SignalHandler","msg":"Reacquiring the ReplicationStateTransitionLock for shutdown"}
{"t":{"$date":"2021-06-14T11:55:25.916+02:00"},"s":"I","c":"REPL","id":4784917,"ctx":"SignalHandler","msg":"Attempting to mark clean shutdown"}
{"t":{"$date":"2021-06-14T11:55:25.919+02:00"},"s":"I","c":"REPL","id":4784920,"ctx":"SignalHandler","msg":"Shutting down the LogicalTimeValidator"}

It also confirms that this problem is reproducible.

[^mongod-INITSYNC-REPL.log]

Comment by Wernfried Domscheit [ 14/Jun/21 ]

Hi dmitry.agranat 

No, there are no capped collections in my application.

Comment by Dmitry Agranat [ 14/Jun/21 ]

Hi wernfried.domscheit@sunrise.net, are any of the collections in question capped collections?
