[SERVER-16724] Startup time under WiredTiger is extremely slow and gets worse as database grows Created: 05/Jan/15  Updated: 15/Jan/15  Resolved: 13/Jan/15

Status: Closed
Project: Core Server
Component/s: Performance, Storage
Affects Version/s: 2.8.0-rc3
Fix Version/s: 2.8.0-rc5

Type: Bug Priority: Major - P3
Reporter: Nathan Arthur Assignee: Mark Benvenuto
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File additional_startup_time_logs.txt     Text File db_details_commands.txt     Text File db_details_output.txt     Text File mongodb_startup_wiredtiger_slow.txt     Text File more_shutdown_logs.txt    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

I did a test of my company's analytics app, using the WiredTiger storage engine, on Windows Server 2012, on a HyperV VM. During the test, I noticed that startup time for MongoDB was getting excessive - 2 hours in one case, 2.5 hours in another. It had been very quick the first few times I started it. I didn't take a lot of time to log the details of what was wrong, but the difference between the various times was only in the amount of data in the database - the more data, the slower the startup time. (Roughly 6MM records -> 2 hours, roughly 7.5MM -> 2.5 hours.)

I don't have stats from those instances, but one of the databases in the instance is a single collection with one record per incoming message, stored as binary data, with a few fields of metadata and a few indexes. I assume it's that database that's causing the problem. (During the entire startup period, the system shows lots of read activity for a single huge file from the mongodb data directory.) The other databases have much lower overall volume/size.

Sometime in the next day or so I'll need to restart it again, with about 20MM records in it. I'll collect more detailed info about it, then.

I've attached log snippets from the two slow startups.



 Comments   
Comment by Nathan Arthur [ 13/Jan/15 ]

Fantastic - thank you!

Comment by Mark Benvenuto [ 13/Jan/15 ]

I believe I was able to create a repro of your bug by doing the following:

  1. Start server with WT
  2. Insert documents into server for a while (I had 10 GB)
  3. Terminate MongoD process. I used kill.exe to terminate the MongoD process abruptly, Control-C would have done a clean shutdown
  4. Start MongoD again, and MongoD startup takes a long time (I stopped it after waiting for 10 minutes).

I fix has been pushed to master, and I believe this will resolve the issue you saw.

Comment by Githook User [ 13/Jan/15 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-16724: fix syncSizeStorer
Branch: master
https://github.com/mongodb/mongo/commit/60dc6bf36715570fbf0287d83c67e60b5f408baa

Comment by Nathan Arthur [ 13/Jan/15 ]

This seems like a plausible explanation, but there shouldn't have been any new collections created during these runs. The server had been started and restarted multiple times, and was even restarted after it already had lots of data, multiple times. The only way I can think that this would make sense is if every single shutdown was failing, and the problem propagated across starts - but that seems unlikely.

Comment by Mark Benvenuto [ 12/Jan/15 ]

In this scenario, due to SERVER-16106, the server was abruptly terminated.

If the server is abruptly terminated during a session when a collection has just been created, then the server does not go through a clean shutdown. On start up, WiredTigerRecordStore::WiredTigerRecordStore counts every record in the collection, and gets the size of each record.

Comment by Nathan Arthur [ 06/Jan/15 ]

I attached a new file with a bunch of shutdown logs. I didn't see any of those messages you listed, in any of them. (Even ones where the subsequent startup was fast.) Are you sure about that logging, on Windows, on RC3?

Note that in the slow startup scenarios, nothing was logged about it doing a recovery - it just said it was starting up, then hung for hours, then started accepting connections.

I also just tried hard-killing the mongod.exe process, and it started up immediately again.

Comment by Mark Benvenuto [ 06/Jan/15 ]

In your attachment "additional_startup_time_logs.txt", do you trim any of the output from "PRIOR SHUTDOWN". I expect something like the following:

2015-01-06T17:00:20.050-0500 I NETWORK  [serviceShutdown] shutdown: going to close sockets...
2015-01-06T17:00:20.050-0500 I STORAGE  [serviceShutdown] WiredTigerKVEngine shutting down
2015-01-06T17:00:20.101-0500 I STORAGE  [serviceShutdown] shutdown: removing fs lock...
2015-01-06T17:00:20.101-0500 I COMMAND  [serviceShutdown] dbexit:  rc: 49

If you did not trim the shutdown log output, this means that Windows completed the shutdown before MongoDB had time to finish cleanly shutting down. This means Windows killed mongod.exe, and WiredTiger then had to go through recovery on restart.

I will work in SERVER-16106 to improve our Shutdown handler, and ensure WaitToKillServiceTimeout is set correctly.

Comment by Nathan Arthur [ 06/Jan/15 ]

...and of course now it starts up essentially instantly. Even after multiple restarts. I have no idea why.

There have been NO configuration changes to this host; the only difference is a lot more data imported (via application code), and a lot more time passed.

I found a third log (from before) where startup time was around 4 hours. I don't remember the database size at the time, but it would have been correspondingly larger then the earlier examples. I'll attach that, as well as the log snippit from the prior shutdown (which was perfectly normal).

Comment by Nathan Arthur [ 06/Jan/15 ]

I'm about to restart the server again. I collected a bunch of stats, which I'll upload as attachments (one with commands, one with the output). I'll report back later with the total restart time.

Comment by Daniel Pasette (Inactive) [ 06/Jan/15 ]

Hi Nathan, thanks for the report. Can you run db.stats() and db.collection.stats() (substituting your collection name here) and add to this ticket?

Also, how many total databases and collections do you have on this instance?

Comment by Nathan Arthur [ 05/Jan/15 ]

I just noticed from the logs - apparently this was actually RC3, not RC0 as I tagged it.

Generated at Thu Feb 08 03:42:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.