[SERVER-18887] WiredTiger instances taking 2 hours to startup and transition to Secondary state Created: 09/Jun/15  Updated: 07/Apr/23  Resolved: 03/Aug/15

Status: Closed
Project: Core Server
Component/s: Performance, WiredTiger
Affects Version/s: 3.0.0, 3.0.1, 3.0.2, 3.0.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Roger Dawson Assignee: Mark Benvenuto
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows 2012 R2 VM running in Azure, 7GB RAM


Attachments: Text File datadircontents.txt, Text File dircontents.txt, Text File journaldircontents.txt, Zip Archive mongodb_log_000001.zip, Zip Archive mongodb_log_000002.zip, Text File mongowtstartup.log
Issue Links:
Duplicate
duplicates SERVER-19052 Remove sizeStorer recalculations at s... Closed
Related
related to DOCS-5699 Update Windows Azure disclaimer to be... Closed
Operating System: Windows
Steps To Reproduce:

Restart the Windows Server that MongoDB is running on. On restart the system takes about 2 hours to reach the 'Secondary' state, during which time it appears to be scanning every collection and index file (~2000 for us).

Sprint: Platform 7 08/10/15
Participants:

 Description   

WiredTiger instances are taking a long time to start up to the point where the node becomes a Secondary or Primary. The log output up to the point where the long delay occurs, before the transition to the Secondary/Primary state, is shown below:

2015-06-09T00:41:34.920+0000 I CONTROL ***** SERVER RESTARTED *****
2015-06-09T00:41:35.206+0000 I CONTROL Trying to start Windows service 'MongoDB'
2015-06-09T00:41:35.224+0000 I STORAGE Service running
2015-06-09T00:41:35.254+0000 W - [initandlisten] Detected unclean shutdown - m:/mongodb/data\mongod.lock is not empty.
2015-06-09T00:41:35.254+0000 W STORAGE [initandlisten] Recovering data from the last clean checkpoint.
2015-06-09T00:41:35.255+0000 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),

During this time MongoDB appears to be reading every file in our database; we guess it's some kind of file integrity check. This might be normal following an unclean shutdown, in which case we need help with the shutdown issue. We've noticed a few Jira issues relating to abnormal shutdowns caused by the Windows Service manager and timeouts; they appear to be marked as fixed, but our system keeps hitting this problem. I have tried shutting down the service from a command prompt as well as from the service manager, and via normal OS restarts, but every one results in a long startup time. I'm not certain they all produced the 'unclean shutdown' log message, though.

We are running a 5-node replica set: 2 MMAP, 2 WT and 1 arbiter. Our database size is ~280GB (MMAP) / ~180GB (WT). We were planning to transition to WiredTiger, but this and a few other problems are showstoppers for us at the moment.



 Comments   
Comment by Ramon Fernandez Marina [ 03/Aug/15 ]

rogerdawson7, SERVER-19052 should have addressed the root cause of the problem you described in this ticket, so I'm going to mark this ticket as a duplicate of SERVER-19052. The fix is included in the 3.0.5 release, which is now available for download.

Regards,
Ramón.

Comment by Mark Benvenuto [ 22/Jun/15 ]

The Azure "Host Cache Preference" recommendation is for the MMapV1 storage engine. It is to workaround to an Azure Storage driver issue specific to memory mapped files. The Azure issue does not affect the WiredTiger storage engine.

In terms of background flush performance, SERVER-12401 describes the issue. Please note that that ticket was written before SERVER-13681 was fixed, which mitigated the related performance impact of slow flushes.

In terms of your disk performance, from the data you gathered I can tell you the following:

  1. There was an average of 42.455 Disk Reads/sec.
  2. There was an average of 25 ms per disk read (i.e., the Avg. Disk sec/Read counter).
  3. Therefore, on average the system spent 1.061 seconds of read time per elapsed second (42.455 reads/sec x 0.025 sec/read), which means reads were only very slightly parallelized.
  4. This is consistent with the Avg. Disk Read Queue Length of 1.063 that we also saw (see the quick check below).
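As a quick sanity check on those numbers (just a sketch; the values are copied from the list above and the relationship is simply Little's law):

// Little's law: average concurrency = arrival rate x average service time.
var readsPerSec = 42.455;   // Disk Reads/sec
var secPerRead = 0.025;     // Avg. Disk sec/Read (25 ms)
print(readsPerSec * secPerRead);  // ~1.06, matching the observed Avg. Disk Read Queue Length of 1.063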

We can see from this cursory analysis that the slow start was caused by high-latency disk access. We have fixed the underlying cause in SERVER-19052. The startup pass that reads in the collection info for every collection with < 10,000 documents also runs with no parallelization, which makes this startup job particularly painful in this case.

Comment by Roger Dawson [ 22/Jun/15 ]

Just for reference, the latest restart took 2 hours 1 minute, suggesting roughly a 20% reduction in startup time from using an Azure Standard VM instance instead of a Basic one.

Comment by Roger Dawson [ 22/Jun/15 ]

Hi Dan, Mark,

really sorry, but I didn't realise the performance test Dan asked me to perform should have been run while the MongoDB service was starting up. I've repeated the tests while the system is in the starting-up state (i.e. while it's scanning all the files). The zipped-up log file is attached.

Regarding your questions:

1) The instance is currently Basic tier. For the most recent performance test (attached) I changed the instance to a Standard one to see whether the downtime decreases, and thus how much of an impact the increase in IOPS from 300 to 500 has.

2) The Azure 'Host Cache Preference' for the MongoDB data drive is currently set to read/write, as per the Azure section of the production notes (http://docs.mongodb.org/manual/administration/production-notes/). Could you confirm whether this is the preferred setting for WiredTiger AND MMAP? The material we have read about this setting seems to recommend setting the cache to 'None' for database storage.

3) The documents are generally a few KB in size. Most are small (1-4 KB), but there are 200 or so collections containing documents as large as 18 KB.

We understand the background flushing differences between Windows and Linux now; we came across https://jira.mongodb.org/browse/SERVER-12401 a few days ago, so things became clearer (maybe worth adding to the MongoDB documentation?).

One other observation which might help: we've created another Linux VM, this time using 4 striped disks, and we are seeing no better performance regarding background flushing (granted this is MMAP, so slightly off-topic). Does this suggest that the disk latencies are negating the benefits of striping?

Many Thanks,

Roger

Comment by Mark Benvenuto [ 19/Jun/15 ]

Roger, I examined the log data with Dan. Assuming this snapshot was captured during startup, we noticed that in the 15-minute snapshot the system read ~54 MB and wrote ~533 MB of data. Neither of these is a large value, and it does not look like you are I/O bound. In terms of disk latency, we saw average times of 20 ms, which is not great but is probably normal for Azure Storage.

A few more questions about your setup:

  1. You say above that you are using an A3; is this from the Standard or Basic tier?
  2. Do the disks have any caching enabled? (None, Read-only or Read/Write)
  3. For the collections with < 10,000 documents, what is the average size of the documents: a few KB, a few MB?

In terms of background flush performance, the Linux OS issues asynchronous I/Os during flushes of memory-mapped files, while the Windows OS issues synchronous I/Os. It is not a difference in MongoDB but in the OS.

Comment by Daniel Pasette (Inactive) [ 19/Jun/15 ]

I've created a ticket to track removing the sizeStorer update work at startup.

Comment by Daniel Pasette (Inactive) [ 19/Jun/15 ]

Mark, can you help me analyze this?

Comment by Roger Dawson [ 19/Jun/15 ]

Hello Dan,

here is the log output as requested,

Many Thanks,

Roger

Comment by Roger Dawson [ 18/Jun/15 ]

Hi Dan,

unfortunately the Linux VM we have created is running MMAP. We might be spinning up new Linux VMs to replace our Windows ones, though, so we could configure one for WiredTiger. What we have noticed is considerably better background flush performance from the Linux MMAP VM than from the Windows MMAP VM, although we can't understand why, as the Linux VM is an A2 instance with a single disk (500 IOPS) and the Windows VM is an A3 instance with 4 striped disks (500 IOPS each). Windows takes ~20 seconds whereas Linux takes ~4 seconds.

I'll run the command you mention when I get a minute tomorrow morning and get the results back to you.

Thanks,

Roger

Comment by Daniel Pasette (Inactive) [ 18/Jun/15 ]

Thanks for sharing that info, Roger. The startup time is dominated by MongoDB scanning your collections that contain fewer than 10,000 documents. MongoDB scans these collections at startup to verify that the count and size statistics are correct for small collections. Under normal circumstances this should be very fast. I set up a server on Linux with 1200 collections and document counts matching your breakdown above, using small documents of around 50 bytes. The startup time on my Linux server with SSDs is around 4 seconds. The size of the documents could be a factor, but it could also be something to do with Windows or Azure storage.

I'm seeing in the logfile you attached that a single scan is taking several seconds, up to around 30 seconds, which really doesn't make sense to me:

2015-06-17T13:34:01.015+0000 D STORAGE  [initandlisten] doing scan of collection OurCustomerDBV3.RC_PPUOVR_21_375630 to get info
2015-06-17T13:34:35.906+0000 D STORAGE  [initandlisten] looking up metadata for: OurCustomerDBV3.RC_PPUOVR_21_375701 @ RecordId(276)
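
Those two log lines are roughly 35 seconds apart. As a rough back-of-the-envelope check (using the ~600 small collections figure from your breakdown, and an assumed average scan time that is well within the several-second range seen here), per-scan latencies like this are enough to account for a roughly 2-hour startup:

// Rough check: ~600 collections under 10,000 documents (about half of ~1200, per the breakdown)
// at an assumed average of 12 seconds per scan (well within the "several seconds, up to ~30" seen above).
var smallCollections = 600;
var secondsPerScan = 12;
print((smallCollections * secondsPerScan) / 3600 + " hours");  // prints "2 hours", matching the observed startup time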

Have you seen the same issue on your Linux VM?

If you have time, it would be helpful to see a performance counter trace to understand why it is running so slowly. Run the following commands; they will collect a trace file of 1-second samples for 15 minutes and then stop. After it stops, attach the output to the ticket so we can understand how much write traffic there is and whether there is particularly high disk latency from your storage system in the first place.

logman delete mongodb_log
logman create counter mongodb_log -c "\Processor(_Total)\% Processor Time" "\Processor(_Total)\% Privileged Time" "\PhysicalDisk(_Total)\*" -f bin -rf 00:15:00 -si 1 -ow
logman start mongodb_log

Later, run "logman query mongodb_log" to find the output file(s), then zip them and attach them to this Jira ticket.

Comment by Roger Dawson [ 18/Jun/15 ]

Hi Dan,

here is a very rough breakdown. We have approximately 150-200 of each of the collection types described below:

Capped collection ~ 3000 docs
Large overview docs ~ 3000 docs max, 2500 docs average
Raw data 1 ~ 300,000 docs max, 200,000 docs average
Raw data 2 ~ 3,000,000 docs max, 2,200,000 docs average
New raw data 1 ~ 3000 docs
New raw data 2 ~ 50,000 docs max, 30,000 docs average

The two new raw data collection types were created to improve write performance, as the 'raw data 2' collections are approaching 1.4 GB in data size and 300 MB in index size.

So to answer your question, I would say that of the 1200 or so collections we have, about 50% contain fewer than 10,000 documents.

Thanks, Roger

Comment by Daniel Pasette (Inactive) [ 17/Jun/15 ]

I think I know what the problem is. Would it be possible to tell us how many of your collections contain fewer than 10,000 documents?
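
If it helps, something along these lines run from the mongo shell should give a rough count (just a sketch; the database name is taken from the log lines in this ticket and is an assumption, so substitute your own):

// Count how many collections in a database hold fewer than 10,000 documents.
rs.slaveOk();                                       // allow reads if running this on a secondary
var target = db.getSiblingDB("OurCustomerDBV3");    // database name assumed; adjust as needed
var small = 0, total = 0;
target.getCollectionNames().forEach(function (name) {
    total += 1;
    if (target.getCollection(name).count() < 10000) {
        small += 1;
    }
});
print(small + " of " + total + " collections have fewer than 10,000 documents");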

Comment by Roger Dawson [ 17/Jun/15 ]

Hi Dan,

I've attached the directory listings for the data directory and journal directory following what I believe was a slow but clean shutdown. I carried out the shutdown using the db.shutdownServer command on the WiredTiger secondary, as this seems to be the only reliable way to get no errors on shutdown. You'll see I also attached the log output from the node as it was shut down and then restarted. It took approximately 2.5 hours to get up and running again. I'm guessing the extra 30 minutes is because we have more collections now: we started logging to new collections because some of the existing ones were very large and we thought they might be responsible for the performance issues.

Thanks, Roger

Comment by Roger Dawson [ 17/Jun/15 ]

Hi Dan, okay, I'll do that, but it might take a while. We have added a Linux VM to the replica set as we want to see if it is any more performant/reliable than Windows, and it's still replicating at the moment (it has been building indexes for the last 24 hours, so hopefully it is on the last leg). Once this is completed we can risk taking the WiredTiger node out of the replica set. We are down to a primary (MMAP), a secondary (WT) and an arbiter, as one of our MMAP nodes restarted following Windows updates and now irrecoverably fails on startup (it's happened twice now), and the Linux box currently replicating has replaced one of the WiredTiger nodes.

I'm not sure if it's related, but we're experiencing a lot of issues with the whole replica set these days, such as nodes failing to start following unclean shutdowns (which seem to occur almost every time, although it's only the MMAP nodes that have actually crashed irrecoverably), very high background flush times, etc. We've tried a lot of things to improve performance, and it has made our processing considerably quicker, but none of it seems to be helping the startup times, high background flush times, etc. Most of these issues seemed to begin when we transitioned to 3.0, but that could be a coincidence. We wanted to run version 3.0 because we found we needed very large page file sizes (~50GB at the moment) and thought WiredTiger would solve this as well as be faster.

Comment by Daniel Pasette (Inactive) [ 17/Jun/15 ]

Hi Roger, I appreciate your help here. Sorry it wasn't clear in the last request, but in addition to the dbpath contents, we also need the contents of the journal directory, but after a clean shutdown and before restarting the server so that we can verify that the storage engine is not attempting to recover.

Also, when you do restart your mongod, can you start it with verbose logging and attach the output? Either run mongod --storageEngine wiredTiger -v from the command line, or set the following in the config file:

systemLog:
   verbosity: 1

Thanks,
Dan

Comment by Roger Dawson [ 17/Jun/15 ]

Directory contents of our WiredTiger secondary, as requested.

Comment by Roger Dawson [ 16/Jun/15 ]

Hi Ramon,

I'll try and get the directory listing you requested. Yes, we have had quite a few unclean shutdowns on this machine; since we moved to version 3.0 we've struggled to get a clean shutdown on any of our machines. However, I did manage what I believe was a clean shutdown recently, as indicated by the log above, when I used the db.shutdownServer command, but the system still took 2 hours to start up, as illustrated by the last two log entries.

Thanks,

Roger

Comment by Ramon Fernandez Marina [ 15/Jun/15 ]

rogerdawson7, can you please send us a directory listing of your dbpath? Also, have you had any unclean shutdowns on this machine? Besides a slow storage layer (WT needs to open over 2K files on startup in your case), the other thing that may cause this behavior is a recovery operation, which can happen after an unclean shutdown.

Thanks,
Ramón.

Comment by Roger Dawson [ 09/Jun/15 ]

Just to add to this, it doesn't look to be limited to unclean shutdowns. I had to shut my WiredTiger secondary down, so I logged into MongoDB (via the shell) and executed db.shutdownServer with a 120-second timeout in an attempt to shut down gracefully, following comments I've seen in the forums (the invocation is sketched after the log below). The operation finished after a few seconds and didn't appear to have any issues. Upon restart my server still took a couple of hours to transition to the Secondary state. The log output is shown below; please note the time difference between the last two log entries:

15-06-09T13:26:32.981+0000 I COMMAND [conn4501] terminating, shutdown command received
2015-06-09T13:26:33.233+0000 I STORAGE [conn4512] got request after shutdown()
2015-06-09T13:26:33.359+0000 I STORAGE [WiredTigerRecordStoreThread for local.oplog.rs] shutting down
2015-06-09T13:26:33.462+0000 I STORAGE [conn4513] got request after shutdown()
2015-06-09T13:26:33.887+0000 I STORAGE [conn4510] got request after shutdown()
2015-06-09T13:26:34.580+0000 I STORAGE [conn4413] got request after shutdown()
2015-06-09T13:26:34.779+0000 I STORAGE [conn3125] got request after shutdown()
2015-06-09T13:26:37.237+0000 W NETWORK [ReplExecNetThread-107] Failed to connect to 10.0.1.70:27017 after 5000 milliseconds, giving up.
2015-06-09T13:26:37.237+0000 I REPL [serviceStopWorker] Stopping replication applier threads
2015-06-09T13:26:37.239+0000 I CONTROL [serviceStopWorker] now exiting
2015-06-09T13:26:37.239+0000 I NETWORK [serviceStopWorker] shutdown: going to close listening sockets...
2015-06-09T13:26:37.239+0000 I NETWORK [serviceStopWorker] closing listening socket: 308
2015-06-09T13:26:37.240+0000 I NETWORK [serviceStopWorker] shutdown: going to flush diaglog...
2015-06-09T13:26:37.240+0000 I NETWORK [serviceStopWorker] shutdown: going to close sockets...
2015-06-09T13:26:37.266+0000 I STORAGE [serviceStopWorker] WiredTigerKVEngine shutting down
2015-06-09T13:26:47.991+0000 I CONTROL [initandlisten] Service Stop is waiting for storage engine to finish shutdown
2015-06-09T13:26:54.694+0000 I STORAGE [serviceStopWorker] shutdown: removing fs lock...
2015-06-09T13:26:54.700+0000 I CONTROL [serviceStopWorker] dbexit: rc: 49
2015-06-09T13:34:16.290+0000 I CONTROL ***** SERVER RESTARTED *****
2015-06-09T13:34:16.885+0000 I CONTROL Trying to start Windows service 'MongoDB'
2015-06-09T13:34:16.896+0000 I STORAGE Service running
2015-06-09T13:34:17.235+0000 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-06-09T15:23:02.179+0000 I STORAGE [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
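
For reference, the shutdown above was issued roughly like this from the shell (reconstructed from my description, so treat the exact options as an assumption):

// Graceful shutdown with a 120-second timeout; shutdownServer is run against the admin database.
db.getSiblingDB("admin").shutdownServer({ timeoutSecs: 120 });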
