[SERVER-30238] High disk usage and query blocking on database drops Created: 20/Jul/17  Updated: 24/Aug/17  Resolved: 31/Jul/17

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 3.4.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Henderson Assignee: Bruce Lucas (Inactive)
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-30426 dropDatabase very slow due to repeate... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:
  • Drop database (WT, 3.4.6)
  • Queries and updates stall.

 Description   

This case is being raised from WT-3207 - in the latest 3.4.6 release, the bottleneck appears to have moved from CPU to disk.

Image 1 - the database is dropped (09:02:47) and queries and updates have dropped off.

Image 2 - Recovery - at 09:04:45 the primary recovers and starts processing requests. Note the listDatabases command that only returned after a long period of time.

A new run with diagnostic.data to follow - can I have a secure link to submit it to?



 Comments   
Comment by Richard Yates [ 31/Jul/17 ]

Thanks Bruce, I agree it's best to start a new case for the specific issue you've identified. Feel free to close this one down, Dave and I will keep an eye on the new ticket. Appreciate the workaround too, I'll give that a try.

– Richard.

Comment by Bruce Lucas (Inactive) [ 31/Jul/17 ]

Thanks Richard, that indeed has been helpful. The perf data has allowed us to pinpoint where we are spending the 30 seconds of CPU time that makes the dropDatabase operation slow in this case. I've opened SERVER-30426 to focus on that specific problem as we have covered a couple of different issues in this ticket. Please follow that ticket for further updates.

Meanwhile it is possible that you can mitigate the impact of this issue by dropping the individual collections in the database before dropping the database. The dropCollection command only takes a database lock, whereas the dropDatabase command takes a global lock, so the impact of the individual dropCollection commands will be limited primarily to the database that you are dropping.
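
For example, something along these lines from the shell should do it (a sketch only - "mydb" and the unauthenticated mongo invocation are placeholders, so adapt for your deployment):

# drop each collection individually (database lock only), then drop the now-empty database
mongo mydb --quiet --eval 'db.getCollectionNames().forEach(function (c) { db.getCollection(c).drop(); }); db.dropDatabase();'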

Thanks for bringing this issue to our attention. If you have no further questions I'll close this ticket out and we'll continue work on this specific issue on SERVER-30426.

Bruce

Comment by Richard Yates [ 31/Jul/17 ]

Hi Bruce,

Good news, I've managed to reproduce easily with all the metrics collections going. I've uploaded mongoanalysis.tar.gz which contains:

  • A redacted loglevel 2 mongodb log file (many thanks for the pointer to the fruitsalad tool!)
  • Today's mongodb diag files
  • The unprocessed perf.data file + the processed perfstacks.txt (as output by the perf script command from your previous comment)

As requested I started the perf data collection, did a logrotate, turned on level 2 logging, left it for about 15 seconds, then did the dropDatabase. I left the perf data collection running for the full 2 minutes and the dropDatabase + subsequent stall completed within that time.
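
Roughly, the sequence looked like the following sketch (the database name and exact timings here are illustrative rather than the real values):

# capture ~2 minutes of system-wide perf data in the background
perf record -a -g -F 99 -o perf.data sleep 120 &
# rotate the mongod log and raise verbosity to level 2
mongo admin --eval 'db.adminCommand({logRotate: 1}); db.setLogLevel(2)'
# leave ~15 seconds of baseline, then drop the database ("mydb" is a placeholder)
sleep 15
mongo mydb --eval 'db.dropDatabase()'
# wait for perf to finish, then produce the stack text for upload
wait
perf script >perfstacks.txt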

There won't be any other dropDatabase commands in that time so you should be able to see the exact start+completed times in the log file.

Thanks for the continued support, hope that bundle helps with the investigation.
– Richard.

Comment by Bruce Lucas (Inactive) [ 26/Jul/17 ]

Hi Richard,

You can find a tool here that many of our customers have successfully used to redact log files. Let me know if you have any questions.

Given that some work is required to reproduce the problem, I would suggest, if possible, capturing perf data at the same time. Some care is required as the volume of data captured by perf can be large. The following will capture perf data for 2 minutes, then process it into a form useful for our analysis:

perf record -a -g -F 99 sleep 120
perf script >perfstacks.txt

The idea would be to start this, let it run for say 15-30 seconds to capture a baseline, then do the dropDatabase, then let perf continue to run to capture the stall and more baseline.

If for any reason it's not practical to closely coordinate running perf with the dropDatabase, you can break up the perf data into one-minute segments with the following:

while true; do perf record -a -g -F 99 -o perf.data.$(date -u +%FT%TZ) sleep 60; done

Then after running dropDatabase, identify and process the data files that include the dropDatabase plus some surrounding data for baseline comparison:

for fn in ...; do perf script -i $fn >$fn.txt; done

Thank you for your help.
Bruce

Comment by Richard Yates [ 26/Jul/17 ]

Hi Bruce,

I think that would be possible. My concern would be revealing our client's data in slow queries logged during that time, but I will see if turning off slow query logging works without destroying the debug data you're after in the first place. Alternatively we'll have to see about obfuscating the data parts; if you have a handy tool for this (or some regex/grep commands lying about!) that would be perfect.

I'm fresh out of databases to drop so give me a little time to get some data from backups to drop and I'll come back with my findings.

Comment by Bruce Lucas (Inactive) [ 25/Jul/17 ]

Hi Richard,

The diagnostic data indicates that there is no longer a large amount of disk i/o, so removing the "discard" option has indeed addressed that issue.

I've spent some time trying to reproduce the remaining stall but so far have not succeeded. The diagnostic data indicates that ~300 WiredTiger tables (collections and indexes) were removed over the course of ~30 seconds, at an average rate of very close to 10 per second. This makes me suspect that something may be pacing the table drops at 10 per second. Unfortunately the stall has also created a gap in the diagnostic data, so we don't know in detail the timing of events during those 30 seconds.

Would you be in a position to reproduce the problem while running at log level 2, and then upload the log file to the secure upload portal that we provided earlier?
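
For reference, the verbosity can be raised from the shell with something like the following (a sketch, assuming shell access to the primary):

# raise mongod log verbosity to level 2 for the reproduction
mongo admin --eval 'db.setLogLevel(2)'
# restore the default verbosity afterwards
mongo admin --eval 'db.setLogLevel(0)'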

Beyond this we could also collect the following information while you reproduce the problem:

  • perf data: this should have minimal impact on the system, but will only "see" CPU activity and won't see blocked threads. The diagnostic data indicates that there was significant CPU activity during the stall, so this may be sufficient. This can be set up to run for an extended period until the problem reproduces.
  • gdb stack traces: this will require timing the collection of the stack traces to coincide with a stall, but it will give more information as it will show what blocked threads are doing. It will block the mongod process for a few seconds, in addition to the stall.
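
For the gdb option, a typical capture would look something like the sketch below (the pidof lookup and output file name are illustrative only):

# dump stack traces of all mongod threads; this briefly pauses the process
gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > mongod-stacks.txt 2>&1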

Please let us know if either of the above options is possible (assuming the log file isn't sufficient) and we'll provide more details.

Thanks,
Bruce

Comment by Richard Yates [ 24/Jul/17 ]

Hi Bruce,

Bad news I'm afraid. After disabling discard on both primary and secondary nodes in a replica set I deleted another database around the same size as before (we do this on a weekly basis) and it still stalled the primary. The good news is it only stalled it for 30 seconds or so, which is the quickest I've seen, so I think disabling discard has helped.

I've uploaded another metrics.interim file to the secure URL provided above - are you able to take another look?

Thanks,
Richard.

Comment by Bruce Lucas (Inactive) [ 21/Jul/17 ]

Hi Richard,

We will indeed want to add this to our production notes once we have confirmation from you that it was the cause of your issue. I'll put this ticket in "waiting for customer" state pending that confirmation.

Thanks,
Bruce

Comment by Richard Yates [ 21/Jul/17 ]

Hi Bruce,

I'm one of David's colleagues.

Thanks for the pointers around discard + SSD TRIM. I've done some more testing, and forcing TRIM to run manually via fstrim caused identical behavior to what we've been reporting, so it definitely looks like you're onto something.

We'll disable discard over the coming days (we have multiple replicasets so need a little time to demote primaries, remount etc) and let you know if that appears to be the cause.

Since it was AWS' recommendation to enable discard/TRIM support for the instance volumes we're using (although I can't find that reference now, so they may have revised their advice!), can I suggest you amend your production notes to include a specific mention of not using the discard mount option? We'll now plan to run fstrim manually during periodic maintenance windows to avoid any production impact, and would probably recommend others do the same.
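
For example, something along these lines during the maintenance window (the mount point matches the mount output reported later in this ticket; the cron schedule is purely illustrative):

# trim unused blocks on the data filesystem on demand, during a maintenance window
fstrim -v /srv/mongodb
# or schedule it weekly via cron, e.g. Sundays at 03:00:
# 0 3 * * 0 /sbin/fstrim /srv/mongodb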

David also found this relevant thread from mongodb-user, which sounds similar: https://groups.google.com/forum/#!topic/mongodb-user/Mj0x6m-02Ms

Comment by Bruce Lucas (Inactive) [ 21/Jul/17 ]

Hi David,

The diagnostic data that you uploaded showed about 31 GB of data being written to the device during the drop operation, which is about the compressed size of the db, that is its size on disk, as you reported. However the metrics showed that the storage engine was not doing any i/o.

I suspect that the "discard" mount option may be responsible for the i/o. This option causes trim commands to be issued to the device to discard unused blocks, and per the referenced article can have a severe performance impact. I believe this is causing the file removes involved in the dropDatabase operation to be very slow and generate a lot of i/o.

Would it be possible to re-mount the filesystem without the "discard" option and test whether performance improves for dropDatabase?
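
For illustration, the revised fstab entry might look like the line below (the fields are based on the mount output reported in this ticket and are only a sketch; applying it requires a remount, ideally with mongod stopped or the node stepped down first):

# /etc/fstab entry with the "discard" option removed (illustrative)
/dev/nvme0n1  /srv/mongodb  xfs  rw,noatime  0  0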

Thanks,
Bruce

Comment by David Henderson [ 21/Jul/17 ]

/dev/nvme0n1 on /srv/mongodb type xfs (rw,noatime,discard)
On i3.2xlarge on AWS, Ubuntu 14.04.5 LTS, kernel: 4.4.0-78-generic

Comment by Bruce Lucas (Inactive) [ 21/Jul/17 ]

Thanks for that info David. Can you tell us what filesystem you are using and what mount options?

Comment by David Henderson [ 21/Jul/17 ]

It was ~26 GB compressed, with around 107 collections and 2 indexes in each collection (_id, plus 1 compound index on _id and "data.c").

Estimated collection sizes:

  • Biggest collection: 26.9 GB (uncompressed), 13.3 million docs
  • 3 more collections between 1 and 2.3 million docs
  • The rest under 1 million docs each

Comment by Bruce Lucas (Inactive) [ 21/Jul/17 ]

Thanks David. Can you tell us about the database being dropped - approximately how big is it (compressed size on disk if you know), and how many collections and indexes?

Comment by David Henderson [ 21/Jul/17 ]

Thanks - I've uploaded the relevant files for a new instance of the issue. It seems that this issue is much easier to trigger than the previous version in https://jira.mongodb.org/browse/WT-3207, which tended to require multiple drops.

Comment by Kelsey Schubert [ 20/Jul/17 ]

Hi david.henderson@triggeredmessaging.com,

I've created a secure upload for you to use to provide the diagnostic.data.

Thank you,
Thomas
