[SERVER-30238] High disk usage and query blocking on database drops Created: 20/Jul/17 Updated: 24/Aug/17 Resolved: 31/Jul/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance |
| Affects Version/s: | 3.4.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Henderson | Assignee: | Bruce Lucas (Inactive) |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | |
| Participants: | |
| Description |
|
This case is being raised from the attached images. Image 1 - the database is dropped (09:02:47) and queries and updates drop off.
Image 2 - Recovery - at 09:04:45 the primary recovers and starts processing requests. Note the listDatabases call that only returned after a long period of time.
A new run with diagnostic.data to follow - can I have a secure link to submit it to? |
| Comments |
| Comment by Richard Yates [ 31/Jul/17 ] | ||||
|
Thanks Bruce, I agree it's best to start a new case for the specific issue you've identified. Feel free to close this one down, Dave and I will keep an eye on the new ticket. Appreciate the workaround too, I'll give that a try. – Richard. | ||||
| Comment by Bruce Lucas (Inactive) [ 31/Jul/17 ] | ||||
|
Thanks Richard, that has indeed been helpful. The perf data has allowed us to pinpoint where we are spending the 30 seconds of CPU time that makes the dropDatabase operation slow in this case, and I've opened a new ticket to track that specific issue. Meanwhile, you may be able to mitigate the impact of this issue by dropping the individual collections in the database before dropping the database itself (a sketch follows below). The dropCollection command only takes a database lock, whereas the dropDatabase command takes a global lock, so the impact of the individual dropCollection commands will be limited primarily to the database that you are dropping. Thanks for bringing this issue to our attention. If you have no further questions I'll close this ticket out and we'll continue work on this specific issue in the new ticket. Bruce | ||||
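As a rough sketch of that workaround (the database name droppable_db is a placeholder; host and auth options are omitted), something like:

```bash
# Sketch: drop each user collection first (dropCollection takes only a database lock),
# then drop the now-mostly-empty database. Database name is a placeholder.
mongo droppable_db --eval '
  db.getCollectionNames().forEach(function (name) {
    if (name.indexOf("system.") !== 0) {   // leave system collections to dropDatabase
      db.getCollection(name).drop();
    }
  });
  db.dropDatabase();
'
```

The final dropDatabase still takes the global lock, but with the collections already removed it should have far less work to do while holding it.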
| Comment by Richard Yates [ 31/Jul/17 ] | ||||
|
Hi Bruce, Good news - I've managed to reproduce it easily with all the metrics collections going. I've uploaded mongoanalysis.tar.gz, which contains the perf output and the level 2 log file from the run.
As requested, I started the perf data collection, did a logrotate, turned on level 2 logging, left it for about 15 seconds and then ran the dropDatabase. I left the perf data going for the full 2 minutes and the dropDatabase plus the subsequent stall completed within that time. There won't be any other dropDatabase commands in that window, so you should be able to see the exact start and completion times in the log file. Thanks for the continued support, hope that bundle helps with the investigation. | ||||
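For reference, the sequence described above, sketched as shell commands (the database name is a placeholder; the exact commands used aren't recorded in this ticket):

```bash
# Sketch of the repro sequence: rotate the log, raise log verbosity, wait for a
# short baseline, then drop the database. Names and timing are illustrative only.
mongo admin --eval 'db.adminCommand({ logRotate: 1 }); db.setLogLevel(2)'
sleep 15
mongo droppable_db --eval 'db.dropDatabase()'
```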
| Comment by Bruce Lucas (Inactive) [ 26/Jul/17 ] | ||||
|
Hi Richard, You can find a tool here that many of our customers have successfully used to redact log files. Let me know if you have any questions. Given that some work is required to reproduce the problem, I would suggest, if possible, capturing perf data at the same time. Some care is required as the volume of data captured by perf can be large. The following will capture perf data for 2 minutes, then process it into a form useful for our analysis:
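The exact commands aren't preserved here, but a capture of this kind typically looks something like the following (sampling rate, output names and the use of sudo are assumptions):

```bash
# Sketch: system-wide profile with call graphs for 2 minutes, then convert to text.
sudo perf record -a -g -F 99 -o perf.data -- sleep 120
sudo perf script -i perf.data > perf-report.txt
```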
The idea would be to start this, let it run for say 15-30 seconds to capture a baseline, then do the dropDatabase, then let perf continue to run to capture the stall and more baseline. If for any reason it's not practical to closely coordinate running perf with the dropDatabase, you can break up the perf data into one-minute segments with the following:
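Again only as a sketch (the loop and file naming are assumptions), the segmented capture could look like:

```bash
# Sketch: record back-to-back one-minute segments, one output file per minute,
# until the loop is interrupted with Ctrl-C.
while true; do
  ts=$(date +%Y%m%d-%H%M%S)
  sudo perf record -a -g -F 99 -o "perf-$ts.data" -- sleep 60
done
```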
Then after running dropDatabase, identify and process the data files that include the dropDatabase plus some surrounding data for baseline comparison:
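For example, assuming the per-minute files from the sketch above, the segments whose timestamps bracket the drop could be processed like this:

```bash
# Sketch: convert the segments covering the dropDatabase (plus one either side
# for baseline) into text form for upload. Narrow the glob to the relevant timestamps.
for f in perf-*.data; do
  sudo perf script -i "$f" > "$f.txt"
done
```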
Thank you for your help. | ||||
| Comment by Richard Yates [ 26/Jul/17 ] | ||||
|
Hi Bruce, I think that would be possible. My concern would be revealing our client's data in slow queries logged during that time, but I will see if turning off slow query logging works without destroying the debug data you're after in the first place. Alternatively we'll have to see about obfuscating the data parts; if you have a handy tool for this (or some regex grep commands lying about!) that would be perfect. I'm fresh out of databases to drop, so give me a little time to get some data from backups to drop and I'll come back with my findings. | ||||
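One way to try the "turning off slow query logging" idea, purely as a sketch (this raises the slow-op threshold so routine slow queries stop appearing in the log; note that a verbose log level would still log operations):

```bash
# Sketch: keep the profiler off (level 0) but raise slowms very high so slow
# operations are no longer written to the mongod log. Threshold value is arbitrary.
mongo admin --eval 'db.setProfilingLevel(0, 10000000)'
```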
| Comment by Bruce Lucas (Inactive) [ 25/Jul/17 ] | ||||
|
Hi Richard, The diagnostic data indicates that there is no longer a large amount of disk i/o, so removing the "discard" option has indeed addressed that issue. I've spent some time trying to reproduce the remaining stall but so far have not succeeded. The diagnostic data indicates that ~300 WiredTiger tables (collections and indexes) were removed over the course of ~30 seconds, at an average rate of very close to 10 per second. This number makes me suspect that something may be pacing the table drops at 10 per second. Unfortunately the stall has also created a gap in the diagnostic data, so we don't know in detail the timing of events during that 30 seconds. Would you be in a position to reproduce the problem while running at log level 2, and then upload the log file to the secure upload portal that we provided earlier? Beyond this, we could also collect further information, such as perf data, while you reproduce the problem.
Please let us know if either of these options is possible (assuming the log file isn't sufficient) and we'll provide more details. Thanks, | ||||
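For reference, raising and restoring the log level from the shell can be done along these lines (a sketch; run against the primary in question):

```bash
# Sketch: bump mongod log verbosity to level 2 for the repro, then restore it.
mongo admin --eval 'db.setLogLevel(2)'
# ... reproduce the dropDatabase stall and capture the log ...
mongo admin --eval 'db.setLogLevel(0)'
```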
| Comment by Richard Yates [ 24/Jul/17 ] | ||||
|
Hi Bruce, Bad news I'm afraid. After disabling discard on both the primary and secondary nodes in a replica set, I deleted another database around the same size as before (we do this on a weekly basis) and it still stalled the primary. The good news is it only stalled it for 30 seconds or so, which is the quickest I've seen, so I think disabling discard has helped. I've uploaded another metrics.interim file to the secure URL provided above if you're able to take another look? Thanks, | ||||
| Comment by Bruce Lucas (Inactive) [ 21/Jul/17 ] | ||||
|
Hi Richard, We will indeed want to add this to our production notes once we have confirmation from you that it was the cause of your issue. I'll put this ticket in "waiting for customer" state pending that confirmation. Thanks, | ||||
| Comment by Richard Yates [ 21/Jul/17 ] | ||||
|
Hi Bruce, I'm one of David's colleagues. Thanks for the pointers around discard + SSD TRIM. I've done some more testing, and forcing TRIM to run manually via fstrim caused identical behavior to what we've been reporting, so it definitely looks like you're onto something. We'll disable discard over the coming days (we have multiple replica sets so need a little time to demote primaries, remount etc.) and let you know if that appears to be the cause. Since it was AWS' recommendation to enable discard/TRIM support for instance volumes like we're using (although I can't find that reference now, so they may have revised their advice!), can I suggest you amend your production notes to include a specific mention of not using the discard mount option. We'll now plan to run fstrim manually during a periodic maintenance window to avoid any production impact, and would probably recommend others do the same. David also found this relevant thread from mongodb-user, which sounds similar: https://groups.google.com/forum/#!topic/mongodb-user/Mj0x6m-02Ms | ||||
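A sketch of the manual approach (the mount point is taken from the mount output further down this ticket; the schedule is just an example):

```bash
# Sketch: trim the MongoDB data filesystem on demand during a maintenance window
# instead of using the "discard" mount option. -v reports how much was trimmed.
sudo fstrim -v /srv/mongodb

# Example cron entry for a weekly maintenance window (schedule is an assumption):
# 0 3 * * 0  root  /sbin/fstrim /srv/mongodb
```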
| Comment by Bruce Lucas (Inactive) [ 21/Jul/17 ] | ||||
|
Hi David, The diagnostic data that you uploaded showed about 31 GB of data being written to the device during the drop operation, which is about the compressed size of the db (that is, its size on disk) as you reported. However, the metrics showed that the storage engine was not doing any i/o. I suspect that the "discard" mount option may be responsible for the i/o. This option causes TRIM commands to be issued to the device to discard unused blocks, and per the referenced article it can have a severe performance impact. I believe this is causing the file removes involved in the dropDatabase operation to be very slow and to generate a lot of i/o. Would it be possible to re-mount the filesystem without the "discard" option and test whether performance improves for dropDatabase? Thanks, | ||||
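A sketch of making that change, assuming the filesystem is mounted via /etc/fstab and mongod can be stopped briefly while it is remounted (device and mount point are taken from the mount output below):

```bash
# Sketch: with mongod stopped, remove "discard" from the fstab options and remount.
#   fstab before: /dev/nvme0n1  /srv/mongodb  xfs  rw,noatime,discard  0 0
#   fstab after:  /dev/nvme0n1  /srv/mongodb  xfs  rw,noatime          0 0
sudo umount /srv/mongodb
sudo mount /srv/mongodb
mount | grep /srv/mongodb   # confirm "discard" is no longer among the options
```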
| Comment by David Henderson [ 21/Jul/17 ] | ||||
|
/dev/nvme0n1 on /srv/mongodb type xfs (rw,noatime,discard) | ||||
| Comment by Bruce Lucas (Inactive) [ 21/Jul/17 ] | ||||
|
Thanks for that info David. Can you tell us what filesystem you are using and what mount options? | ||||
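For example, either of the following shows the filesystem type and active mount options (the path is taken from the reply above):

```bash
# Sketch: report the filesystem and mount options for the MongoDB data directory.
findmnt --target /srv/mongodb
# or:
mount | grep /srv/mongodb
```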
| Comment by David Henderson [ 21/Jul/17 ] | ||||
|
It was ~26GB compressed, with around 107 collections and 2 indexes in each collection (_id, plus 1 compound index on _id, "data.c"). Estimated collection sizes: the biggest collection was 26.9GB in size (uncompressed), 13.3 million docs. | ||||
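For reference, a sketch of how figures like these can be pulled from the shell (database and collection names are placeholders):

```bash
# Sketch: report on-disk (compressed) and uncompressed sizes plus document counts
# for a database and one of its collections. Names are placeholders.
mongo droppable_db --eval '
  var s = db.stats(1024 * 1024 * 1024);   // scale sizes to GB
  print("collections:", s.collections, "dataSize(GB):", s.dataSize,
        "storageSize(GB):", s.storageSize, "indexes:", s.indexes);
  var c = db.getCollection("biggest_collection").stats(1024 * 1024 * 1024);
  print("docs:", c.count, "dataSize(GB):", c.size, "storageSize(GB):", c.storageSize);
'
```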
| Comment by Bruce Lucas (Inactive) [ 21/Jul/17 ] | ||||
|
Thanks David. Can you tell us about the database being dropped - approximately how big is it (compressed size on disk if you know), and how many collections and indexes? | ||||
| Comment by David Henderson [ 21/Jul/17 ] | ||||
|
Thanks - I've uploaded the relevant files for a new instance of the issue. It seems that this issue is much easier to trigger than the previous version in https://jira.mongodb.org/browse/WT-3207 which tended to require multiple drops. | ||||
| Comment by Kelsey Schubert [ 20/Jul/17 ] | ||||
|
Hi david.henderson@triggeredmessaging.com, I've created a secure upload for you to use to provide the diagnostic.data. Thank you, |
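As a sketch, the diagnostic.data directory can be bundled for upload along these lines (the dbpath /srv/mongodb is an assumption based on the mount output above; substitute the actual dbpath):

```bash
# Sketch: archive the FTDC diagnostic.data directory for upload to the secure portal.
tar -czf diagnostic-data.tar.gz -C /srv/mongodb diagnostic.data
```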