[SERVER-16665] WiredTiger b-tree uses too much space in journal directory
Created: 24/Dec/14  Updated: 06/Dec/22  Resolved: 21/Nov/16
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage, WiredTiger |
| Affects Version/s: | 2.8.0-rc3 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Mark Callaghan | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Done | Votes: | 2 |
| Labels: | wiredtiger |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Storage Execution |
| Backwards Compatibility: | Fully Compatible |
| Participants: | |
| Description |
|
I am running iibench for MongoDB on two hosts. Both use the WT b-tree, one with snappy and the other with zlib compression. The test uses 10 threads to load 200M documents per thread. Immediately after the load is done the zlib test uses ~500G versus ~330G for snappy. With some idle time the zlib disk usage drops to ~160G. At first I thought the problem was limited to zlib, but watching the size of data/journal during the tests shows the problem for both snappy and zlib. My test host has 40 hyperthread cores, 144G of RAM and fast (PCIe) flash. Can WiredTiger delete log files sooner?

I have a test in progress and it is about 30% done. The problem for the zlib configuration is too much space used in the journal directory. Note that "data" is the root for database files:
And this is from the snappy test at about 20% done:
And this is from later in the snappy test:
And later in the zlib test:
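For reference, a hypothetical shell loop of the kind used to watch these directories; the paths and the interval are assumptions, not the exact commands behind the output above:

```bash
# Illustrative monitoring loop: compare total database size with the journal directory size
while true; do
  date
  du -sh data data/journal
  sleep 60
done
```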
|
| Comments |
| Comment by Alexander Gorrod [ 21/Nov/16 ] |
|
mdcallag Sorry for the slow turnaround on this ticket. We have been working hard to bound how long checkpoints take. A few tickets that tracked the work are: Given that the root cause for this issue was identified here as long checkpoints, I have re-tested with the iibench workload and could not reproduce your original symptoms of excessive log file accumulation - probably because I have a different hardware setup. Having said that, I do see a significant reduction in checkpoint time when comparing MongoDB 2.8.0-rc3 to the 3.4.0-rc3 release candidate:
I also observed that checkpoints were taking an increasingly long time to complete with the 2.8.0 version of MongoDB, but the required time is fairly stable with the 3.4.0 release candidate. I'm going to close this ticket for now - please re-open if you have a chance to re-test and find that the issue is still present in recent versions of MongoDB. |
| Comment by Michael Cahill (Inactive) [ 30/Sep/16 ] |
|
mdcallag, we are re-running iibench against the latest 3.2.x release and 3.3.x prerelease now that |
| Comment by Mark Callaghan [ 06/Jun/16 ] |
|
Copying what I appended to https://jira.mongodb.org/browse/SERVER-20306. I used the eviction_dirty_* config options to limit the number of dirty pages, but that had bad side effects. I will try tcmalloc for future WT tests; I have been using jemalloc for all engines to reduce sources of variance. I tried both of these and they reduced the max RSS to not much larger than the WT block cache. That is good. The problems were: |
| Comment by Michael Cahill (Inactive) [ 03/Jun/16 ] |
|
mdcallag, thanks for the new information. As you noted, we are tracking excess memory use in

The fundamental problem in this ticket is that checkpoints are taking too long. During the high-throughput insert phase, one checkpoint takes 668 seconds to complete. We accumulate log files from the beginning of checkpoint N to the end of checkpoint N+1, so in this case that could be 1300 seconds' worth of log files. Since the application is generating ~100MB of logs per second, this explains the size of the journal directory (a rough arithmetic check is sketched at the end of this comment). There are several reasons checkpoints are taking so long:

We added eviction_dirty_trigger in an attempt to address #1, but as you have noted, the impact on performance for some workloads is unacceptable. We are working on addressing that in

We don't yet have a design for #2, multi-threaded checkpoints: to some extent we are hoping that the solution to #1 makes this unnecessary. If you squint, LSM is also shifting work out of checkpoints, and of course iibench is designed to show off write-optimized access methods. I suspect the sweet spot is to use btree for collections and LSM for indexes: I will test that configuration and let you know what I see.

For #3, you could try running with a source code change to disable the first pass through cache before the real checkpoint:
I'll run your iibench workload next week and let you know anything else I find.
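A rough check of the journal-size arithmetic quoted in this comment; the rate and window come from the figures above, and the script itself is only illustrative:

```bash
# Journal retained ~= log rate x window from start of checkpoint N to end of checkpoint N+1
LOG_RATE_MB_PER_SEC=100   # ~100MB of log generated per second, as stated above
WINDOW_SEC=1300           # roughly two back-to-back 668-second checkpoints
echo "~$(( LOG_RATE_MB_PER_SEC * WINDOW_SEC / 1024 )) GB of journal retained"   # prints ~126 GB
```
|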
| Comment by Mark Callaghan [ 02/Jun/16 ] |
|
Another side effect of setting eviction_dirty_* options is that the insert rate drops by 10X for my insert-only test compared to not setting the options. |
| Comment by Mark Callaghan [ 02/Jun/16 ] |
|
Test restarted with a 128G block cache.

And the problem repeats with a large journal directory early in the test.

But eviction_dirty_{target|trigger} are working as requested:
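Separately, for reference, a sketch of how a run with these settings could be started from the command line; the option values and paths are illustrative assumptions, not the actual test invocation:

```bash
# Illustrative only: 128GB cache plus dirty-eviction settings passed through the engine config string
mongod --dbpath /data/db \
       --storageEngine wiredTiger \
       --wiredTigerCacheSizeGB 128 \
       --wiredTigerEngineConfigString="eviction_dirty_target=10,eviction_dirty_trigger=20"
```
|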
| Comment by Mark Callaghan [ 02/Jun/16 ] |
|
db.serverStatus() from a WiredTiger run when RSS is ~2X the size of the WT block cache
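One way to capture that kind of data; the commands below are an illustrative sketch (assuming a local mongod and a Linux ps), not the method used for the attached output:

```bash
# Dump the WiredTiger cache section of serverStatus() and the mongod resident set size
mongo --quiet --eval 'printjson(db.serverStatus().wiredTiger.cache)'
ps -o rss= -C mongod   # RSS in KB; compare against the configured block cache size
```
|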
| Comment by Mark Callaghan [ 02/Jun/16 ] |
|
I am repeating tests after setting eviction_dirty_trigger: |
| Comment by Mark Callaghan [ 02/Jun/16 ] |
|
Metrics file from diagnostic.data |
| Comment by Mark Callaghan [ 02/Jun/16 ] |
|
Whether or not this is expected, the behavior is far from desirable. For the workload explained in https://jira.mongodb.org/browse/SERVER-20306 I get a huge journal directory because the insert rate into WiredTiger is too high. Once the block cache gets full and eviction (stalls) slows the insert rate down, the journal directory shrinks. In one case I get at least 133G in the journal when database + journal is ~190G. Once the WT block cache gets full the insert rate slows and the journal drops to less than 10G. Then I experience https://jira.mongodb.org/browse/SERVER-20306. So WT is either using too much space in the journal directory or too much memory. |
| Comment by Ramon Fernandez Marina [ 19/May/15 ] |
|
The current space usage described in this ticket is expected, but since there's room for improvement we're repurposing this ticket to evaluate what enhancements can be made in this area for 3.2. |
| Comment by Mark Callaghan [ 06/Jan/15 ] |
|
Created feature requests:
|
| Comment by Mark Callaghan [ 06/Jan/15 ] |
|
Using this configuration appears to help:

My understanding is that eviction can be multi-threaded, and with compression enabled the eviction threads must compress data, so more threads can be important. Checkpoint is single-threaded, although there is an option for "checkpoint_threads" in the wtperf.c client. Checkpoint also compresses, given this thread stack I get while running iibench, so support for more checkpoint threads or helper threads to do the compression might help.
|
| Comment by Mark Callaghan [ 06/Jan/15 ] |
|
This might be the first time I have read the cache docs. With the defaults, eviction starts when 95% of the cache limit is reached and runs until usage is reduced to 80% of the limit. Throwing out 15% of the cache seems like a drastic change, especially when we have servers with 100G or more of RAM. Will this be made more incremental in the future? http://source.wiredtiger.com/2.4.1/tune_cache.html |
| Comment by Mark Callaghan [ 06/Jan/15 ] |
|
I thought of trying a combination of eviction_dirty_target and eviction[threads_max=X], but without support for eviction_dirty_trigger I suspect that will make the problem worse. Instead I will just try eviction[threads_max=X] for X > 1.
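A sketch of how extra eviction threads can be requested when starting mongod; the thread counts and dbpath below are illustrative assumptions, not the values used in this test:

```bash
# Illustrative only: ask WiredTiger for more eviction worker threads
mongod --dbpath /data/db \
       --storageEngine wiredTiger \
       --wiredTigerEngineConfigString="eviction=(threads_min=4,threads_max=8)"
```
|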
| Comment by Michael Cahill (Inactive) [ 06/Jan/15 ] |
|
Thanks for the update, mdcallag. I understand that a small cache is not practical in general: I made that suggestion to get a handle on the underlying cause.

What WiredTiger needs to do here is manage the amount of work involved in checkpoints: i.e., the maximum amount of I/O required to complete a checkpoint, so that checkpoint performance is not determined by cache size. We have a setting eviction_dirty_target to limit what fraction of the cache can be dirty pages (so that you could theoretically have a 100GB cache but never need to write more than 1GB of data to complete a checkpoint). Unfortunately, as the name suggests, this is only a target and there is no guarantee that inserts cannot swamp eviction. InnoDB's fuzzy checkpoints are a better solution, and that includes potentially stalling application threads before they can make pages dirty (which WiredTiger only does once the cache becomes full).

We are not going to have a full solution to this problem for MongoDB 2.8, but it's high on our todo list. |
| Comment by Mark Callaghan [ 05/Jan/15 ] |
|
Using a small cache means that some of my tests fall over because they spend too much time doing cache eviction. Here is an example from iibench with --wiredTigerCacheSizeGB 4: the doc insert rate drops from 20,000 to 400 because of this. With a larger cache it drops from 20,000 to 10,000. I am using snappy compression.
|
| Comment by Mark Callaghan [ 29/Dec/14 ] |
|
It doesn't reproduce with --wiredTigerCacheSizeGB 1. That also cuts the document insert rate by half during my test. I also reported long checkpoint stalls from this test configuration for the WT b-tree, so I agree that might be the root cause. |
| Comment by Michael Cahill (Inactive) [ 29/Dec/14 ] |
|
Hi mdcallag, we will have to take a closer look at this one, but I suspect what is happening is that checkpoints are taking a (very) long time to complete. That's because MongoDB will configure WiredTiger with a 72GB cache on your system, and flushing that much data takes a long time (particularly with zlib compression). If you have a chance to explore this further, it would be interesting to know whether this still happens with a smaller cache (i.e. running mongod --wiredTigerCacheSizeGB 1).
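A sketch of the suggested experiment; the 72GB figure quoted above is consistent with a default cache of roughly half the host's 144GB of RAM, and the dbpath and other flags below are illustrative assumptions:

```bash
# Illustrative only: repeat the load with a deliberately small WiredTiger cache
mongod --dbpath /data/db --storageEngine wiredTiger --wiredTigerCacheSizeGB 1
```
|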
| Comment by Daniel Pasette (Inactive) [ 26/Dec/14 ] |
|
Thanks for the report. |
| Comment by Mark Callaghan [ 24/Dec/14 ] |
|
I have not encountered this problem with the WiredTiger LSM. |