[SERVER-16665] WiredTiger b-tree uses too much space in journal directory Created: 24/Dec/14  Updated: 06/Dec/22  Resolved: 21/Nov/16

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: 2.8.0-rc3
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Mark Callaghan Assignee: Backlog - Storage Execution Team
Resolution: Done Votes: 2
Labels: wiredtiger
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File es.022553     File metrics.2016-05-31T03-47-19Z-00000.gz    
Issue Links:
Depends
depends on SERVER-16736 support more than 1 checkpoint thread... Closed
depends on SERVER-16737 support eviction_dirty_trigger for Wi... Closed
Related
related to SERVER-16575 intermittent slow inserts with WiredT... Closed
is related to WT-2764 Optimize checkpoints to reduce throug... Closed
is related to WT-2389 Spread Out I/O Rather Than Spiking Du... Closed
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Participants:

 Description   

I am running iibench for MongoDB on two hosts. Both use the WT b-tree, one with snappy and the other with zlib compression. The test uses 10 threads to load 200M documents per thread. Immediately after the load is done, the zlib test uses ~500G versus ~330G for snappy. With some idle time the zlib disk usage drops to ~160G. At first I thought the problem was limited to zlib, but watching the size of data/journal during the tests shows the problem for both snappy and zlib.
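
For reference, a rough sketch of mongod invocations matching the two configurations described above (the dbpath and exact flags here are illustrative, not copied from my setup):

# snappy host (approximate invocation; dbpath is an example)
mongod --storageEngine wiredTiger --dbpath /path/to/data \
    --wiredTigerCollectionBlockCompressor snappy
# zlib host
mongod --storageEngine wiredTiger --dbpath /path/to/data \
    --wiredTigerCollectionBlockCompressor zlib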

My test host has 40 hyperthreaded cores, 144G of RAM, and fast (PCIe) flash.

Can WiredTiger delete log files sooner?

I have a test in progress and it is about 30% done. The problem for the zlib configuration is too much space used in the journal directory. Note that "data" is the root for database files:

du -hs data data/journal; ls data/journal | wc -l
163G    data
131G    data/journal
1337

And this is from the snappy test at about 20% done:

du -hs data data/journal; ls data/journal | wc -l
69G     data
23G     data/journal
231

And this is from later in the snappy test:

du -hs data data/journal; ls data/journal | wc -l
120G    data
47G     data/journal
475

ls -lh data/journal/
total 25G
-rw-r--r-- 1 root root 100M Dec 24 07:58 WiredTigerLog.0000001012
-rw-r--r-- 1 root root 100M Dec 24 07:58 WiredTigerLog.0000001013
-rw-r--r-- 1 root root 100M Dec 24 07:58 WiredTigerLog.0000001014
<snip>

And later in the zlib test:

du -hs data data/journal; ls data/journal | wc -l
231G    data
182G    data/journal
1862
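
A simple way to watch this over the course of a test is a loop like the following (sketch; the sampling interval is arbitrary):

# sample database and journal footprint once a minute while the load runs
while true; do
    date
    du -hs data data/journal
    ls data/journal | wc -l
    sleep 60
done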



 Comments   
Comment by Alexander Gorrod [ 21/Nov/16 ]

mdcallag Sorry for the slow turnaround on this ticket. We have been working hard to bound how long checkpoints take. A few of the tickets that tracked the work are WT-2764 and WT-2389. This work took place during the MongoDB 3.4 development cycle, and many of those changes have also been included in the MongoDB 3.2.11 release.

Given that the root cause for this issue was identified here as long checkpoints, I have re-tested with the iibench workload and could not reproduce your original symptom of excessive log file accumulation, probably because I have a different hardware setup. Having said that, I do see a significant reduction in checkpoint time when comparing MongoDB 2.8.0-rc3 to the 3.4.0-rc3 release candidate:

Version      Maximum checkpoint time (seconds)
2.8.0-rc3    79
3.4.0-rc3    17

I also observed that checkpoints were taking an increasingly long time to complete with the 2.8.0 version of MongoDB, but the required time is fairly stable with the 3.4.0 release candidate.

I'm going to close this ticket for now - please re-open if you have a chance to re-test and find that the issue is still present in recent versions of MongoDB.

Comment by Michael Cahill (Inactive) [ 30/Sep/16 ]

mdcallag, we are re-running iibench against the latest 3.2.x release and 3.3.x prerelease now that SERVER-20306 is resolved. We'll post more detailed results here when we have them, but checkpoints are now completing consistently so old journal files are being deleted as expected.

Comment by Mark Callaghan [ 06/Jun/16 ]

Copying what I appended to https://jira.mongodb.org/browse/SERVER-20306. I used the eviction_dirty_* config options to limit the number of dirty pages, but that had bad side effects. I will try tcmalloc for future WT tests; I have been using jemalloc for all engines to reduce one source of variance.

I tried both of these:
"eviction_dirty_target=20,eviction_dirty_trigger=25"
"eviction_dirty_target=10,eviction_dirty_trigger=20"
They reduced the max RSS to not much larger than the WT block cache. That is good. The problems were:
1) the average insert rate during the test dropped by more than 20X
2) the CPU load increased by more than 5X

Comment by Michael Cahill (Inactive) [ 03/Jun/16 ]

mdcallag, thanks for the new information.

As you noted, we are tracking excess memory use in SERVER-20306. I saw in the diagnostic data that you are building with jemalloc. We are doing some testing at the moment with jemalloc, but in the meantime one advantage of stock builds with tcmalloc is that MongoDB can include tcmalloc's internal statistics in serverStatus and thus diagnostic.data. That would help us diagnose the cause of the excess memory use.

The fundamental problem in this ticket is that checkpoints are taking too long. During the high-throughput insert phase, one checkpoint takes 668 seconds to complete. We accumulate log files from the beginning of checkpoint N to the end of checkpoint N+1, so in this case that could be 1300 seconds worth of log files. Since the application is generating ~100MB of logs per second, this explains the size of the journal directory.
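
A quick back-of-the-envelope check of those numbers:

# ~1300 seconds of retained log at ~100 MB/s of log writes
echo $(( 1300 * 100 / 1024 ))   # => 126, i.e. roughly 126GB of journal, which matches the ~131G observed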

There are several reasons checkpoints are taking so long:

  1. a very large percentage of the cache is dirty, so checkpoints have a lot of work to do;
  2. WiredTiger checkpoints are currently single-threaded; and
  3. our attempt to reduce work inside the "critical section" of checkpoint is not helping in your case (we do a pass through the cache before the checkpoint starts for real, but in your workload, that just slows things down without reducing the amount of work checkpoints have to do).

We added eviction_dirty_trigger in an attempt to address #1, but as you have noted, the impact on performance for some workloads is unacceptable. We are working on addressing that in WT-2666.

We don't yet have a design for #2, multi-threaded checkpoints: to some extent we are hoping that the solution to #1 makes this unnecessary. If you squint, LSM is also shifting work out of checkpoints, and of course iibench is designed to show off write-optimized access methods. I suspect the sweet spot is to use btree for collections and LSM for indexes: I will test that configuration and let you know what I see.
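
If you want to try that split yourself in the meantime, per-index storage options should allow it. An untested sketch, run from the shell (the database, collection, and key names are placeholders):

# keep the collection as a btree, but build the secondary index as LSM
mongo iibench --eval '
    db.purchases.createIndex(
        { price: 1, customerid: 1 },
        { storageEngine: { wiredTiger: { configString: "type=lsm" } } }
    )'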

For #3, you could try running with a source code change to disable the first pass through cache before the real checkpoint:

diff --git a/src/third_party/wiredtiger/src/txn/txn_ckpt.c b/src/third_party/wiredtiger/src/txn/txn_ckpt.c
index c1b435d..71cbde9 100644
--- a/src/third_party/wiredtiger/src/txn/txn_ckpt.c
+++ b/src/third_party/wiredtiger/src/txn/txn_ckpt.c
@@ -410,6 +410,7 @@ __txn_checkpoint(WT_SESSION_IMPL *session, const char *cfg[])
 	WT_ERR(__checkpoint_data_source(session, cfg));
 
 	WT_ERR(__wt_epoch(session, &verb_timer));
+#if 0
 	WT_ERR(__checkpoint_verbose_track(session,
 	    "starting write leaves", &verb_timer));
 
@@ -425,6 +426,7 @@ __txn_checkpoint(WT_SESSION_IMPL *session, const char *cfg[])
 	 * asynchronous flush as much time as possible before we wait.
 	 */
 	WT_ERR(__checkpoint_apply(session, cfg, __wt_checkpoint_sync));
+#endif
 
 	/* Tell logging that we are about to start a database checkpoint. */
 	if (full && logging)

I'll run your iibench workload next week and let you know anything else I find.

Comment by Mark Callaghan [ 02/Jun/16 ]

Another side effect of setting eviction_dirty_* options is that the insert rate drops by 10X for my insert-only test compared to not setting the options.

Comment by Mark Callaghan [ 02/Jun/16 ]

Test restarted with a 128G block cache:
storage.wiredTiger.collectionConfig.blockCompressor: zlib
storage.wiredTiger.engineConfig.journalCompressor: none
storage.wiredTiger.engineConfig.cacheSizeGB: 128
storage.wiredTiger.engineConfig.configString: "eviction_dirty_target=10,eviction_dirty_trigger=20"

And the problem repeats, with a large journal directory early in the test:
156G data
122G data/journal

But eviction_dirty_{target|trigger} are working as requested:
es.024913: "tracked dirty bytes in the cache" : 27487934391,
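
For the record, that figure is almost exactly 20% of the 128G cache, i.e. sitting right at the trigger:

# dirty bytes as a percentage of the 128GB cache
echo "scale=1; 27487934391 / (128 * 1024^3) * 100" | bc    # => 20.0, right at eviction_dirty_trigger=20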

Comment by Mark Callaghan [ 02/Jun/16 ]

db.serverStatus() from a WiredTiger run when RSS is ~2X the size of the WT block cache

Comment by Mark Callaghan [ 02/Jun/16 ]

I am repeating tests after setting eviction_dirty_trigger:
https://jira.mongodb.org/browse/SERVER-16737

Comment by Mark Callaghan [ 02/Jun/16 ]

Metrics file from diagnostic.data

Comment by Mark Callaghan [ 02/Jun/16 ]

Whether or not this is expected, the behavior is far from desirable. For the workload explained in https://jira.mongodb.org/browse/SERVER-20306 I get a huge journal directory because the insert rate into WiredTiger is so high. Once the block cache gets full and eviction (stalls) slows the insert rate down, the journal directory shrinks.

In one case I get at least 133G in the journal when database + journal is ~190G. Once the WT block cache gets full the insert rate slows and the journal drops to less than 10G. Then I experience https://jira.mongodb.org/browse/SERVER-20306.

So WT is either using too much space in the journal directory or too much memory.

Comment by Ramon Fernandez Marina [ 19/May/15 ]

The current space usage described in this ticket is expected, but since there's room for improvement we're repurposing this ticket to evaluate what enhancements can be made in this area for 3.2.

Comment by Mark Callaghan [ 06/Jan/15 ]

Created feature requests: SERVER-16736 (support more than 1 checkpoint thread) and SERVER-16737 (support eviction_dirty_trigger).

Comment by Mark Callaghan [ 06/Jan/15 ]

Using this configuration appears to help:
storage.wiredTiger.engineConfig.configString: "eviction_target=90,eviction_trigger=95,eviction_dirty_target=85,eviction=(threads_min=4,threads_max=8)"

My understanding is that eviction can be multi-threaded, and with compression enabled the eviction threads must compress data, so more threads can be important. Checkpoint is single-threaded, although there is a "checkpoint_threads" option in the wtperf.c client. Checkpoint also compresses, judging by this thread stack captured while running iibench, so support for more checkpoint threads, or helper threads to do the compression, might help.

compress_block,_tr_flush_block,deflate_slow,deflate,zlib_compress_raw,__rec_split_raw_worker,__rec_split_raw,__rec_row_leaf_insert,__rec_row_leaf,__wt_reconcile,__sync_file,__wt_cache_op,__checkpoint_write_leaves,__checkpoint_apply,__wt_txn_checkpoint,__session_checkpoint,__ckpt_server,start_thread,clone

Comment by Mark Callaghan [ 06/Jan/15 ]

This might be the first time I read the cache docs. With the defaults, eviction starts when 95% of the cache limit is reached and runs until usage is reduced to 80% of the limit. Throwing out 15% of the cache seems like a drastic change, especially when we have servers with 100G or more of RAM. Will this be made more incremental in the future?

http://source.wiredtiger.com/2.4.1/tune_cache.html
http://source.wiredtiger.com/2.4.1/group__wt.html#ga9e6adae3fc6964ef837a62795c7840ed

Comment by Mark Callaghan [ 06/Jan/15 ]

I thought of trying a combination of eviction_dirty_target and eviction[threads_max=X] but without support for eviction_dirty_trigger I suspect that will make the problem worse. Instead I will just try eviction[threads_max=X] for X > 1.

Comment by Michael Cahill (Inactive) [ 06/Jan/15 ]

Thanks for the update, mdcallag. I understand that a small cache is not practical in general: I made that suggestion to get a handle on the underlying cause.

What WiredTiger needs to do here is manage the amount of work involved in checkpoints: i.e., the maximum amount of I/O required to complete a checkpoint, so that checkpoint performance is not determined by cache size.

We have a setting eviction_dirty_target to limit what fraction of the cache can be dirty pages (so that you could theoretically have a 100GB cache but never need to write more than 1GB of data to complete a checkpoint). Unfortunately, as the name suggests, this is only a target and there is no guarantee that inserts cannot swamp eviction.
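
For concreteness, eviction_dirty_target is expressed as a percentage of the cache size, so the 100GB/1GB example above corresponds to a setting of 1. An illustrative invocation only (a value that low may be below the minimum accepted by some WiredTiger versions, and this is not a recommendation):

# hypothetical invocation matching the 100GB cache / ~1GB dirty example above
mongod --storageEngine wiredTiger \
    --wiredTigerCacheSizeGB 100 \
    --wiredTigerEngineConfigString "eviction_dirty_target=1"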

InnoDB's fuzzy checkpoints are a better solution, and that includes potentially stalling application threads before they can make pages dirty (which WiredTiger only does once the cache becomes full). We are not going to have a full solution to this problem for MongoDB 2.8, but it's high on our todo list.

Comment by Mark Callaghan [ 05/Jan/15 ]

Using a small cache means that some of my tests fall over from spending too much time doing cache eviction. Here is an example from iibench with --wiredTigerCacheSizeGB 4. The doc insert rate drops from 20,000 to 400 because of this. With a larger cache it drops from 20,000 to 10,000. I am using snappy compression.

      9 pthread_cond_timedwait@@GLIBC_2.3.2,__wt_cond_wait,__wt_cache_full_check,__session_begin_transaction,mongo::WiredTigerRecoveryUnit::_txnOpen,mongo::WiredTigerRecoveryUnit::getSession,mongo::WiredTigerCursor::_init,mongo::WiredTigerRecordStore::insertRecord,mongo::Collection::_insertDocument,mongo::Collection::insertDocument,singleInsert,insertOne,mongo::WriteBatchExecutor::execOneInsert,mongo::WriteBatchExecutor::execInserts,mongo::WriteBatchExecutor::bulkExecute,mongo::WriteBatchExecutor::executeBatch,mongo::WriteCmd::run,mongo::_execCommand,mongo::Command::execCommand,mongo::_runCommands,runCommands,mongo::runQuery,receivedQuery,mongo::assembleResponse,mongo::MyMessageHandler::process,mongo::PortMessageServer::handleIncomingMsg,start_thread,clone
      1 __wt_tree_walk,__evict_walk_file,__evict_walk,__evict_lru_walk,__evict_pass,__evict_server,start_thread,clone
      1 sigwait,mongo::(anonymous,boost::(anonymous,start_thread,clone
      1 select,mongo::Listener::initAndListen,_initAndListen,mongo::initAndListen,mongoDbMain,main
      1 recv,mongo::Socket::_recv,mongo::Socket::unsafe_recv,mongo::Socket::recv,mongo::MessagingPort::recv,mongo::PortMessageServer::handleIncomingMsg,start_thread,clone
      1 pthread_cond_wait@@GLIBC_2.3.2,wait<boost::unique_lock<boost::timed_mutex>,mongo::DeadlineMonitor<mongo::V8Scope>::deadlineMonitorThread,boost::(anonymous,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,__wt_cond_wait,__wt_cache_full_check,__wt_page_in_func,__wt_page_swap_func,__wt_row_search,__cursor_row_search,__wt_btcur_insert,__curfile_insert,mongo::WiredTigerIndexStandard::_insert,mongo::WiredTigerIndex::insert,mongo::BtreeBasedAccessMethod::insert,mongo::IndexCatalog::_indexRecord,mongo::IndexCatalog::indexRecord,mongo::Collection::_insertDocument,mongo::Collection::insertDocument,singleInsert,insertOne,mongo::WriteBatchExecutor::execOneInsert,mongo::WriteBatchExecutor::execInserts,mongo::WriteBatchExecutor::bulkExecute,mongo::WriteBatchExecutor::executeBatch,mongo::WriteCmd::run,mongo::_execCommand,mongo::Command::execCommand,mongo::_runCommands,runCommands,mongo::runQuery,receivedQuery,mongo::assembleResponse,mongo::MyMessageHandler::process,mongo::PortMessageServer::handleIncomingMsg,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,__wt_cond_wait,__sweep_server,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,__wt_cond_wait,__log_server,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,__wt_cond_wait,__ckpt_server,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,timed_wait<boost::unique_lock<boost::timed_mutex>,timed_wait<boost::unique_lock<boost::timed_mutex>,,mongo::RangeDeleter::doWork,boost::(anonymous,start_thread,clone
      1 pthread_cond_timedwait@@GLIBC_2.3.2,boost::condition_variable_any::timed_wait<boost::unique_lock<boost::timed_mutex>,timed_wait<boost::unique_lock<boost::timed_mutex>,,timed_wait<boost::unique_lock<boost::timed_mutex>,,mongo::(anonymous,mongo::BackgroundJob::jobBody,boost::(anonymous,start_thread,clone
      1 nanosleep,mongo::sleepsecs,mongo::TTLMonitor::run,mongo::BackgroundJob::jobBody,boost::(anonymous,start_thread,clone
      1 nanosleep,mongo::sleepsecs,mongo::ClientCursorMonitor::run,mongo::BackgroundJob::jobBody,boost::(anonymous,start_thread,clone

Comment by Mark Callaghan [ 29/Dec/14 ]

It doesn't reproduce with --wiredTigerCacheSizeGB 1. That also cuts the document insert rate by half during my test. I also reported long checkpoint stalls from this test configuration for the WT b-tree, so I agree that might be the root cause.

Comment by Michael Cahill (Inactive) [ 29/Dec/14 ]

Hi mdcallag, we will have to take a closer look at this one, but I suspect what is happening is that checkpoints are taking a (very) long time to complete.

That's because MongoDB will configure WiredTiger with a 72GB cache on your system, and flushing that much data takes a long time (particularly with zlib compression). If you have a chance to explore this further, it would be interesting to know whether this still happens with a smaller cache (i.e. running mongod --wiredTigerCacheSizeGB 1).

Comment by Daniel Pasette (Inactive) [ 26/Dec/14 ]

Thanks for the report.
With the configuration you've described, the journal files are meant to be deleted after each checkpoint. These should be triggered every 60 secs (syncPeriodSecs) or after 2GB of data is written to the journal. We'll take a look.
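
For reference, that 60-second interval is the syncdelay / storage.syncPeriodSecs setting, so the default behavior corresponds to something like:

# checkpoints (and therefore journal file cleanup) run every syncPeriodSecs; 60 seconds is the default
mongod --storageEngine wiredTiger --syncdelay 60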

Comment by Mark Callaghan [ 24/Dec/14 ]

I have not encountered this problem with the WiredTiger LSM.
