[SERVER-16562] Extremely slow shutdown for WiredTiger (non-LSM) Created: 16/Dec/14  Updated: 24/Dec/14  Resolved: 18/Dec/14

Status: Closed
Project: Core Server
Component/s: Concurrency, Storage
Affects Version/s: 2.8.0-rc2
Fix Version/s: 2.8.0-rc4

Type: Bug Priority: Major - P3
Reporter: Mark Callaghan Assignee: Mark Benvenuto
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

command line for starting mongod:
bin.tcm/mongod --config /data/mysql/mongo/mongo.conf --storageEngine wiredTiger

configuration file:
processManagement:
fork: true
systemLog:
destination: file
path: /data/mysql/mongo/log
logAppend: true
storage:
dbPath: /data/mysql/mongo/data
journal:
enabled: true
mmapv1:
syncPeriodSecs: 60
journal:
commitIntervalMs: 100

Test case is the insert benchmark changed to insert 2B rows (200M rows per thread, 10 insert threads) with a 2000/byte char field.
export MAX_ROWS=200000000
export LENGTH_CHAR_FIELDS=2000

Participants:

 Description   

I loaded 2B docs via the insert benchmark to create a 300G database last night. The load stopped more than 8 hours ago. This morning I did "kill $( pidof mongod )" to shutdown mongod and the shutdown took 12 minutes.

This reproduces with the default (non-LSM) engine for WiredTiger. This does not reproduce with the LSM engine. This is the thread stack for the thread doing all of the work at shutdown. My host has very fast PCIe-flash for storage.

Looks like a long checkpoint to me:
_free_update_list,free_skip_list,free_skip_array,free_page_row_leaf,wt_page_out,wt_ref_out,wt_rec_page_clean_update,evict_file,wt_cache_op,wt_checkpoint_close,wt_conn_btree_sync_and_close,wt_conn_dhandle_discard_single,wt_conn_dhandle_discard,wt_connection_close,_conn_close,mongo::WiredTigerKVEngine::cleanShutdown,shutdownServer,mongo::exitCleanly,mongo::(anonymous,boost::(anonymous,start_thread,clone

During the shutdown this error message was printed a few times:

2014-12-16T07:59:02.232-0800 I - [TTLMonitor] LockerId 411893 has been waiting to acquire lock for more than 30 seconds. MongoDB will print the lock manager state and the stack of the thread that
has been waiting, for diagnostic purposes. This message does not necessary imply that the server is experiencing an outage, but might be an indication of an overload.
2014-12-16T07:59:02.232-0800 I - [TTLMonitor] Dumping LockManager @ 0x1ad5d20
2014-12-16T07:59:02.232-0800 I - [TTLMonitor] Lock @ 0x2903580:

{2305843009213693953: Global, 1}

GRANTED:
LockRequest 0 @ 0x7F9C8B743350: Mode = X; ConvertMode = NONE; EnqueueAtFront = 1; CompatibleFirst = 1;

PENDING:
LockRequest 411893 @ 0x2BC8A00: Mode = IS; ConvertMode = NONE; EnqueueAtFront = 0; CompatibleFirst = 0;
2014-12-16T07:59:02.232-0800 I - [TTLMonitor]
2014-12-16T07:59:02.235-0800 I - [TTLMonitor]
0xf06559 0x99c74a 0x9a41cc 0x99d44b 0x993520 0x994ed0 0xd40012 0xd41392 0xe98ec0 0xf528f4 0x7f9c92c48fa8 0x7f9c91d2a5ad



 Comments   
Comment by Githook User [ 18/Dec/14 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-16562: Do fast wiredtiger shutdown
Branch: master
https://github.com/mongodb/mongo/commit/46589450c11ce5cc8cfef2eaf47342d53b271a35

Comment by Asya Kamsky [ 17/Dec/14 ]

I've seen slow shutdown as well - I'd assumed it was the last checkpoint and in fact if I sleep for 60 to 120 seconds before shutting down it takes less time. Could be both - last checkpoint and freeing memory?

Comment by Mark Callaghan [ 17/Dec/14 ]

I wonder if tuning tcmalloc can help here. Back in the day when I used it with MySQL we reduced the amount of memory it would cache per thread to avoid wasting memory for workloads with 1000+ threads.

I repeated the shutdown test after for jemalloc. I didn't not try to tune it. The shutdown took ~17 minutes. And from one sample thread stack there is some time in jemalloc code:
jemalloc_je_arena_dalloc_bin_locked,jemalloc_je_tcache_bin_flush_small,jemalloc_je_tcache_dalloc_small,jemalloc_je_arena_dalloc,jemalloc_je_idalloct,jemalloc_je_iqalloct,jemalloc_je_iqalloc,ifree,free,_wt_free_int,free_update_list,free_skip_list,free_skip_array,free_page_row_leaf,wt_page_out,wt_ref_out,wt_rec_page_clean_update,evict_file,wt_cache_op,wt_checkpoint_close,wt_conn_btree_sync_and_close,wt_conn_dhandle_discard_single,wt_conn_dhandle_discard,wt_connection_close,_conn_close,mongo::WiredTigerKVEngine::cleanShutdown,shutdownServer,mongo::exitCleanly,mongo::(anonymous,boost::(anonymous,start_thread,clone

And the top N functions from linux perf during a 10 second sample

34.07% mongod mongod [.] __wt_page_out
29.32% mongod mongod [.] jemalloc_je_tcache_bin_flush_small
17.26% mongod mongod [.] free
17.13% mongod mongod [.] jemalloc_je_arena_dalloc_bin_locked
0.55% mongod libpthread-2.17.so [.] pthread_mutex_lock
0.40% mongod libpthread-2.17.so [.] pthread_mutex_unlock
0.23% mongod mongod [.] __wt_free_int
0.14% mongod mongod [.] arena_purge
0.14% mongod mongod [.] arena_run_tree_insert
0.06% mongod [kernel.kallsyms] [k] futex_wait
0.06% mongod libc-2.17.so [.] __memmove_ssse3_back
0.05% mongod libpthread-2.17.so [.] pthread_cond_timedwait@@GLIBC_2.3.2
0.05% mongod mongod [.] arena_avail_tree_insert
0.04% mongod mongod [.] arena_chunk_dirty_remove
0.03% mongod [kernel.kallsyms] [k] _raw_spin_lock

Comment by Michael Cahill [ 16/Dec/14 ]

Hi mdcallag, what you're seeing is WiredTiger freeing memory that was allocated to cache data. There shouldn't be any I/O involved because periodic checkpoints will ensure that the cache is clean.

MongoDB defaults to allowing the WiredTiger cache to grow to half the size of system RAM, so there could be 100GB of memory to free, depending on your test system.

You may recall that you suggested we add a way for WiredTiger to shut down without freeing that memory, so there is a leak_memory configuration to WT_CONNECTION::close that would avoid all of this work. Of course, the downside of using that is that leak checking tools will report leaked memory. It is not currently used or exposed via MongoDB: we'll have a discussion about whether it can be turned on.

Comment by Mark Callaghan [ 16/Dec/14 ]

Tried hierarchical "perf record -g" but (nil) was used rather than names for calling functions so the result is garbage

Comment by Mark Callaghan [ 16/Dec/14 ]

Top-N from "perf record" with flat profiling

39.87% mongod mongod [.] __wt_page_out
29.48% mongod mongod [.] tc_free
14.09% mongod mongod [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)
8.00% mongod mongod [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)
7.61% mongod mongod [.] tcmalloc::CentralFreeList::ReleaseListToSpans(void*)
0.26% mongod mongod [.] __wt_free_int
0.11% mongod [kernel.kallsyms] [k] update_sd_lb_stats
0.06% mongod mongod [.] __wt_epoch
0.05% mongod mongod [.] tcmalloc::CentralFreeList::InsertRange(void*, void*, int)
0.04% mongod [kernel.kallsyms] [k] rcu_check_callbacks
0.03% mongod [kernel.kallsyms] [k] _raw_spin_lock
0.03% mongod [kernel.kallsyms] [k] do_timer
0.03% mongod [kernel.kallsyms] [k] cpuacct_charge
0.02% mongod [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.02% mongod [kernel.kallsyms] [k] apic_timer_interrupt
0.02% mongod [kernel.kallsyms] [k] smp_apic_timer_interrupt
0.02% mongod [kernel.kallsyms] [k] update_cfs_shares
0.02% mongod [kernel.kallsyms] [k] perf_event_task_tick
0.02% mongod [kernel.kallsyms] [k] lapic_next_deadline
0.02% mongod [kernel.kallsyms] [k] __do_softirq
0.02% mongod [kernel.kallsyms] [k] trigger_load_balance
0.01% mongod [kernel.kallsyms] [k] ktime_get_update_offsets
0.01% mongod [kernel.kallsyms] [k] native_sched_clock
0.01% mongod [kernel.kallsyms] [k] scheduler_tick
0.01% mongod [kernel.kallsyms] [k] irq_enter
0.01% mongod [kernel.kallsyms] [k] ktime_get
0.01% mongod [kernel.kallsyms] [k] rcu_irq_exit
0.01% mongod [kernel.kallsyms] [k] x86_pmu_disable
0.01% mongod [kernel.kallsyms] [k] update_context_time.isra.48
0.01% mongod [kernel.kallsyms] [k] kthread_data
0.01% mongod [kernel.kallsyms] [k] calc_global_load
0.01% mongod [kernel.kallsyms] [k] __irq_work_run
0.01% mongod [kernel.kallsyms] [k] hrtimer_interrupt
0.01% mongod [kernel.kallsyms] [k] rcu_eqs_enter_common.isra.54

Comment by Mark Callaghan [ 16/Dec/14 ]

Repeated test using fsync on commit and then grabbed thread stacks on shutdown. Here is a sample...

SLL_PopRange,PopRange,tcmalloc::ThreadCache::ReleaseToCentralCache,tcmalloc::ThreadCache::ListTooLong,Deallocate,do_free_helper,do_free_with_callback,do_free,tc_free,_wt_free_int,free_skip_list,free_skip_array,free_page_row_leaf,wt_page_out,wt_ref_out,wt_rec_page_clean_update,evict_file,wt_cache_op,wt_checkpoint_close,wt_conn_btree_sync_and_close,wt_conn_dhandle_discard_single,wt_conn_dhandle_discard,wt_connection_close,_conn_close,mongo::WiredTigerKVEngine::cleanShutdown,shutdownServer,mongo::exitCleanly,mongo::(anonymous,boost::(anonymous,start_thread,clone

_free_update_list,free_skip_list,free_skip_array,free_page_row_leaf,wt_page_out,wt_ref_out,wt_rec_page_clean_update,evict_file,wt_cache_op,wt_checkpoint_close,wt_conn_btree_sync_and_close,wt_conn_dhandle_discard_single,wt_conn_dhandle_discard,wt_connection_close,_conn_close,mongo::WiredTigerKVEngine::cleanShutdown,shutdownServer,mongo::exitCleanly,mongo::(anonymous,boost::(anonymous,start_thread,clone

tcmalloc::CentralFreeList::ReleaseToSpans,tcmalloc::CentralFreeList::ReleaseListToSpans,tcmalloc::CentralFreeList::InsertRange,tcmalloc::ThreadCache::ReleaseToCentralCache,tcmalloc::ThreadCache::ListTooLong,Deallocate,do_free_helper,do_free_with_callback,do_free,tc_free,_wt_free_int,free_update_list,free_skip_list,free_skip_array,free_page_row_leaf,wt_page_out,wt_ref_out,wt_rec_page_clean_update,evict_file,wt_cache_op,wt_checkpoint_close,wt_conn_btree_sync_and_close,wt_conn_dhandle_discard_single,wt_conn_dhandle_discard,wt_connection_close,_conn_close,mongo::WiredTigerKVEngine::cleanShutdown,shutdownServer,mongo::exitCleanly,mongo::(anonymous,boost::(anonymous,start_thread,clone

_free_update_list,free_skip_list,free_skip_array,free_page_row_leaf,wt_page_out,wt_ref_out,wt_rec_page_clean_update,evict_file,wt_cache_op,wt_checkpoint_close,wt_conn_btree_sync_and_close,wt_conn_dhandle_discard_single,wt_conn_dhandle_discard,wt_connection_close,_conn_close,mongo::WiredTigerKVEngine::cleanShutdown,shutdownServer,mongo::exitCleanly,mongo::(anonymous,boost::(anonymous,start_thread,clone

Comment by Mark Callaghan [ 16/Dec/14 ]

When this command line is used to get LSM from WT, then the shutdown is fast (10 seconds):
bin.tcm/mongod --config /data/mysql/mongo/mongo.conf --storageEngine wiredTiger --wiredTigerIndexConfig="type=lsm"

Generated at Thu Feb 08 03:41:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.