[SERVER-16356] Idle mongod with WiredTiger LSM is using 100% CPU Created: 30/Nov/14  Updated: 10/Feb/17  Resolved: 15/Dec/14

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 2.8.0-rc1
Fix Version/s: 2.8.0-rc3

Type: Bug Priority: Major - P3
Reporter: Steve Briskin (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: wiredtiger
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

A 2.8.0-rc1 mongod with wired tiger in LSM mode is using 100% CPU. It isn't doing any user operations.

Start command:

/usr/share/mongodb-linux-x86_64-2.8.0-rc1/bin/mongod --dbpath ./wttiger3/ --port 30020 --logpath ./wttiger3/log.log --storageEngine=wiredTiger --wiredTigerCollectionConfig type=lsm,block_compressor= --fork

Top:

top - 16:12:02 up 16 days,  7:05,  7 users,  load average: 1.19, 1.15, 1.24
Tasks: 354 total,   2 running, 351 sleeping,   0 stopped,   1 zombie
%Cpu(s): 11.2 us,  0.6 sy,  0.0 ni, 88.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  24681876 total, 24451416 used,   230460 free,    29884 buffers
KiB Swap:  8388604 total,   581668 used,  7806936 free.   842948 cached Mem
 
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 6778 steve     20   0  0.120t 0.011t 254492 S 123.9 45.8 138:54.59 mongod
23246 steve     20   0 1540472 779192  17560 S  21.3  3.2 572:54.92 chrome
17126 steve     20   0  903068 196988  10688 S  10.3  0.8 494:15.92 chrome

> db.currentOp()
{ "inprog" : [ ] }

> db.serverBuildInfo()
{
	"version" : "2.8.0-rc1",
	"gitVersion" : "c8171e7f969519af8b87a43425ae291ee69a0191",
	"OpenSSLVersion" : "",
	"sysInfo" : "Linux build14.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49",
	"loaderFlags" : "-fPIC -pthread -Wl,-z,now -rdynamic",
	"compilerFlags" : "-Wnon-virtual-dtor -Woverloaded-virtual -std=c++11 -fPIC -fno-strict-aliasing -ggdb -pthread -Wall -Wsign-compare -Wno-unknown-pragmas -Winvalid-pch -pipe -Werror -O3 -Wno-unused-local-typedefs -Wno-unused-function -Wno-deprecated-declarations -Wno-unused-but-set-variable -fno-builtin-memcmp -std=c99",
	"allocator" : "tcmalloc",
	"versionArray" : [
		2,
		8,
		0,
		-9
	],
	"javascriptEngine" : "V8",
	"bits" : 64,
	"debug" : false,
	"maxBsonObjectSize" : 16777216,
	"ok" : 1
}



 Comments   
Comment by Michael Cahill [ 12/Dec/14 ]

We have made some improvements in this code that will appear in today's drop of WT. That said, this is an LSM issue, which we don't plan to support in 2.8.

Comment by Steve Briskin (Inactive) [ 01/Dec/14 ]

I was able to repro it: 4 threads doing inserts (1M 20KB documents), each writing to a single collection in a different DB. And 1 thread doing a $natural order read and inserts. 12 hours after end of the test the mongod is using 120% CPU (on a 12 core box) and no disk IO.

(gdb) thread apply all where
 
Thread 15 (Thread 0x7fd4135b4700 (LWP 11684)):
#0  0x00007fd4148800d1 in do_sigwait (sig=0x7fd4135b39f0, set=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:60
#1  __sigwait (set=0x1b2ae60, sig=0x7fd4135b39f0) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:97
#2  0x0000000000f40418 in ?? ()
#3  0x0000000000f8e994 in ?? ()
#4  0x00007fd414878182 in start_thread (arg=0x7fd4135b4700) at pthread_create.c:312
#5  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 14 (Thread 0x7fd412db3700 (LWP 11685)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x000000000134823f in __wt_cond_wait ()
#2  0x00000000012dc783 in ?? ()
#3  0x00007fd414878182 in start_thread (arg=0x7fd412db3700) at pthread_create.c:312
#4  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 13 (Thread 0x7fd4125b2700 (LWP 11686)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x000000000134823f in __wt_cond_wait ()
#2  0x0000000001317184 in ?? ()
#3  0x00007fd414878182 in start_thread (arg=0x7fd4125b2700) at pthread_create.c:312
#4  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 12 (Thread 0x7fd411db1700 (LWP 11687)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x000000000134823f in __wt_cond_wait ()
#2  0x0000000001314ffc in ?? ()
#3  0x00007fd414878182 in start_thread (arg=0x7fd411db1700) at pthread_create.c:312
#4  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 11 (Thread 0x7fd4115b0700 (LWP 11688)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x000000000134823f in __wt_cond_wait ()
#2  0x0000000001312da6 in ?? ()
#3  0x00007fd414878182 in start_thread (arg=0x7fd4115b0700) at pthread_create.c:312
#4  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 10 (Thread 0x7fd410daf700 (LWP 11689)):
#0  0x00007fd4139708e3 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000001349228 in __wt_sleep ()
#2  0x000000000133cceb in ?? ()
#3  0x00007fd414878182 in start_thread (arg=0x7fd410daf700) at pthread_create.c:312
---Type <return> to continue, or q <return> to quit---
#4  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 9 (Thread 0x7fd4105ae700 (LWP 11690)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x000000000134823f in __wt_cond_wait ()
#2  0x00000000013430f7 in ?? ()
#3  0x00007fd414878182 in start_thread (arg=0x7fd4105ae700) at pthread_create.c:312
#4  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 8 (Thread 0x7fd40fdad700 (LWP 11691)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fd41487a657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2  0x00007fd41487a480 in __GI___pthread_mutex_lock (mutex=0x4b7a480) at ../nptl/pthread_mutex_lock.c:79
#3  0x0000000001342227 in ?? ()
#4  0x0000000001342d7c in __wt_lsm_free_chunks ()
#5  0x0000000001343296 in ?? ()
#6  0x00007fd414878182 in start_thread (arg=0x7fd40fdad700) at pthread_create.c:312
#7  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 7 (Thread 0x7fd40f5ac700 (LWP 11692)):
#0  0x0000000001364940 in __wt_session_get_btree ()
#1  0x0000000001313f34 in __wt_conn_dhandle_close_all ()
#2  0x0000000001354575 in __wt_schema_drop ()
#3  0x000000000134223f in ?? ()
#4  0x0000000001342d7c in __wt_lsm_free_chunks ()
#5  0x0000000001343296 in ?? ()
#6  0x00007fd414878182 in start_thread (arg=0x7fd40f5ac700) at pthread_create.c:312
#7  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 6 (Thread 0x7fd40edab700 (LWP 11693)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x0000000000eb8dbb in mongo::DeadlineMonitor<mongo::V8Scope>::deadlineMonitorThread() ()
#2  0x0000000000f8e994 in ?? ()
#3  0x00007fd414878182 in start_thread (arg=0x7fd40edab700) at pthread_create.c:312
#4  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 5 (Thread 0x7fd40e5aa700 (LWP 11694)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x0000000000bf1eef in mongo::RangeDeleter::doWork() ()
#2  0x0000000000f8e994 in ?? ()
#3  0x00007fd414878182 in start_thread (arg=0x7fd40e5aa700) at pthread_create.c:312
#4  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
---Type <return> to continue, or q <return> to quit---
 
Thread 4 (Thread 0x7fd40dda9700 (LWP 11695)):
#0  0x00007fd41487fb9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000f49000 in mongo::sleepsecs(int) ()
#2  0x0000000000d6c72f in mongo::TTLMonitor::run() ()
#3  0x0000000000ed0e49 in mongo::BackgroundJob::jobBody() ()
#4  0x0000000000f8e994 in ?? ()
#5  0x00007fd414878182 in start_thread (arg=0x7fd40dda9700) at pthread_create.c:312
#6  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 3 (Thread 0x7fd40d5a8700 (LWP 11696)):
#0  0x00007fd41487fb9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000f49000 in mongo::sleepsecs(int) ()
#2  0x00000000009153a7 in mongo::ClientCursorMonitor::run() ()
#3  0x0000000000ed0e49 in mongo::BackgroundJob::jobBody() ()
#4  0x0000000000f8e994 in ?? ()
#5  0x00007fd414878182 in start_thread (arg=0x7fd40d5a8700) at pthread_create.c:312
#6  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 2 (Thread 0x7fd40cda7700 (LWP 11697)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
#1  0x0000000000ed3dc6 in bool boost::condition_variable_any::timed_wait<boost::unique_lock<boost::timed_mutex> >(boost::unique_lock<boost::timed_mutex>&, boost::posix_time::ptime const&) ()
#2  0x0000000000ed19d6 in ?? ()
#3  0x0000000000ed0e49 in mongo::BackgroundJob::jobBody() ()
#4  0x0000000000f8e994 in ?? ()
#5  0x00007fd414878182 in start_thread (arg=0x7fd40cda7700) at pthread_create.c:312
#6  0x00007fd413978fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
 
Thread 1 (Thread 0x7fd414c83bc0 (LWP 11683)):
#0  0x00007fd4139708e3 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000ef6cac in mongo::Listener::initAndListen() ()
#2  0x00000000007db8f9 in mongo::initAndListen(int) ()
#3  0x00000000007df304 in main ()
 
(gdb) info thread
  Id   Target Id         Frame 
  15   Thread 0x7fd4135b4700 (LWP 11684) "mongod" 0x00007fd4148800d1 in do_sigwait (sig=0x7fd4135b39f0, set=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/../../../../../sysdeps/unix/sysv/linux/sigwait.c:60
  14   Thread 0x7fd412db3700 (LWP 11685) "mongod" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  13   Thread 0x7fd4125b2700 (LWP 11686) "mongod" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  12   Thread 0x7fd411db1700 (LWP 11687) "mongod" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  11   Thread 0x7fd4115b0700 (LWP 11688) "mongod" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  10   Thread 0x7fd410daf700 (LWP 11689) "mongod" 0x00007fd4139708e3 in select () at ../sysdeps/unix/syscall-template.S:81
  9    Thread 0x7fd4105ae700 (LWP 11690) "mongod" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  8    Thread 0x7fd40fdad700 (LWP 11691) "mongod" 0x0000000001364940 in __wt_session_get_btree ()
  7    Thread 0x7fd40f5ac700 (LWP 11692) "mongod" __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
  6    Thread 0x7fd40edab700 (LWP 11693) "mongod" pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
  5    Thread 0x7fd40e5aa700 (LWP 11694) "mongod" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
  4    Thread 0x7fd40dda9700 (LWP 11695) "mongod" 0x00007fd41487fb9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
  3    Thread 0x7fd40d5a8700 (LWP 11696) "mongod" 0x00007fd41487fb9d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
  2    Thread 0x7fd40cda7700 (LWP 11697) "mongod" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:238
* 1    Thread 0x7fd414c83bc0 (LWP 11683) "mongod" 0x00007fd4139708e3 in select () at ../sysdeps/unix/syscall-template.S:81
 

Comment by Michael Cahill [ 01/Dec/14 ]

Seconded: if we can get gdb "thread apply all where" output from the instance, that would be a good first step to figuring out what's going on.

Comment by Scott Hernandez (Inactive) [ 30/Nov/14 ]

Yes, it should result in IO activity but it wouldn't require both to be in sync. If you have a mongodb instance showing these signs a gdb thread/stack dump with thread cpu info would be very helpful to identify the threads using the cpu.

Also, did the instance settle down over time? A graph of cpu and io over time would be very helpful to see a more full picture of activity/behavior. If you have a repro that would be very helpful as well – even if you don't have something we can run, a characterization of the workload would be good; were they all inserts for example.

Comment by Steve Briskin (Inactive) [ 30/Nov/14 ]

I wrote ~4.5M documents totaling ~100GB. But that was hours before I collected the top output. The disk was also idle. If WT was doing background activity I expected it to show up via IO activity...is that not the case?

Comment by Scott Hernandez (Inactive) [ 30/Nov/14 ]

Have you done writes on that system? If so I would call this expected behavior; the whole point of the LSM is to collect temporary log-based entries which later, in the background via other threads, write back to the main storage tree when writes/reads aren't active so reads will be faster in the future: http://source.wiredtiger.com/2.3.1/lsm.html

I think we have some stats in server status for this, but haven't looked lately. If not, we probably should...

Generated at Thu Feb 08 03:40:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.