- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Fix Version/s: None
- Affects Version/s: None
- Component/s: None
- Labels: None
- Operating System: ALL
- After some time we start seeing mongostat ticks with very high (sometimes >100) locked %
- Throughput falls, CPU usage skyrockets, and I/O stays mostly idle
- The machine becomes very unresponsive
- The database never totally stops processing requests, but it is extremely slow even to answer serverStatus()
- Calling db.runCommand({ closeAllDatabases: 1 }) resolves the problem (see the sketch after this list)
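For reference, a rough sketch of the check and the workaround as we run them from the shell; the host/port and the assumption that closeAllDatabases is issued against the admin database are illustrative, not confirmed specifics of this report:
# Hypothetical host/port; time how long serverStatus() takes to answer
mongo --host localhost --port 27017 --eval 'var t = new Date(); db.serverStatus(); print("serverStatus took " + (new Date() - t) + " ms")'
# Workaround: close all open database files (issued against the admin database here)
mongo --host localhost --port 27017 admin --eval 'printjson(db.runCommand({ closeAllDatabases: 1 }))'
After the second command the server recovers, as the second mongostat capture further down shows.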
We have not had much luck isolating the cause of this behaviour, but it does appear to be triggered by heavy reads; we have tested totally saturating the server with writes without reproducing it.
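The write-saturation test was nothing more elaborate than running several insert loops in parallel; the following is purely illustrative of that kind of load (database, collection and document shape are made up for this sketch), not our exact script:
# Purely illustrative write-only load; run several of these in parallel
mongo --host localhost --port 27017 test --eval '
  for (var i = 0; i < 1000000; i++) {
    db.loadtest.insert({ i: i, payload: new Array(512).join("x"), ts: new Date() });
  }
'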
The timeline that most recently triggered this problem is:
- Two slaves doing their initial sync
- Moderate read/write load (<1 MB/s)
We have tried running with numactl --interleave=all without any change (it is enabled in this example).
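For context, mongod is started under numactl roughly as follows; the dbpath, logpath and other options here are illustrative, not our exact command line:
# Illustrative startup: interleave mongod's memory across all NUMA nodes
numactl --interleave=all /usr/bin/mongod --dbpath /data/db --logpath /var/log/mongod.log --fork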
#################################################################
During "incident":
Mongostat:
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
0 0 0 0 0 6 0 191g 382g 11.6g 0 24 0 2|1 1|2 632b 3k 28 test M 15:47:53
18 4 2 0 0 10 0 191g 382g 11.6g 1 0 0 2|3 1|3 4k 4k 28 test M 15:47:57
1 8 2 0 3 8 1 191g 382g 11.6g 1 231 0 5|0 4|1 1k 30k 28 test M 15:48:01
23 1 2 0 2 8 0 191g 382g 11.6g 10 99.5 0 1|0 1|1 3k 4k 28 test M 15:48:05
1 0 1 0 0 5 0 191g 382g 11.6g 0 0 0 2|3 1|3 1k 3k 28 test M 15:48:08
0 1 0 0 0 9 0 191g 382g 11.6g 0 241 0 3|1 2|2 1k 4k 28 test M 15:48:14
1 1 0 0 0 7 0 191g 382g 11.6g 0 0 0 3|2 2|2 931b 3k 28 test M 15:48:17
1 3 2 0 2 6 0 191g 382g 11.6g 2 264 0 1|0 1|1 1k 28k 28 test M 15:48:21
26 1 1 0 11 5 0 191g 382g 11.6g 65 75.6 0 0|0 1|0 4k 2k 28 test M 15:48:24
0 9 3 0 7 19 0 191g 382g 11.7g 40 22.2 0 0|0 1|0 3k 9k 28 test M 15:48:39
0 0 0 0 0 8 0 191g 382g 11.7g 0 0 0 3|3 2|3 759b 3k 28 test M 15:48:44
3 2 1 0 0 5 0 191g 382g 11.7g 27 174 0 4|0 4|0 2k 2k 28 test M 15:48:47
0 1 1 0 6 4 0 191g 382g 11.7g 546 59.7 0 0|0 1|0 963b 391k 28 test M 15:48:48
0 0 0 0 7 3 0 191g 382g 11.8g 425 0 0 0|0 2|0 304b 4m 28 test M 15:48:49
0 1 1 0 4 4 0 191g 382g 11.8g 250 0 0 0|0 1|0 489b 3k 28 test M 15:48:50
3 0 1 0 0 3 0 191g 382g 11.8g 372 0 0 0|0 1|0 1k 1k 28 test M 15:48:51
0 1 2 0 1 7 0 191g 382g 11.8g 121 0 0 0|1 1|1 821b 3k 28 test M 15:48:56
6 5 2 0 2 15 1 191g 382g 11.8g 78 0 0 0|0 1|0 2k 30k 28 test M 15:49:06
11 4 1 0 2 6 0 191g 382g 11.8g 42 3.7 0 0|1 1|1 2k 28k 28 test M 15:49:07
iostat:
avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 7.66 0.00 0.00 92.34
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 5.50 102.00 0.00 204 0
sdb 0.00 0.00 0.00 0 0
sdc 0.00 0.00 0.00 0 0
sdd 4.00 0.00 2048.00 0 4096
sde 0.00 0.00 0.00 0 0
sdf 1.00 0.00 8.00 0 16
dm-0 0.00 0.00 0.00 0 0
vmstat:
procs ----------memory--------- --swap- ----io--- system- ---cpu---
r b swpd free buff cache si so bi bo in cs us sy id wa
1 0 88 77448 21336 23876708 0 0 768 189 120 87 0 6 94 1
3 0 88 76748 21352 23876740 0 0 162 1090 672 833 0 14 86 0
1 0 88 76696 21352 23876628 0 0 0 2164 634 738 0 5 95 0
1 0 88 76952 21360 23876860 0 0 0 2084 666 834 0 9 91 0
1 0 88 77200 21360 23876860 0 0 0 1034 591 725 0 10 90 0
1 0 88 77448 21372 23876852 0 0 0 2062 669 826 0 10 90 0
1 0 88 77572 21384 23876848 0 0 64 2164 718 920 0 5 95 0
1 0 88 77572 21384 23876852 0 0 0 1028 550 655 0 7 93 0
1 0 88 78044 21392 23876988 0 0 0 2056 568 686 0 6 94 0
1 0 88 77392 21392 23876908 0 0 236 1038 719 882 0 10 89 0
2 0 88 77408 21392 23876904 0 0 0 2048 600 706 0 8 92 0
2 0 88 77536 21392 23876908 0 0 0 2054 606 703 0 6 94 0
1 0 88 77872 21408 23876912 0 0 0 1050 657 749 0 12 87 0
The log file shows lots of "serverStatus was very slow: ..." messages.
#################################################################
After calling closeAllDatabases:
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
0 152 1 0 0 5 0 191g 382g 108m 0 0 0 0|0 0|0 19k 399k 21 test M 15:51:59
0 152 1 0 0 7 0 191g 382g 108m 0 0 0 0|0 0|0 19k 393k 21 test M 15:52:00
0 294 0 0 0 6 0 191g 382g 108m 0 0 0 0|0 0|0 37k 788k 21 test M 15:52:01
29 172 0 0 2 7 0 191g 382g 108m 0 0 0 0|0 0|0 30k 516k 21 test M 15:52:02
11 180 0 0 0 6 0 191g 382g 108m 1 0.5 0 0|0 0|0 23k 431k 21 test M 15:52:03
0 193 1 0 0 9 0 191g 382g 108m 0 0 0 0|0 0|0 29k 515k 21 test M 15:52:04
0 192 1 0 0 7 0 191g 382g 108m 0 0 0 0|0 0|0 29k 519k 21 test M 15:52:05
0 260 0 0 0 8 0 191g 382g 108m 0 0 0 0|0 0|0 34k 743k 21 test M 15:52:06
0 221 0 0 0 6 0 191g 382g 108m 0 0 0 0|0 0|0 31k 537k 21 test M 15:52:07
0 179 1 0 0 6 0 191g 382g 108m 0 0 0 0|0 0|0 28k 508k 21 test M 15:52:08
CPU returns to mostly idle under this load.