Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-871

test3 500m killed by oom-killer

    • Type: Icon: Task Task
    • Resolution: Done
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None

      The linux oom-killer has twice killed test3 running on the 500m workload. Test3 is a 50/50 read/update mix with a pareto distribution. I have it configured such that the thread pool is divided 50/50 rather than the workload. I.e. I have 10 read-only threads and 10 update-only threads rather than a pool of 20 threads with a 50/50 ratio. Those two workloads perform very differently. At the time of the crash we were doing about 50x more updates than reads (264:13910 from test.stat for a 10 second period) and the reads show max and average latencies of > 0.5 seconds. However, a throughput rate of just 1410 ops/sec is pretty slow overall. The sample size for reads in the monitor file is probably pretty small though.

      Test 2 and 4 run fine to completion for the same duration. I looked at the sar output, which I now have running every 3 minutes (in this case 06:54 and 06:57), but the large numbers at 06:57 I believe are mostly memory changes due to the process getting killed and massive amounts of resources being reclaimed. It is possible I'm missing a nugget of "cause" in the haystack of "effect" though.

      I am rerunning with jemalloc to see if that gets killed too. The first kill from Jan 31 was about 8300 seconds into the test. Last night's kill was 4360 seconds into the run (which is 4 hours, 14400 seconds).

      The cachesize is 21Gb, and the last stats logged (exactly 1 minute before the kill) show about 18Gb currently in the cache and no LSM rows merged.

      In any case, here's the info from /var/log/messages and yes, it shows at the end the total-vm used by wtperf is 60Gb. From what I've researched, oom-killer was invoked because at 1314265.902153, 'free' fell below 'min'.

      There's a lot of info below. But if anyone has any suggestions, let me know!

      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900388] wtperf invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900408] wtperf cpuset=/ mems_allowed=0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900416] Pid: 13226, comm: wtperf Not tainted 3.4.76-65.111.amzn1.x86_64 WT-1
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900424] Call Trace:
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900440]  [<ffffffff81106c9e>] dump_header.constprop.6+0x7e/0x1b0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900451]  [<ffffffff813e8e1e>] ? _raw_spin_unlock_irqrestore+0x1e/0x30
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900461]  [<ffffffff812341c0>] ? ___ratelimit+0xa0/0x120
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900469]  [<ffffffff8110704d>] oom_kill_process.part.4.constprop.5+0x13d/0x280
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900480]  [<ffffffff811e2165>] ? security_capable_noaudit+0x15/0x20
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900491]  [<ffffffff810547d7>] ? has_capability_noaudit+0x17/0x20
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900501]  [<ffffffff81107687>] out_of_memory+0x367/0x540
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900510]  [<ffffffff8110c7e1>] __alloc_pages_nodemask+0x8e1/0x900
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900520]  [<ffffffff81142ce6>] ? alloc_pages_current+0xb6/0x120
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900528]  [<ffffffff81142ce6>] alloc_pages_current+0xb6/0x120
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900537]  [<ffffffff811032df>] __page_cache_alloc+0xcf/0xf0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900545]  [<ffffffff81105425>] generic_file_aio_read+0x505/0x760
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900556]  [<ffffffff811287fc>] ? handle_pte_fault+0x17c/0x970
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900591]  [<ffffffffa029de96>] xfs_file_aio_read+0x156/0x340 [xfs]
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900603]  [<ffffffff8115934a>] do_sync_read+0xda/0x120
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900611]  [<ffffffff811e2873>] ? security_file_permission+0x93/0xb0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900619]  [<ffffffff811597d1>] ? rw_verify_area+0x61/0xf0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900627]  [<ffffffff81159cb0>] vfs_read+0xb0/0x180
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900635]  [<ffffffff81159f42>] sys_pread64+0xa2/0xb0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900643]  [<ffffffff813f0eb9>] system_call_fastpath+0x16/0x1b
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900650] Mem-Info:
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900655] Node 0 DMA per-cpu:
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900661] CPU    0: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900667] CPU    1: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900674] CPU    2: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900680] CPU    3: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900686] CPU    4: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900692] CPU    5: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900698] CPU    6: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900703] CPU    7: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900712] CPU    8: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900719] CPU    9: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900724] CPU   10: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900730] CPU   11: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900736] CPU   12: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900742] CPU   13: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900747] CPU   14: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900753] CPU   15: hi:    0, btch:   1 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900759] Node 0 DMA32 per-cpu:
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900765] CPU    0: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900770] CPU    1: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900776] CPU    2: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900781] CPU    3: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900787] CPU    4: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900794] CPU    5: hi:  186, btch:  31 usd:   1
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900799] CPU    6: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900805] CPU    7: hi:  186, btch:  31 usd:  12
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900811] CPU    8: hi:  186, btch:  31 usd:   6
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900816] CPU    9: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900822] CPU   10: hi:  186, btch:  31 usd:  30
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900828] CPU   11: hi:  186, btch:  31 usd:  30
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900834] CPU   12: hi:  186, btch:  31 usd:   5
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900840] CPU   13: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900846] CPU   14: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900852] CPU   15: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900857] Node 0 Normal per-cpu:
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900863] CPU    0: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900870] CPU    1: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.900877] CPU    2: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902004] CPU    3: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902010] CPU    4: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902014] CPU    5: hi:  186, btch:  31 usd:  85
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902018] CPU    6: hi:  186, btch:  31 usd:  31
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902022] CPU    7: hi:  186, btch:  31 usd:  77
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902026] CPU    8: hi:  186, btch:  31 usd:  99
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902030] CPU    9: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902036] CPU   10: hi:  186, btch:  31 usd:  27
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902040] CPU   11: hi:  186, btch:  31 usd: 119
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902044] CPU   12: hi:  186, btch:  31 usd:   3
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902048] CPU   13: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902052] CPU   14: hi:  186, btch:  31 usd:  29
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902056] CPU   15: hi:  186, btch:  31 usd:   0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902063] active_anon:15371437 inactive_anon:14 isolated_anon:0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902064]  active_file:0 inactive_file:0 isolated_file:3
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902065]  unevictable:0 dirty:0 writeback:0 unstable:0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902066]  free:66785 slab_reclaimable:6482 slab_unreclaimable:9054
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902067]  mapped:75 shmem:16 pagetables:30563 bounce:0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902085] Node 0 DMA free:8064kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0k
      B active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:7808kB mlocked:0kB dirty:0kB writeback:0k
      B mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0k
      B pages_scanned:0 all_unreclaimable? yes
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902111] lowmem_reserve[]: 0 4016 61023 61023
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902120] Node 0 DMA32 free:229956kB min:2080kB low:2600kB high:3120kB active_anon:3742400
      kB inactive_anon:0kB active_file:8kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:4112640kB mlocked:0kB 
      dirty:0kB writeback:0kB mapped:52kB shmem:0kB slab_reclaimable:4496kB slab_unreclaimable:3052kB kernel_stack:48kB pagetables:7272kB unstabl
      e:0kB bounce:0kB writeback_tmp:0kB pages_scanned:194 all_unreclaimable? yes
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902145] lowmem_reserve[]: 0 0 57007 57007
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902153] Node 0 Normal free:29120kB min:29532kB low:36912kB high:44296kB active_anon:5774
      3348kB inactive_anon:56kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):12kB present:58375296kB mlock
      ed:0kB dirty:0kB writeback:0kB mapped:248kB shmem:64kB slab_reclaimable:21432kB slab_unreclaimable:33164kB kernel_stack:1376kB pagetables:1
      14980kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1217 all_unreclaimable? yes
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902179] lowmem_reserve[]: 0 0 0 0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902186] Node 0 DMA: 2*4kB 3*8kB 0*16kB 3*32kB 4*64kB 2*128kB 3*256kB 1*512kB 2*1024kB 2*
      2048kB 0*4096kB = 8064kB
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902203] Node 0 DMA32: 47*4kB 139*8kB 164*16kB 200*32kB 169*64kB 138*128kB 107*256kB 80*5
      12kB 58*1024kB 25*2048kB 3*4096kB = 230036kB
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902222] Node 0 Normal: 6055*4kB 0*8kB 0*16kB 0*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*102
      4kB 1*2048kB 0*4096kB = 28252kB
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902240] 0 total pagecache pages
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902243] 0 pages in swap cache
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902247] Swap cache stats: add 0, delete 0, find 0/0
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902251] Free swap  = 0kB
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314265.902254] Total swap = 0kB
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.099960] 15874032 pages RAM
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.099976] 332155 pages reserved
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.099994] 309 pages shared
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100003] 15474210 pages non-shared
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100008] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100030] [ 1414]     0  1414     4296      115   1     -17         -1000 udevd
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100055] [ 1536]     0  1536     4262      112   5     -17         -1000 udevd
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100072] [ 1540]     0  1540     4295      114   2     -17         -1000 udevd
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100090] [ 1896]     0  1896     2281      123  12       0             0 dhclient
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100103] [ 1939]     0  1939    22267       67  11     -17         -1000 auditd
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100119] [ 1954]     0  1954    62368      496   2       0             0 rsyslogd
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100134] [ 1965]     0  1965     3273       92   1       0             0 irqbalance
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100153] [ 1976]    81  1976     5354       58   0       0             0 dbus-daemon
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100169] [ 2002]     0  2002    47324      201   0       0             0 cupsd
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100185] [ 2039]     0  2039    19420      196   0     -17         -1000 sshd
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100200] [ 2057]    38  2057     7652      147  12       0             0 ntpd
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100214] [ 2072]     0  2072    22259      424   0       0             0 sendmail
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100227] [ 2079]    51  2079    20124      360   5       0             0 sendmail
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100240] [ 2087]     0  2087    29315      153   9       0             0 crond
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100256] [ 2097]     0  2097     5388       46   7       0             0 atd
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100275] [ 2116]     0  2116     1022       23   1       0             0 agetty
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100290] [ 2117]     0  2117     1019       22   0       0             0 mingetty
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100304] [ 2121]     0  2121     1019       23   3       0             0 mingetty
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100318] [ 2124]     0  2124     1019       23   8       0             0 mingetty
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100332] [ 2126]     0  2126     1019       23   9       0             0 mingetty
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100351] [ 2128]     0  2128     1019       22   7       0             0 mingetty
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100366] [ 2130]     0  2130     1019       22  10       0             0 mingetty
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100385] [26156]   503 26156    29771      173  12       0             0 screen
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100398] [26157]   503 26157    27643      142   1       0             0 bash
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100411] [ 4501]   503  4501    27051       60   3       0             0 runwt1234.sh
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100419] [ 4561]   503  4561    29735      132  14       0             0 screen
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100428] [ 4562]   503  4562    27644      143  15       0             0 bash
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100437] [13198]   503 13198 15712121 15367759   9       0             0 wtperf
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100445] Out of memory: Kill process 13198 (wtperf) score 990 or sacrifice child
      Feb  6 06:56:26 ip-10-148-146-194 kernel: [1314266.100454] Killed process 13198 (wtperf) total-vm:62848484kB, anon-rss:61471036kB, file-rss
      :0kB
      

        1. monitor.png
          monitor.png
          7 kB
        2. read latency1.png
          read latency1.png
          13 kB
        3. read latency1.png
          read latency1.png
          12 kB
        4. update latency1.png
          update latency1.png
          12 kB

            Assignee:
            Unassigned Unassigned
            Reporter:
            sue.loverso@mongodb.com Susan LoVerso
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: