[SERVER-19673] Excessive memory allocated by WiredTiger journal Created: 30/Jul/15  Updated: 11/Jan/16  Resolved: 07/Aug/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.5
Fix Version/s: 3.0.6, 3.1.7

Type: Bug Priority: Critical - P2
Reporter: Frederick Cheung Assignee: Susan LoVerso
Resolution: Done Votes: 2
Labels: WTmem, mms-s
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

amazon linux 2015.03 on r3.2xl instances


Attachments: collstats (HTML), log (HTML), mongod.conf
Issue Links:
Depends
depends on WT-2007 Consider how much memory is consumed ... Closed
Duplicate
is duplicated by SERVER-17456 Mongodb 3.0 wiredTiger storage engine... Closed
Related
is related to SERVER-19339 crash with this log Closed
Backwards Compatibility: Fully Compatible
Operating System: Linux

 Description   

We have a MongoDB replica set consisting of 2 servers + 1 arbiter, all running MongoDB 3.0.4 with the WiredTiger storage engine.

I updated the secondary to 3.0.5. It appeared to start syncing from the primary (its state transitioned to SECONDARY and the logs said it was syncing from the primary), but it crashed after about 40 seconds with this output in the logs:

2015-07-30T19:56:36.491+0000 E STORAGE  [rsSync] WiredTiger (12) [1438286196:491627][2537:0x7f07a2c1b700], session.commit_transaction: memory allocation: Cannot allocate memory
2015-07-30T19:56:36.526+0000 I -        [rsSync] Invariant failure: s->commit_transaction(s, NULL) resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp 305
2015-07-30T19:56:36.542+0000 I CONTROL  [rsSync] 
 0xf77ca9 0xf16b61 0xefd29a 0xd9727c 0xd9738a 0x930c3c 0xc512b2 0xcbba40 0xcbcc68 0xcb53af 0xfc4e64 0x7f07b3f5cdf3 0x7f07b2a101bd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B77CA9"},{"b":"400000","o":"B16B61"},{"b":"400000","o":"AFD29A"},{"b":"400000","o":"99727C"},{"b":"400000","o":"99738A"},{"b":"400000","o":"530C3C"},{"b":"400000","o":"8512B2"},{"b":"400000","o":"8BBA40"},{"b":"400000","o":"8
BCC68"},{"b":"400000","o":"8B53AF"},{"b":"400000","o":"BC4E64"},{"b":"7F07B3F55000","o":"7DF3"},{"b":"7F07B291A000","o":"F61BD"}],"processInfo":{ "mongodbVersion" : "3.0.5", "gitVersion" : "8bc4ae20708dbb493cb09338d9e7be6698e4a3a3", "uname" : { "sysname" : "
Linux", "release" : "3.14.44-32.39.amzn1.x86_64", "version" : "#1 SMP Thu Jun 11 20:33:38 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "E8A046532FE2C5A8474250BA8E5B72C6C84B9D68" }, { "b" : "7FFC6EC17000", "elfTy
pe" : 3, "buildId" : "3C9325999495818299EC622C63D9330D234E02C9" }, { "b" : "7F07B3F55000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "D48D3E6672A77B603B402F661BABF75E90AD570B" }, { "b" : "7F07B3CE8000", "path" : "/usr/lib64/libssl.so.10",
 "elfType" : 3, "buildId" : "22480480235F3B1C6C2E5E5953949728676D3796" }, { "b" : "7F07B3903000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "F1C146B78505646930DD9003AA2B3622C5226D1B" }, { "b" : "7F07B36FB000", "path" : "/lib64/librt.so.1"
, "elfType" : 3, "buildId" : "E81013CBFA409053D58A65A0653271AB665A4619" }, { "b" : "7F07B34F7000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "62A8842157C62F95C3069CBF779AFCC26577A99A" }, { "b" : "7F07B31F3000", "path" : "/usr/lib64/libstdc++.s
o.6", "elfType" : 3, "buildId" : "DD6383EEAC49E9BAA9E3D1080AE932F42CF8A385" }, { "b" : "7F07B2EF1000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "5F97F8F8E5024E29717CF35998681F84D4A22D45" }, { "b" : "7F07B2CDB000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "C52958E393BDF8E8D090F36DE0F4E620D8736FBF" }, { "b" : "7F07B291A000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "DF6DA145A649EA093507A635AF383F608E7CE3F2" }, { "b" : "7F07B4171000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "6F90843B9087FE91955FEB0355EB0858EF9E97B2" }, { "b" : "7F07B26D7000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "DE5A9F7A11A0881CB64E375F4DDCA58028F0FAF8" }, { "b" : "7F07B23F2000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "A3E43FC66908AC8B00773707FECA3B1677AFF311" }, { "b" : "7F07B21EF000", "path" : "/usr/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "622F315EB5CB2F791E9B64020692EBA98195D06D" }, { "b" : "7F07B1FC4000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "B10FBFEC246C4EAD1719D16090D0BE54904BBFC9" }, { "b" : "7F07B1DAE000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "89C6AF118B6B4FB6A73AE1813E2C8BDD722956D1" }, { "b" : "7F07B1BA3000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "7292C0673D7C116E3389D3FFA67087A6B9287A71" }, { "b" : "7F07B19A0000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "37A58210FA50C91E09387765408A92909468D25B" }, { "b" : "7F07B1786000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "6A7DA1CED90F65F27CB7B5BACDBB1C386C05F592" }, { "b" : "7F07B1565000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "F5054DC94443326819FBF3065CFDF5E4726F57EE" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf77ca9]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf16b61]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0xDA) [0xefd29a]
 mongod(_ZN5mongo22WiredTigerRecoveryUnit9_txnCloseEb+0x2EC) [0xd9727c]
 mongod(_ZN5mongo22WiredTigerRecoveryUnit7_commitEv+0x8A) [0xd9738a]
 mongod(_ZN5mongo15WriteUnitOfWork6commitEv+0x1C) [0x930c3c]
 mongod(_ZN5mongo4repl15writeOpsToOplogEPNS_16OperationContextERKSt5dequeINS_7BSONObjESaIS4_EE+0x282) [0xc512b2]
 mongod(_ZN5mongo4repl8SyncTail10multiApplyEPNS_16OperationContextERSt5dequeINS_7BSONObjESaIS5_EE+0x460) [0xcbba40]
 mongod(_ZN5mongo4repl8SyncTail16oplogApplicationEv+0x118) [0xcbcc68]
 mongod(_ZN5mongo4repl13runSyncThreadEv+0x1EF) [0xcb53af]
 mongod(+0xBC4E64) [0xfc4e64]
 libpthread.so.0(+0x7DF3) [0x7f07b3f5cdf3]
 libc.so.6(clone+0x6D) [0x7f07b2a101bd]
-----  END BACKTRACE  -----
2015-07-30T19:56:36.542+0000 I -        [rsSync] 
 
***aborting after invariant() failure

I restarted mongod and the same thing happened. I downgraded to MongoDB 3.0.4 and the secondary was able to rejoin the replica set and catch up as normal.

The secondary wasn't under any read load at the time, but the primary was under heavy write load.



 Comments   
Comment by Frederick Cheung [ 11/Aug/15 ]

Ah, even on our 3.0.4 instance that metric was 13G on our secondary.

Comment by Bruce Lucas (Inactive) [ 11/Aug/15 ]

The key statistic for diagnosing this issue is the following:

db.serverStatus().wiredTiger.log['total log buffer size']

This is memory used by mongod for the WiredTiger journal, and it is not limited by the WT cache setting. If it is large (e.g., several GB), an OOM kill may result. The fix for this issue limits the amount of memory used by the WT journal to (I believe) 32 MB.
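
For reference, a quick way to check this from the mongo shell (a minimal sketch; only the statistic name comes from this ticket, the formatting and the "GB range" rule of thumb are illustrative):

// Run against the affected node; reports WT journal buffer memory in MB.
// Values in the GB range on pre-3.0.6 builds suggest this issue.
var bufBytes = db.serverStatus().wiredTiger.log['total log buffer size'];
print('WT log buffer memory: ' + (bufBytes / 1024 / 1024).toFixed(1) + ' MB');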

Comment by Ramon Fernandez Marina [ 07/Aug/15 ]

The fix for WT-2007 made it into WiredTiger's mongodb-3.0 branch:

commit fad2eacb13ebeb0dfd3a001ca45d82eff83a9af3
Author: sueloverso <sue@mongodb.com>
Date:   Wed Jul 15 10:20:59 2015 -0400
 
    WT-2007 Statically allocate log slot buffers to a maximum size.
    
    Merge pull request #2062 from wiredtiger/log-buf-static-max
    
    (cherry picked from commit e1d8bc7fa41efe1d4765cca2b9857907802fcf1b)

This was merged into mongo's master and v3.0 branches, so I'm resolving this ticket.

Comment by Michael Cahill (Inactive) [ 03/Aug/15 ]

The backport of WT-2007 has been completed; the current (small) set of WT updates for 3.0.6 is being tested here:

https://evergreen.mongodb.com/version/55bf0d8a3ff12268ad000030_0

Comment by Daniel Pasette (Inactive) [ 31/Jul/15 ]

sue.loverso has identified WT-2007 as the fix for this issue. The problem was exacerbated by the increase in the number of log slots to 128 in 3.0.5 (as part of WT-1989), since more slots meant more per-slot log buffers that could each grow large, but it was present in earlier versions as well.

Comment by Daniel Pasette (Inactive) [ 31/Jul/15 ]

Not sure if this is a duplicate, but same stack trace as SERVER-19339.

Comment by Daniel Pasette (Inactive) [ 31/Jul/15 ]

I managed to reproduce this with the additional information from fcheung (thank you!) and by adding an abort() in WiredTiger's error-handling code.
The repro just does a multi-threaded insert of 90k documents into a 2-node replica set with an arbiter. Then I bring the secondary up and down manually.
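
A minimal single-threaded sketch of the insert portion, run from the mongo shell against the primary (the namespace and document shape are illustrative, and the original repro used multiple threads):

// Insert 90k ~20KB documents via bulk writes to build up a large oplog backlog.
var coll = db.getSiblingDB('repro').docs;      // hypothetical namespace
var payload = new Array(20 * 1024).join('x');  // ~20KB filler string
for (var batch = 0; batch < 90; batch++) {
    var bulk = coll.initializeUnorderedBulkOp();
    for (var i = 0; i < 1000; i++) {
        bulk.insert({ batch: batch, i: i, payload: payload });
    }
    bulk.execute();
}
// Then restart the secondary and watch it apply the backlog.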

We're still looking for the root cause.

diff --git a/src/third_party/wiredtiger/src/support/err.c b/src/third_party/wiredtiger/src/support/err.c
index fc6569c..e24ddb5 100644
--- a/src/third_party/wiredtiger/src/support/err.c
+++ b/src/third_party/wiredtiger/src/support/err.c
@@ -306,6 +306,9 @@ __wt_err(WT_SESSION_IMPL *session, int error, const char *fmt, ...)
 {
        va_list ap;
 
+    if (error == ENOMEM)
+        abort();
+
        /*
         * Ignore error returns from underlying event handlers, we already have
         * an error value to return.

dan:(git)3.0-mongo[verbose]/$  addr2line -p -C -f -i -e ./mongod  0xf3db39 0xf3d402 0xf3d7b6 0x7f57a7e84d40 0x7f57a7e84cc9 0x7f57a7e880d8 0x13804b7 0x134dc9a 0x1384e2d 0x133a911 0x1338aaf 0x1387880 0x137d014 0xd6b5de 0xd6b93a 0x90dedc 0xc25cf2 0xc90150 0xc91368 0xc8994f 0xf8bb88 0x7f57a8e47182 0x7f57a7f4847d
mongo::printStackTrace(std::ostream&) at /home/dan/code/3.0-mongo/src/mongo/util/stacktrace_posix.cpp:105
printSignalAndBacktrace at /home/dan/code/3.0-mongo/src/mongo/util/signal_handlers_synchronous.cpp:127
abruptQuit at /home/dan/code/3.0-mongo/src/mongo/util/signal_handlers_synchronous.cpp:194
?? ??:0
?? ??:0
?? ??:0
__wt_err at /home/dan/code/3.0-mongo/src/third_party/wiredtiger/src/support/err.c:310
__wt_realloc at /home/dan/code/3.0-mongo/src/third_party/wiredtiger/src/os_posix/os_alloc.c:122
__wt_buf_grow_worker at /home/dan/code/3.0-mongo/src/third_party/wiredtiger/src/support/scratch.c:48
__wt_log_slot_grow_buffers at /home/dan/code/3.0-mongo/src/third_party/wiredtiger/src/log/log_slot.c:378
__log_write_internal at /home/dan/code/3.0-mongo/src/third_party/wiredtiger/src/log/log.c:1827
 (inlined by) __wt_log_write at /home/dan/code/3.0-mongo/src/third_party/wiredtiger/src/log/log.c:1737
__wt_txn_commit at /home/dan/code/3.0-mongo/src/third_party/wiredtiger/src/txn/txn.c:418
__session_commit_transaction at /home/dan/code/3.0-mongo/src/third_party/wiredtiger/src/session/session_api.c:796
mongo::WiredTigerRecoveryUnit::_txnClose(bool) at /home/dan/code/3.0-mongo/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp:305
mongo::WiredTigerRecoveryUnit::_commit() at /home/dan/code/3.0-mongo/src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp:116
mongo::WriteUnitOfWork::commit() at /home/dan/code/3.0-mongo/src/mongo/db/operation_context.h:161
mongo::repl::writeOpsToOplog(mongo::OperationContext*, std::deque<mongo::BSONObj, std::allocator<mongo::BSONObj> > const&) at /home/dan/code/3.0-mongo/src/mongo/db/repl/oplog.cpp:424
mongo::repl::SyncTail::multiApply(mongo::OperationContext*, std::deque<mongo::BSONObj, std::allocator<mongo::BSONObj> >&) at /home/dan/code/3.0-mongo/src/mongo/db/repl/sync_tail.cpp:300
mongo::repl::SyncTail::oplogApplication() at /home/dan/code/3.0-mongo/src/mongo/db/repl/sync_tail.cpp:499 (discriminator 1)
mongo::repl::runSyncThread() at /home/dan/code/3.0-mongo/src/mongo/db/repl/rs_sync.cpp:135 (discriminator 1)
thread_proxy at /home/dan/code/3.0-mongo/src/third_party/boost/libs/thread/src/pthread/thread.cpp:133
?? ??:0
?? ??:0

Comment by Frederick Cheung [ 31/Jul/15 ]

Our deployment is a 2 node + arbiter replica set. The nodes are r3.2xl EC2 instances with a 4TB EBS volume.

The write workload is pure insert: we insert documents into a new collection, and when we are done, clients start reading from the newly created collection (the clients have a read preference of primaryPreferred, so they shouldn't be reading from the secondary). I've attached the stats for one of those collections. It is one of our biggest; the smallest ones have 200-300k documents of ~20k each.

Comment by Frederick Cheung [ 31/Jul/15 ]

collection stats from one of our collections

Comment by Daniel Pasette (Inactive) [ 31/Jul/15 ]

fcheung/amit.ambasta@delhivery.com: thanks for the additional information. It does seem unlikely to be a real out-of-memory error, but that is the message the storage engine is returning. I'm wondering if you could provide some more details about your deployment.

fcheung mentioned that the primary was under heavy write load. Is this also the case for you, Amit?

Can you provide details on the collections involved (collection stats will provide average document size, number of indexes, count, etc.) and also the types of operations (updates, inserts, deletes)?
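
For example, from the mongo shell (database and collection names are placeholders):

// avgObjSize, count, nindexes, and totalIndexSize are the fields of interest.
db.getSiblingDB('mydb').mycollection.stats(1024)  // sizes scaled to KB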

Thanks
Dan

Comment by Amit Ambasta [ 31/Jul/15 ]

We are facing the same issue after upgrading to 3.0.5:

2015-07-31T11:46:31.596+0000 I REPL     [ReplicationExecutor] syncing from: 172.0.8.212:27017
2015-07-31T11:46:31.619+0000 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to 172.0.8.212:27017
2015-07-31T11:47:26.151+0000 E STORAGE  [rsSync] WiredTiger (12) [1438343246:151304][12483:0x7fcb78548700], session.commit_transaction: memory allocation: Cannot allocate memory
2015-07-31T11:47:26.154+0000 I -        [rsSync] Invariant failure: s->commit_transaction(s, NULL) resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_recovery_unit.cpp 305
2015-07-31T11:47:26.193+0000 I CONTROL  [rsSync] 
 0xf5ba59 0xefa911 0xee10ba 0xd7b0bc 0xd7b1ca 0x914e2c 0xc35182 0xc9f8f0 0xca0b18 0xc9929f 0xfa9724 0x7fcb81b8c182 0x7fcb8065347d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B5BA59"},{"b":"400000","o":"AFA911"},{"b":"400000","o":"AE10BA"},{"b":"400000","o":"97B0BC"},{"b":"400000","o":"97B1CA"},{"b":"400000","o":"514E2C"},{"b":"400000","o":"835182"},{"b":"400000","o":"89F8F0"},{"b":"400000","o":"8A0B18"},{"b":"400000","o":"89929F"},{"b":"400000","o":"BA9724"},{"b":"7FCB81B84000","o":"8182"},{"b":"7FCB80559000","o":"FA47D"}],"processInfo":{ "mongodbVersion" : "3.0.5", "gitVersion" : "8bc4ae20708dbb493cb09338d9e7be6698e4a3a3", "uname" : { "sysname" : "Linux", "release" : "3.13.0-48-generic", "version" : "#80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "695FC6828398A9DB1F99718671147885B5ED116D" }, { "b" : "7FFFDF2DC000", "elfType" : 3, "buildId" : "341443CC258F9F27709CAA36C8CB321E4BBFFF95" }, { "b" : "7FCB81B84000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7FCB81925000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "A20EFFEC993A8441FA17F2079F923CBD04079E19" }, { "b" : "7FCB8154A000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "F000D29917E9B6E94A35A8F02E5C62846E5916BC" }, { "b" : "7FCB81342000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7FCB8113E000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7FCB80E3A000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "4BF6F7ADD8244AD86008E6BF40D90F8873892197" }, { "b" : "7FCB80B34000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7FCB8091E000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }, { "b" : "7FCB80559000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2B89E576D5AE213C" }, { "b" : "7FCB81DA2000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf5ba59]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xefa911]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0xDA) [0xee10ba]
 mongod(_ZN5mongo22WiredTigerRecoveryUnit9_txnCloseEb+0x2EC) [0xd7b0bc]
 mongod(_ZN5mongo22WiredTigerRecoveryUnit7_commitEv+0x8A) [0xd7b1ca]
 mongod(_ZN5mongo15WriteUnitOfWork6commitEv+0x1C) [0x914e2c]
 mongod(_ZN5mongo4repl15writeOpsToOplogEPNS_16OperationContextERKSt5dequeINS_7BSONObjESaIS4_EE+0x282) [0xc35182]
 mongod(_ZN5mongo4repl8SyncTail10multiApplyEPNS_16OperationContextERSt5dequeINS_7BSONObjESaIS5_EE+0x460) [0xc9f8f0]
 mongod(_ZN5mongo4repl8SyncTail16oplogApplicationEv+0x118) [0xca0b18]
 mongod(_ZN5mongo4repl13runSyncThreadEv+0x1EF) [0xc9929f]
 mongod(+0xBA9724) [0xfa9724]
 libpthread.so.0(+0x8182) [0x7fcb81b8c182]
 libc.so.6(clone+0x6D) [0x7fcb8065347d]
-----  END BACKTRACE  -----
2015-07-31T11:47:26.193+0000 I -        [rsSync] 
 
***aborting after invariant() failure

Comment by Frederick Cheung [ 31/Jul/15 ]

Hi, I've attached the config file (and the log, since Jira seems to have mangled the backtrace section). The only thing running besides mongod that isn't part of the stock Amazon Linux distribution is a New Relic server-monitoring daemon. There shouldn't have been memory pressure: mongod had basically all of the 61 GB on that instance to play with, and downgrading to MongoDB 3.0.4 fixed the issue.

Comment by Daniel Pasette (Inactive) [ 30/Jul/15 ]

The log message says "Cannot allocate memory".
Are any processes other than the single mongod running on that server? Can you include your mongod startup parameters (configuration file or command-line params)?
