[SERVER-17364] Idle cursors shouldn't hold storage engine resources Created: 24/Feb/15  Updated: 07/Apr/23  Resolved: 16/Jul/15

Status: Closed
Project: Core Server
Component/s: Querying, WiredTiger
Affects Version/s: None
Fix Version/s: 3.1.6

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Mathias Stearn
Resolution: Done Votes: 5
Labels: WTplaybook
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-17334 Runs out of sessions on capped collec... Closed
is duplicated by SERVER-19817 mongod crash "cannot allocate memory"... Closed
is duplicated by SERVER-19061 insert data fast, server crash with c... Closed
is duplicated by SERVER-26673 MongoDB Aborts after more than 20000 ... Closed
Related
related to SERVER-40161 Make failure of conn->open_session in... Closed
is related to SERVER-17386 Cursor cache causes excessive memory ... Closed
is related to SERVER-18788 wiredtiger 10x slower than mmap repli... Closed
is related to SERVER-20111 Explain helpers report incorrect plan... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Quint Iteration 5, Quint Iteration 6
Participants:

 Description   

Currently each idle ClientCursor holds on to its original RecoveryUnit, so that getMores use the same session as the operation that created the cursor. This is a problem because sessions are relatively heavy-weight and there is a finite supply of them, at least for WiredTiger. We will change getMores to always use their own RecoveryUnit rather than swapping in the original. This requires some changes to the WT SDI::Cursor and RecordIterator implementations so they are prepared for the RecoveryUnit to change underneath them.
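For illustration, here is a hypothetical mongo shell sketch (not part of the original report; db.bigcollection is a stand-in name): each server-side cursor left idle between batches pins a session, so holding many unexhausted cursors open drives the open session count toward the limit.

// Hypothetical sketch: hold many server-side cursors open without exhausting
// them, then observe WiredTiger session usage via serverStatus.
var cursors = [];
for (var i = 0; i < 1000; i++) {
    var c = db.bigcollection.find().batchSize(2); // small batches keep the
    c.next();                                     // server-side cursor alive
    cursors.push(c);                              // keep a reference so the shell doesn't close it
}
printjson(db.serverStatus().wiredTiger.session);
// e.g. { "open cursor count" : ..., "open session count" : ... }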

ORIGINAL TITLE AND DESCRIPTION: Too many open WT_SESSIONs causes fassert

Too many open cursors results in too many open sessions and an fassert. Probably we should just fail the cursor open in this case, not fassert.

2015-02-24T11:48:37.426-0500 I -        [conn8] Invariant failure: ret resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 49
2015-02-24T11:48:37.426-0500 E STORAGE  [conn6] WiredTiger (12) [1424796517:426942][50894:0x11193d000], connection.open_session: only configured to support 20010 sessions (including 10 internal): Cannot allocate memory
2015-02-24T11:48:37.427-0500 I -        [conn6] Invariant failure: ret resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 49
2015-02-24T11:48:37.443-0500 I CONTROL  [conn4] 
 0x10e85e199 0x10e8124e0 0x10e7ffa65 0x10e690fd9 0x10e691ccd 0x10e68fdef 0x10e68be81 0x10e689d6c 0x10e30a1c8 0x10e4ed3f0 0x10e4ed2b9 0x10e4badd5 0x10e3bcbf6 0x10e1069b1 0x10e821f9c 0x10e891991 0x7fff8888f899 0x7fff8888f72a 0x7fff88893fc9
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"10E0FB000","o":"763199"},{"b":"10E0FB000","o":"7174E0"},{"b":"10E0FB000","o":"704A65"},{"b":"10E0FB000","o":"595FD9"},{"b":"10E0FB000","o":"596CCD"},{"b":"10E0FB000","o":"594DEF"},{"b":"10E0FB000","o":"590E81"},{"b":"10E0FB000","o":"58ED6C"},{"b":"10E0FB000","o":"20F1C8"},{"b":"10E0FB000","o":"3F23F0"},{"b":"10E0FB000","o":"3F22B9"},{"b":"10E0FB000","o":"3BFDD5"},{"b":"10E0FB000","o":"2C1BF6"},{"b":"10E0FB000","o":"B9B1"},{"b":"10E0FB000","o":"726F9C"},{"b":"10E0FB000","o":"796991"},{"b":"7FFF8888E000","o":"1899"},{"b":"7FFF8888E000","o":"172A"},{"b":"7FFF8888E000","o":"5FC9"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc9", "gitVersion" : "e6577bc37a2edba81b99146934cf7bad00c6e1b2", "uname" : { "sysname" : "Darwin", "release" : "13.4.0", "version" : "Darwin Kernel Version 13.4.0: Sun Aug 17 19:50:11 PDT 2014; root:xnu-2422.115.4~1/RELEASE_X86_64", "machine" : "x86_64" }, "somap" : [ { "path" : "/Users/bdlucas/mongodb/mongodb-osx-x86_64-3.0.0-rc9/bin/mongod", "machType" : 2, "b" : "10E0FB000", "buildId" : "E70F00A36A1938D48D63D6DDF7D85BDB" }, { "path" : "/usr/lib/libSystem.B.dylib", "machType" : 6, "b" : "7FFF8AC5A000", "buildId" : "70B235FCBCED367BBA6C67C299BAE7D9" }, { "path" : "/usr/lib/libc++.1.dylib", "machType" : 6, "b" : "7FFF8AB46000", "buildId" : "4F68DFC5207739A8A449CAC5FDEE7BDE" }, { "path" : "/usr/lib/system/libcache.dylib", "machType" : 6, "b" : "7FFF85D3B000", "buildId" : "BDC1E65B72A13DA3A57CB23159CAAD0B" }, { "path" : "/usr/lib/system/libcommonCrypto.dylib", "machType" : 6, "b" : "7FFF8488B000", "buildId" : "8C4F0CA0389C3EDCB155E62DD2187E1D" }, { "path" : "/usr/lib/system/libcompiler_rt.dylib", "machType" : 6, "b" : "7FFF871A8000", "buildId" : "4CD916B21B17362AB403EF24A1DAC141" }, { "path" : "/usr/lib/system/libcopyfile.dylib", "machType" : 6, "b" : "7FFF87004000", "buildId" : "CF29DFF605893590834C82E2316612E8" }, { "path" : "/usr/lib/system/libcorecrypto.dylib", "machType" : 6, "b" : "7FFF8FB5A000", "buildId" : "F3973C2814B63006BB2B00DD7F09ABC7" }, { "path" : "/usr/lib/system/libdispatch.dylib", "machType" : 6, "b" : "7FFF871B3000", "buildId" : "C4E4A18D3C3B3C9C8709A4270D998DE7" }, { "path" : "/usr/lib/system/libdyld.dylib", "machType" : 6, "b" : "7FFF89668000", "buildId" : "41077DD7F9093B8A863E72AE304EDE13" }, { "path" : "/usr/lib/system/libkeymgr.dylib", "machType" : 6, "b" : "7FFF8C53B000", "buildId" : "3AA8D85DCF003BD3A5A0E28E1A32A6D8" }, { "path" : "/usr/lib/system/liblaunch.dylib", "machType" : 6, "b" : "7FFF8F090000", "buildId" : "A40A0C7B321639B48AE0B5D3BAF1DA8A" }, { "path" : "/usr/lib/system/libmacho.dylib", "machType" : 6, "b" : "7FFF8A8F7000", "buildId" : "1D2910DFC0363A82A3FD44FF73B5FF9B" }, { "path" : "/usr/lib/system/libquarantine.dylib", "machType" : 6, "b" : "7FFF8EF5D000", "buildId" : "7A1A2BCBC03D3A25BFA43E569B2D2C38" }, { "path" : "/usr/lib/system/libremovefile.dylib", "machType" : 6, "b" : "7FFF8F6E2000", "buildId" : "3543F917928E3DB2A2F47AB73B4970EF" }, { "path" : "/usr/lib/system/libsystem_asl.dylib", "machType" : 6, "b" : "7FFF8F7E8000", "buildId" : "655FB34352CF3E2FB14DBEBF5AAEF94D" }, { "path" : "/usr/lib/system/libsystem_blocks.dylib", "machType" : 6, "b" : "7FFF8495D000", "buildId" : "FB856CD12AEA39078E9B1E54B6827F82" }, { "path" : "/usr/lib/system/libsystem_c.dylib", "machType" : 6, "b" : "7FFF85C52000", "buildId" : "6FD3A4004BB23B95B90CBE6E9D0D78FA" }, { "path" : "/usr/lib/system/libsystem_configuration.dylib", "machType" : 6, "b" : "7FFF8A66F000", "buildId" : "4998CB6A9D54390A9F575D1AC53C135C" }, { "path" : 
"/usr/lib/system/libsystem_dnssd.dylib", "machType" : 6, "b" : "7FFF8AB99000", "buildId" : "3F8C6A0730463E88858FD9CEFC43A405" }, { "path" : "/usr/lib/system/libsystem_info.dylib", "machType" : 6, "b" : "7FFF85BCC000", "buildId" : "7D41A156D2853849A2C3C04ADE797D98" }, { "path" : "/usr/lib/system/libsystem_kernel.dylib", "machType" : 6, "b" : "7FFF9233A000", "buildId" : "9EDE872E2A9E3A788E1DAB790794A098" }, { "path" : "/usr/lib/system/libsystem_m.dylib", "machType" : 6, "b" : "7FFF881E7000", "buildId" : "B7F0E2E4277733FCA787D6430B630D54" }, { "path" : "/usr/lib/system/libsystem_malloc.dylib", "machType" : 6, "b" : "7FFF8B098000", "buildId" : "A695B4E438E9332EA77229D31E3F1385" }, { "path" : "/usr/lib/system/libsystem_network.dylib", "machType" : 6, "b" : "7FFF8921D000", "buildId" : "8B1E1F1DA5CC3BAE8B1EABC84337A364" }, { "path" : "/usr/lib/system/libsystem_notify.dylib", "machType" : 6, "b" : "7FFF90AD0000", "buildId" : "9B34B4FEF5AD3F09A5F046AFF3571323" }, { "path" : "/usr/lib/system/libsystem_platform.dylib", "machType" : 6, "b" : "7FFF882AF000", "buildId" : "3C3D3DA832B9324398ECD89B9A1670B3" }, { "path" : "/usr/lib/system/libsystem_pthread.dylib", "machType" : 6, "b" : "7FFF8888E000", "buildId" : "AB498556B555310E9041F67EC9E00E2C" }, { "path" : "/usr/lib/system/libsystem_sandbox.dylib", "machType" : 6, "b" : "7FFF8BDF6000", "buildId" : "0D0B13EA6B7A3AC8BE60B548543BEB77" }, { "path" : "/usr/lib/system/libsystem_stats.dylib", "machType" : 6, "b" : "7FFF89245000", "buildId" : "C588E082D94B35109F9A7AD83B3402DE" }, { "path" : "/usr/lib/system/libunc.dylib", "machType" : 6, "b" : "7FFF8F6E0000", "buildId" : "62682455186236FE8A047A6B91256438" }, { "path" : "/usr/lib/system/libunwind.dylib", "machType" : 6, "b" : "7FFF892A7000", "buildId" : "78DCC3582FC1302EB3950155B47CB547" }, { "path" : "/usr/lib/system/libxpc.dylib", "machType" : 6, "b" : "7FFF863A9000", "buildId" : "AB40CD57F4543FD4B41563B3C0D5C624" }, { "path" : "/usr/lib/libobjc.A.dylib", "machType" : 6, "b" : "7FFF8C075000", "buildId" : "AD7FD984271E30F4A3616B20319EC73B" }, { "path" : "/usr/lib/libauto.dylib", "machType" : 6, "b" : "7FFF8A531000", "buildId" : "F45C36E8B6063886B5B1B6745E757CA8" }, { "path" : "/usr/lib/libc++abi.dylib", "machType" : 6, "b" : "7FFF921D9000", "buildId" : "21A807D367323455B77F743E9F916DF0" }, { "path" : "/usr/lib/libDiagnosticMessagesClient.dylib", "machType" : 6, "b" : "7FFF863A7000", "buildId" : "4CDB0F7BC0AF3424BC39495696F0DB1E" } ] }}
 mongod(_ZN5mongo15printStackTraceERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEE+0x39) [0x10e85e199]
 mongod(_ZN5mongo10logContextEPKc+0x100) [0x10e8124e0]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0x175) [0x10e7ffa65]
 mongod(_ZN5mongo17WiredTigerSessionC2EP15__wt_connectionii+0x99) [0x10e690fd9]
 mongod(_ZN5mongo22WiredTigerSessionCache10getSessionEv+0xAD) [0x10e691ccd]
 mongod(_ZN5mongo16WiredTigerCursorC2ERKNSt3__112basic_stringIcNS1_11char_traitsIcEENS1_9allocatorIcEEEEybPNS_16OperationContextE+0x4F) [0x10e68fdef]
 mongod(_ZN5mongo21WiredTigerRecordStore8IteratorC2ERKS0_PNS_16OperationContextERKNS_8RecordIdERKNS_20CollectionScanParams9DirectionEb+0x61) [0x10e68be81]
 mongod(_ZNK5mongo21WiredTigerRecordStore11getIteratorEPNS_16OperationContextERKNS_8RecordIdERKNS_20CollectionScanParams9DirectionE+0x6C) [0x10e689d6c]
 mongod(_ZN5mongo14CollectionScan4workEPm+0x198) [0x10e30a1c8]
 mongod(_ZN5mongo12PlanExecutor18getNextSnapshottedEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0xA0) [0x10e4ed3f0]
 mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x39) [0x10e4ed2b9]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_b+0xA35) [0x10e4badd5]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0xA36) [0x10e3bcbf6]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xE1) [0x10e1069b1]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x32C) [0x10e821f9c]
 mongod(_ZN5boost12_GLOBAL__N_112thread_proxyEPv+0xB1) [0x10e891991]
 libsystem_pthread.dylib(_pthread_body+0x8A) [0x7fff8888f899]
 libsystem_pthread.dylib(_pthread_struct_init+0x0) [0x7fff8888f72a]
 libsystem_pthread.dylib(thread_start+0xD) [0x7fff88893fc9]
-----  END BACKTRACE  -----
2015-02-24T11:48:37.443-0500 I -        [conn4] 



 Comments   
Comment by Ramon Fernandez Marina [ 11/Nov/15 ]

grant@teambanjo.com, please open a new ticket so we can investigate your issue separately. Also please upload full logs for the affected node.

Thanks,
Ramón.

Comment by Grant Ridder [ 10/Nov/15 ]

I am seeing this in a 3.0.7 cluster

2015-11-10T17:06:50.525+0000 E STORAGE  [conn100] WiredTiger (12) [1447175210:525261][10824:0x7f27a1d2b700], connection.open_session: only configured to support 20010 sessions (including 10 internal): Cannot allocate memory
2015-11-10T17:06:50.525+0000 I -        [conn100] Invariant failure: conn->open_session(conn, NULL, "isolation=snapshot", &_session) resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 44
2015-11-10T17:06:50.533+0000 I CONTROL  [conn100] 
 0xf5c949 0xefb781 0xee1eba 0xd7fb5e 0xd82179 0xd7b629 0xd7beb1 0xd7bf05 0xd69be5 0xa836e5 0xa0c380 0xbcf404 0xbcf7b4 0xb9dd34 0xab3c50 0x80e43d 0xf0f7eb 0x7f27cb375df5 0x7f27c9e241ad
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B5C949"},{"b":"400000","o":"AFB781"},{"b":"400000","o":"AE1EBA"},{"b":"400000","o":"97FB5E"},{"b":"400000","o":"982179"},{"b":"400000","o":"97B629"},{"b":"400000","o":"97BEB1"},{"b":"400000","o":"97BF05"},{"b":"400000","o":"969BE5"},{"b":"400000","o":"6836E5"},{"b":"400000","o":"60C380"},{"b":"400000","o":"7CF404"},{"b":"400000","o":"7CF7B4"},{"b":"400000","o":"79DD34"},{"b":"400000","o":"6B3C50"},{"b":"400000","o":"40E43D"},{"b":"400000","o":"B0F7EB"},{"b":"7F27CB36E000","o":"7DF5"},{"b":"7F27C9D2E000","o":"F61AD"}],"processInfo":{ "mongodbVersion" : "3.0.7", "gitVersion" : "6ce7cbe8c6b899552dadd907604559806aa2e9bd", "uname" : { "sysname" : "Linux", "release" : "3.10.0-123.8.1.el7.x86_64", "version" : "#1 SMP Mon Sep 22 19:06:58 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "135D3273A8F29EDEA1A35CFC89771434807D4716" }, { "b" : "7FFF334FE000", "elfType" : 3, "buildId" : "F109E97B5BF927997E7FE69A3AE544A21A01FDB4" }, { "b" : "7F27CB36E000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "12F30315D4F4A2FE58B1977405C8B5515861E66B" }, { "b" : "7F27CB101000", "path" : "/lib64/libssl.so.10", "elfType" : 3, "buildId" : "BB96EE99138B19FECDAB55E80A1728B648ECAD50" }, { "b" : "7F27CAD1A000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "B154203FB7C05AEE29D5D6F6C000305191209FE4" }, { "b" : "7F27CAB12000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "7376A07360DC57189A8F92B20AA4AA1CAEA80551" }, { "b" : "7F27CA90E000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "4DFEE4EA9AE8FDD4C71BD4CCC0727222F19DF810" }, { "b" : "7F27CA607000", "path" : "/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "405EACD649720B8668FFBBA197CBF030A7EF6296" }, { "b" : "7F27CA305000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "A1AA62B29765BE03A36BF927B047EEEF8696EEC6" }, { "b" : "7F27CA0EF000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "5D3D7256AE68BCFF41E312A24825ED80ECA88A73" }, { "b" : "7F27C9D2E000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "C31FFE7942BFD77B2FCA8F9BD5709D387A86D3BC" }, { "b" : "7F27CB58A000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9866E1D2BA61EBB4CE4F009FACDAACC24EF3B804" }, { "b" : "7F27C9AE2000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "34672D541C8C9C5C1C25CB4F3F332CC9D3E604AD" }, { "b" : "7F27C97FF000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "45CB7F6CD322F5B55FF8B635F7EC1578631CCAEA" }, { "b" : "7F27C95FB000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "3A1166709F88740C49E060731832E3FAD2DFB66B" }, { "b" : "7F27C93C9000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "23A2D854538903E2B84EF0882046DD95522C8B59" }, { "b" : "7F27C91B3000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "E45643F27F3B3E960F3691AFC6EC27A98EF7B46B" }, { "b" : "7F27C8FA4000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "F4A3D5E7E23F871751CA8F250421F8CF83447AD2" }, { "b" : "7F27C8DA0000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "2E01D5AC08C1280D013AAB96B292AC58BC30A263" }, { "b" : "7F27C8B86000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "AC596E865AF0D14B10F7B707F47D2031AD6D68DC" }, { "b" : "7F27C8961000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "82FF6B18E1E42825CC2D060F969479AD4AF2F62C" }, { "b" : "7F27C8700000", "path" : "/lib64/libpcre.so.1", 
"elfType" : 3, "buildId" : "298B19C64B19995F2AA4DA7B852E90BA5302F630" }, { "b" : "7F27C84DB000", "path" : "/lib64/liblzma.so.5", "elfType" : 3, "buildId" : "218D03D1F6CF1A099A4D467B5E8ECF4F2BF45750" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf5c949]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xefb781]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0xDA) [0xee1eba]
 mongod(_ZN5mongo17WiredTigerSessionC1EP15__wt_connectioni+0x9E) [0xd7fb5e]
 mongod(_ZN5mongo22WiredTigerSessionCache10getSessionEv+0x409) [0xd82179]
 mongod(_ZN5mongo22WiredTigerRecoveryUnit14_ensureSessionEv+0x19) [0xd7b629]
 mongod(_ZN5mongo22WiredTigerRecoveryUnit10getSessionEPNS_16OperationContextE+0x11) [0xd7beb1]
 mongod(_ZN5mongo16WiredTigerCursorC1ERKSsmbPNS_16OperationContextE+0x35) [0xd7bf05]
 mongod(_ZNK5mongo21WiredTigerIndexUnique9newCursorEPNS_16OperationContextEi+0x55) [0xd69be5]
 mongod(_ZNK5mongo22BtreeBasedAccessMethod10findSingleEPNS_16OperationContextERKNS_7BSONObjE+0x25) [0xa836e5]
 mongod(_ZN5mongo11IDHackStage4workEPm+0xD0) [0xa0c380]
 mongod(_ZN5mongo12PlanExecutor18getNextSnapshottedEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0xA4) [0xbcf404]
 mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x34) [0xbcf7b4]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_+0xA74) [0xb9dd34]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xB10) [0xab3c50]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x80e43d]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf0f7eb]
 libpthread.so.0(+0x7DF5) [0x7f27cb375df5]
 libc.so.6(clone+0x6D) [0x7f27c9e241ad]
-----  END BACKTRACE  -----
2015-11-10T17:06:50.533+0000 I -        [conn100] 
 
***aborting after invariant() failure
 
 
2015-11-10T17:07:01.908+0000 I CONTROL  ***** SERVER RESTARTED *****

$ rpm -qa | grep mongo
mongodb-org-tools-3.0.7-1.el7.x86_64
mongodb-org-shell-3.0.7-1.el7.x86_64
mongodb-org-server-3.0.7-1.el7.x86_64
$ 

$ uname -a
Linux host01 3.10.0-123.8.1.el7.x86_64 #1 SMP Mon Sep 22 19:06:58 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ 

$ cat /etc/redhat-release 
CentOS Linux release 7.1.1503 (Core) 
$

Comment by a zhifan [ 03/Sep/15 ]

hi @redbeard0531, could you explain the relationship between the following three things:
1. a MongoDB connection, like 'mongo `hostname`'
2. a ClientCursor
3. a WT_SESSION

ComputedCartRS:SECONDARY> db.serverStatus().wiredTiger.session
{ "open cursor count" : 48476, "open session count" : 14742 }

$ netstat -anp | grep -c 27017
1338

I thought these numbers should be the same, but in fact they are not.

Per your statement above, "once you have established 20k ClientCursors the server is stuck":
1. Can I increase this limit by setting a bigger maxConns?
2. What is the impact of db.runCommand({setParameter:1, cursorTimeoutMillis: 60000})? If there is no socket communication between client and server for 60 seconds, will the cursor be closed because of this? If so, and if all operations complete within 60 seconds, will this change not hurt anything at all?
3. I saw { "open cursor count" : 48476 }, then set the timeout to 60 seconds; after 10 minutes the count had not decreased. Is that expected?

Comment by Githook User [ 28/Aug/15 ]

Author:

{u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}

Message: SERVER-17364: Remove obsolete warning in comment
Branch: master
https://github.com/mongodb/mongo/commit/c60f9d3318b3e3cbe8784cd7c3d9871c7a3391c2

Comment by Ramon Fernandez Marina [ 20/Aug/15 ]

Thanks for the additional information yangzheng, and sorry you ran into this problem. We found additional issues with the cursor cache in 3.0.5, as described in SERVER-19966, and a palliative fix for 3.0.6 went in under SERVER-19987. We're in the process of releasing 3.0.6, and release candidate 3.0.6-rc2 will be available for download later today.

Seeing that you can trigger this error, would you be willing to try 3.0.6-rc2 and make sure it solves this problem for you?

Thanks,
Ramón.

Comment by yz [ 20/Aug/15 ]

Hi Ramón, this problem still occurs in 3.0.5:

2015-08-20T09:46:04.851+0800 I -        [conn386442] Invariant failure: conn->open_session(conn, NULL, "isolation=snapshot", &_session) resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 47
2015-08-20T09:46:04.852+0800 E STORAGE  [conn382876] WiredTiger (12) [1440035164:852660][17223:0x7f92ff640700], connection.open_session: only configured to support 20010 sessions (including 10 internal): Cannot allocate memory

insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn      set repl     time
    54  2057     23     *0      57   304|0     0.2    1.8       0  7.9G 6.0G   0|0   1|0  321k     3m 1286 rept PRI 10:03:14
    48  2007     30     *0      47   327|0     0.2    1.8       0  7.9G 6.0G   0|0   2|0  313k     2m 1285 rept  PRI 10:03:15

             total       used       free     shared    buffers     cached
Mem:         64375      33378      30997          0        327      25568
-/+ buffers/cache:       7483      56892
Swap:         7999         17       7982
 
conf:
storageEngine=wiredTiger
wiredTigerCacheSizeGB=58
 
Now:
db.serverStatus().wiredTiger.session
{ "open cursor count" : 315293, "open session count" : 19033 }

Comment by Ramon Fernandez Marina [ 03/Aug/15 ]

yangzheng, excessive memory consumption by cursor cache was fixed in SERVER-17386, which was backported to 3.0.5. This ticket contains additional improvements for the 3.1 branch and can't be easily backported – but SERVER-17386 should address users' needs.

Regards,
Ramón.

Comment by yz [ 03/Aug/15 ]

Why was this bug not fixed in 3.0.5?

Comment by Githook User [ 16/Jul/15 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-17364 Don't stash RecoveryUnits across getMores

We now tell PlanExecutors to detach from their OperationContexts and to shed
all storage engine resources before stashing the ClientCursor. This is a
heavier weight operation than a normal save/restoreState which is no longer
allowed to change the OperationContext.
Branch: master
https://github.com/mongodb/mongo/commit/b9d2e18ca68246e5d21ed42a846ff4094867f159

Comment by Githook User [ 16/Jul/15 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-17364 Unify handling of child stages into PlanStage base class

This is prep for adding more methods that need to propagate to children.
Branch: master
https://github.com/mongodb/mongo/commit/c832bc753c29f91597b75fa02c0d9019c3c20b0f

Comment by Glen Miner [ 23/Jun/15 ]

Just as a follow-up about our use of cursorTimeoutMillis: we tried extensively to eliminate all orphaned cursors from our system (PHP and the Mongo JS shell) and ultimately could not eliminate enough leaks to prevent an alarming number of cursors (thousands) from being kept around.

We have run into problems with cursor timeouts as a result of our change; however, we can work around them.

Perhaps if SERVER-3090 were implemented we'd be able to isolate leaks and live with a higher timeout.

Comment by Ramon Fernandez Marina [ 21/Jun/15 ]

Users running into this issue may try setting a low value for the idle cursor timeout as a workaround. SERVER-8188 explains how to do this:

use admin
db.runCommand({setParameter:1, cursorTimeoutMillis: 60000})

NOTE: using this workaround may affect the way your application works if it relies on cursors being opened for longer than 60 seconds. The preferred way to work around this issue is to make sure cursors are properly closed.
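The value currently in effect can be confirmed afterwards with getParameter (standard shell usage; the exact output shape may vary by version):

use admin
db.runCommand({getParameter: 1, cursorTimeoutMillis: 1})
// e.g. { "cursorTimeoutMillis" : 60000, "ok" : 1 }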

Comment by Glen Miner [ 01/Jun/15 ]

I did manage to get a serverStats out of death #4 just now – it's attached to SERVER-18747 as stats2_ss.txt

Comment by Glen Miner [ 29/May/15 ]

Actually, things are pretty grim; we've had 3 cascading failures – all similar – over the last week. I just spent some time analyzing the logs from the first failure, and we have exonerated backups from suspicion. We're now utterly convinced that something is very rotten and is leading to some sort of deadlock or massive lock bubble that causes us to fall over (even with massively increased limits).

SERVER-18747

Comment by Bruce Lucas (Inactive) [ 28/May/15 ]

Hi Glen,

How are things progressing? As you surmised setting the WT session_max parameter is indeed a potential workaround for this.

If you do still encounter issues related to this can you please give us db.serverStatus() output? A single point in time when the server is experiencing problems would be helpful, even better would be to collect a timeseries of db.serverStatus() while you reproduce the issue:

  mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep($delay*1000)}" >ss.log &

Thanks,
Bruce

Comment by Glen Miner [ 22/May/15 ]

This problem caused a cascade of failures for us this morning; one node after another fell down until I got the panic phone call. We're not sure where the cursors are coming from, but the backup task is the prime suspect; we had just completed an upgrade from 2.6, and the last secondary node had finished replicating maybe 12 hours before disaster struck.

For posterity: countermeasures we've tried:

1) mongod.conf

setParameter:
  cursorTimeoutMillis: 30000

The default of 10 minutes seems to be very bad for us; 30 seconds has caused some problems, but working around those seems preferable to having things fall over.

2) Apache spare clients were set aggressively high. We use prefork and are currently on PHP mongo driver 1.6.8; we found that having a pool of 2000 preforked worker processes meant we'd end up with thousands of connections lingering around (it seems the connection pool is per process). Prior to WiredTiger this cost only memory; now it's a threat to stability.

We cranked this down to around 200 workers * 3 PHP servers, which means we get far fewer dormant workers holding on to cursors (however, in a rush we will still fork up to 3000/server, so this is not enough).

One interesting thing was that after I ran "sudo service apache2 restart" on all web servers to bring the number of processes down to ~500, it took a long time for Mongo to flush the connections out. I watched the log on the database machine and could see the connection count going down very slowly – we went from "6746 connections now open" to around 700. This was not CPU-limited; these are 32-core beasts and they were 95% idle.

These measures seem to have helped a lot; currently we're under pretty heavy load, no node is holding onto more than a few dozen cursors, and the number of connections is nearly 1/10th what it was.

However, we're nervous that in the event of a failover there will be a thundering herd of requests after the handoff – we estimate the max number of clients will peak around 9000; somehow this resulted in > 20010 sessions. I'm guessing this is somehow related to the latency I observed in connections being expired by the server. We're obviously concerned that while we've reduced the risk of saturation, we haven't eliminated the threat that saturation poses to stability.

The last thing we're going to try is some undocumented hackery in the config to double session_max:

storage:
  wiredTiger:
    engineConfig:
      configString: "session_max=40000"

I tested this on a dev machine and as far as I can tell it's making its way down into WT. I'm not sure if this carries memory risks – the DB servers all have 128GB of RAM and the database itself is maybe 60GB resident, so I think this is low-risk for us. Since we can't change this without a restart, we're going to have to ride out the weekend storm without it.
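If it helps, session usage can be watched against the raised limit with a small shell loop (a sketch, not from this ticket; 40000 mirrors the configString above):

// Poll WiredTiger session usage against the configured session_max
var sessionMax = 40000;
while (true) {
    var s = db.serverStatus().wiredTiger.session;
    print(Date() + " sessions: " + s["open session count"] + "/" + sessionMax +
          ", cursors: " + s["open cursor count"]);
    sleep(5 * 1000); // mongo shell sleep() takes milliseconds
}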

Comment by Mathias Stearn [ 21/Apr/15 ]

I looked into making this non-fatal, but now I don't think it is a good idea without other large changes to mongod. Note that this is actually a failure to allocate a new WT_SESSION, not a WT_CURSOR. There is (roughly) one WT_SESSION per active operation, plus one per ClientCursor. The problem with making this error non-fatal is that once you have established 20k ClientCursors the server is stuck: you can't send any new operations to the server, not even OpKillCursor, since every new op needs to create an OperationContext. Given that, I think we are better off killing the server and failing over, rather than leaving it around as a zombie.

I think the correct solution is to make idle ClientCursors not hold any WT resources, but that is a fairly large change and definitely won't be backportable.
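To make that accounting concrete, the rough demand can be read from serverStatus (a hedged shell sketch; the sum is only an approximation of the rule of thumb above, using standard serverStatus fields):

// Estimate WT_SESSION demand: roughly one per active operation plus one per
// open (mostly idle) ClientCursor, compared against WT's default 20010 limit.
var ss = db.serverStatus();
var activeOps = ss.globalLock.activeClients.total;   // operations in flight (approximate)
var openCursors = ss.metrics.cursor.open.total;      // idle + pinned server-side cursors
var wtSessions = ss.wiredTiger.session["open session count"];
print("estimated demand: " + (activeOps + openCursors) +
      ", WT open sessions: " + wtSessions + " (default limit 20010)");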

Comment by Vanni Fan [ 09/Apr/15 ]

----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B69089"},{"b":"400000","o":"B09881"},{"b":"400000","o":"AEFFBA"},{"b":"400000","o":"98C570"},{"b":"400000","o":"98CA46"},{"b":"400000","o":"987E7E"},{"b":"400000","o":"987EC5"},{"b":"400000","o":"976B25"},{"b":"400000","o":"6A0728"},{"b":"400000","o":"62A6C2"},{"b":"400000","o":"62ADFD"},{"b":"400000","o":"61EB7D"},{"b":"400000","o":"6410D5"},{"b":"400000","o":"7EC054"},{"b":"400000","o":"7EC404"},{"b":"400000","o":"7BAA0C"},{"b":"400000","o":"6D0AB0"},{"b":"400000","o":"42A50D"},{"b":"400000","o":"B1CC7B"},{"b":"7FA581676000","o":"7E9A"},{"b":"7FA580075000","o":"F42ED"}],"processInfo":{ "mongodbVersion" : "3.0.1", "gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952", "uname" : { "sysname" : "Linux", "release" : "3.2.0-23-generic", "version" : "#36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "7F7D0BB0BFBF6B9A69384E0AB86648732D9C5BA8" }, { "b" : "7FFF272D9000", "elfType" : 3, "buildId" : "0C0CC7757C6933FC35A55C90475D300ADAF970C5" }, { "b" : "7FA581676000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "C340AF9DEE97C17C730F7D03693286C5194A46B8" }, { "b" : "7FA58141A000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "110068FEAE2CA610641E122FFC1B883C015CF46D" }, { "b" : "7FA581052000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "7FF6D799EC2B13199AA63204D944E2D6420A3960" }, { "b" : "7FA580E4A000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "352C5B373A50E6C4AB881A5DB6F5766FDF81EEE0" }, { "b" : "7FA580C46000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D181AF551DBBC43E9D55913D532635FDE18E7C4E" }, { "b" : "7FA580946000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "B534DA725D06A04267EB2FEB92B9CC14C838B57B" }, { "b" : "7FA58064A000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "817AA99B3DD02501F8BC04A3E9A9358A08F20D7D" }, { "b" : "7FA580434000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "ECF322A96E26633C5D10F18215170DD4395AF82C" }, { "b" : "7FA580075000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "57B3EB388FB6FE036D1E708259DACA7FCE990462" }, { "b" : "7FA581893000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "E25AD1A11CCF57E734116B8EC9C69F643DCA9F18" }, { "b" : "7FA57FE5E000", "path" : "/lib/x86_64-linux-gnu/libz.so.1", "elfType" : 3, "buildId" : "F695ECFCF3918D5D34989398A14B7ECDD9F46CD0" } ] }}
mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf69089]
mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf09881]
mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0xDA) [0xeeffba]
mongod(_ZN5mongo17WiredTigerSessionC1EP15__wt_connectionii+0xA0) [0xd8c570]
mongod(_ZN5mongo22WiredTigerSessionCache10getSessionEv+0x4C6) [0xd8ca46]
mongod(_ZN5mongo22WiredTigerRecoveryUnit10getSessionEPNS_16OperationContextE+0x3E) [0xd87e7e]
mongod(_ZN5mongo16WiredTigerCursorC1ERKSsmbPNS_16OperationContextE+0x35) [0xd87ec5]
mongod(_ZNK5mongo21WiredTigerIndexUnique9newCursorEPNS_16OperationContextEi+0x55) [0xd76b25]
mongod(_ZNK5mongo22BtreeBasedAccessMethod9newCursorEPNS_16OperationContextERKNS_13CursorOptionsEPPNS_11IndexCursorE+0x28) [0xaa0728]
mongod(_ZN5mongo9IndexScan13initIndexScanEv+0x62) [0xa2a6c2]
mongod(_ZN5mongo9IndexScan4workEPm+0x4D) [0xa2adfd]
mongod(_ZN5mongo10FetchStage4workEPm+0xCD) [0xa1eb7d]
mongod(_ZN5mongo16ShardFilterStage4workEPm+0x55) [0xa410d5]
mongod(_ZN5mongo12PlanExecutor18getNextSnapshottedEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0xA4) [0xbec054]
mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x34) [0xbec404]
mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_+0xA3C) [0xbbaa0c]
mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xB10) [0xad0ab0]
mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x82a50d]
mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf1cc7b]
libpthread.so.0(+0x7E9A) [0x7fa58167de9a]
libc.so.6(clone+0x6D) [0x7fa5801692ed]
----- END BACKTRACE -----

Comment by Vanni Fan [ 09/Apr/15 ]

When I use JMeter to load test my web service (Nginx + PHP-FPM, mongo-php-driver 1.6.6), from time to time I get this error too.

Mongodb Server: Linux hk-radar-mongo-slave1 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Mongodb Version: mongodb-linux-x86_64-ubuntu1204-3.0.1

-------------------------------
2015-04-08T21:34:30.921+0800 I - [conn756196] Invariant failure: ret resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 49
2015-04-08T21:34:30.924+0800 E STORAGE [conn757382] WiredTiger (12) [1428500070:924706][40771:0x7fa4d6a87700], connection.open_session: only configured to support 20010 sessions (including 10 internal): Cannot allocate memor
2015-04-08T21:34:30.924+0800 I - [conn757382] Invariant failure: ret resulted in status UnknownError 12: Cannot allocate memory at src/mongo/db/storage/wiredtiger/wiredtiger_session_cache.cpp 49
2015-04-08T21:34:30.924+0800 I CONTROL [conn755818]
0xf69089 0xf09881 0xeeffba 0xd8c570 0xd8ca46 0xd87e7e 0xd87ec5 0xd76b25 0xaa0728 0xa2a6c2 0xa2adfd 0xa1eb7d 0xa410d5 0xbec054 0xbec404 0xbbaa0c 0xad0ab0 0x82a50d 0xf1cc7b 0x7fa58167de9a 0x7fa5801692ed
----- BEGIN BACKTRACE -----
{"backtrace":[

{"b":"400000","o":"B69089"}

,

{"b":"400000","o":"B09881"}

,

{"b":"400000","o":"AEFFBA"}

,

{"b":"400000","o":"98C570"}

,

{"b":"400000","o":"98CA46"}

,

{"b":"400000","o":"987E7E"}

,

{"b":"400000","o":"987EC5"}

,{"b":"400000","
mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf69089]
mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf09881]
mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0xDA) [0xeeffba]
mongod(_ZN5mongo17WiredTigerSessionC1EP15__wt_connectionii+0xA0) [0xd8c570]
mongod(_ZN5mongo22WiredTigerSessionCache10getSessionEv+0x4C6) [0xd8ca46]
mongod(_ZN5mongo22WiredTigerRecoveryUnit10getSessionEPNS_16OperationContextE+0x3E) [0xd87e7e]
mongod(_ZN5mongo16WiredTigerCursorC1ERKSsmbPNS_16OperationContextE+0x35) [0xd87ec5]
mongod(_ZNK5mongo21WiredTigerIndexUnique9newCursorEPNS_16OperationContextEi+0x55) [0xd76b25]
mongod(_ZNK5mongo22BtreeBasedAccessMethod9newCursorEPNS_16OperationContextERKNS_13CursorOptionsEPPNS_11IndexCursorE+0x28) [0xaa0728]
mongod(_ZN5mongo9IndexScan13initIndexScanEv+0x62) [0xa2a6c2]
mongod(_ZN5mongo9IndexScan4workEPm+0x4D) [0xa2adfd]
mongod(_ZN5mongo10FetchStage4workEPm+0xCD) [0xa1eb7d]
mongod(_ZN5mongo16ShardFilterStage4workEPm+0x55) [0xa410d5]
mongod(_ZN5mongo12PlanExecutor18getNextSnapshottedEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0xA4) [0xbec054]
mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x34) [0xbec404]
mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_+0xA3C) [0xbbaa0c]
mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xB10) [0xad0ab0]
mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x82a50d]
mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf1cc7b]
libpthread.so.0(+0x7E9A) [0x7fa58167de9a]
libc.so.6(clone+0x6D) [0x7fa5801692ed]
----- END BACKTRACE -----
2015-04-08T21:34:30.925+0800 I - [conn755818]

***aborting after invariant() failure
