[SERVER-38292] mongodb crash with Got signal: 11 (Segmentation fault) Created: 28/Nov/18  Updated: 11/Jan/23  Resolved: 03/Jun/19

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Manan Shah Assignee: Alexander Gorrod
Resolution: Cannot Reproduce Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File local_dmesg.txt     Text File local_system_log_Nov28.log    
Issue Links:
Depends
is depended on by SERVER-46192 mongodb crash with Got signal: 11 (Se... Backlog
Related
is related to WT-4037 WT_REF structures freed while still i... Closed
is related to SERVER-72715 Invalid access at address: 0 Closed
is related to WT-3076 Add a general-purpose epoch manager Closed
Operating System: ALL
Sprint: Storage Engines 2019-02-25, Storage Engines 2019-06-03
Participants:
Case:
Story Points: 0

 Description   

We are also having this issue on both 3.2.19 and 3.4.14 (both on the WT engine). Here's the stack from the 3.4.14 logs. Can you brief me on what exactly the fix in WT-4037 is? It is not clear from that ticket.

2018-11-28T10:06:13.885-0600 F - [thread2] Invalid access at address: 0
2018-11-28T10:06:13.906-0600 F - [thread2] Got signal: 11 (Segmentation fault). 0x562636fa4a51 0x562636fa3c69 0x562636fa42d6 0x7f74ee9647e0 0x56263716bca3 0x56263716bd7c 0x562637aa1d7a 0x5626378e0466 0x5626378e05ea 0x5626378e0c08 0x5626378e3ac3 0x562637934294 0x562637934e47 0x56263792f813 0x56263792fba7 0x5626379316a3 0x56263799b906 0x7f74ee95caa1 0x7f74ee6a9c4d
----- BEGIN BACKTRACE -----
{"backtrace":[\{"b":"562635A2B000","o":"1579A51","s":"_ZN5mongo15printStackTraceERSo"},\{"b":"562635A2B000","o":"1578C69"},\{"b":"562635A2B000","o":"15792D6"},\{"b":"7F74EE955000","o":"F7E0"},\{"b":"562635A2B000","o":"1740CA3","s":"_ZN8tcmalloc11ThreadCache21ReleaseToCentralCacheEPNS0_8FreeListEmi"},\{"b":"562635A2B000","o":"1740D7C","s":"_ZN8tcmalloc11ThreadCache11ListTooLongEPNS0_8FreeListEm"},\{"b":"562635A2B000","o":"2076D7A","s":"_ZdlPvRKSt9nothrow_t"},\{"b":"562635A2B000","o":"1EB5466","s":"__wt_split_stash_discard"},\{"b":"562635A2B000","o":"1EB55EA"},\{"b":"562635A2B000","o":"1EB5C08"},\{"b":"562635A2B000","o":"1EB8AC3","s":"__wt_split_reverse"},\{"b":"562635A2B000","o":"1F09294"},\{"b":"562635A2B000","o":"1F09E47","s":"__wt_evict"},\{"b":"562635A2B000","o":"1F04813"},\{"b":"562635A2B000","o":"1F04BA7"},\{"b":"562635A2B000","o":"1F066A3","s":"__wt_evict_thread_run"},\{"b":"562635A2B000","o":"1F70906","s":"__wt_thread_run"},\{"b":"7F74EE955000","o":"7AA1"},\{"b":"7F74EE5C1000","o":"E8C4D","s":"clone"}],"processInfo":\{ "mongodbVersion" : "3.4.14", "gitVersion" : "fd954412dfc10e4d1e3e2dd4fac040f8b476b268", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "2.6.32-754.6.3.el6.x86_64", "version" : "#1 SMP Tue Oct 9 17:27:49 UTC 2018", "machine" : "x86_64" }, "somap" : [ \{ "b" : "562635A2B000", "elfType" : 3, "buildId" : "64CD384C41ACC8D81741F0DFC0F9A3D7756F81FF" }, \{ "b" : "7FFF3E8F6000", "elfType" : 3, "buildId" : "F9F48CC73D4D61AE273899B31855C6589EE5EA8D" }, \{ "b" : "7F74EF7FD000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "BECFB85A8BC084042D5BF2BA9E66325CE798B659" }, \{ "b" : "7F74EF418000", "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "CBDA444A7109874C5350AE9CEEF3F82F749B347F" }, \{ "b" : "7F74EF210000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "552CEC3216281CCFD7FA6432C723D50163255823" }, \{ "b" : "7F74EF00C000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "2AF795BFFD122309BA3359FEBABB5D0967403D17" }, \{ "b" : "7F74EED88000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "4AAEE970B045D8BF946578B9C7F3AB5CDE9AB44A" }, \{ "b" : "7F74EEB72000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "EDC925E58FE28DCA536993EB13179C739F1E6566" }, \{ "b" : "7F74EE955000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "4EA475CD3FD3B69B6C95D9381FA74B36DB4992EF" }, \{ "b" : "7F74EE5C1000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "BCA7789C2EA8E28CB7CE553E183AC7E7EE36F8A2" }, \{ "b" : "7F74EFA69000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "97AF4B77212F74CFF72B6C013E6AA2D74A97EF60" }, \{ "b" : "7F74EE37D000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "9A737F8BF10FC99C37CC404D3FC188F6E11FEDD9" }, \{ "b" : "7F74EE096000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "8D3D6E28DF6EB3752642A7031AAC17D39EA4265D" }, \{ "b" : "7F74EDE92000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "7EC54D6E88BB7D2C1284117C2A483496A01EAAF4" }, \{ "b" : "7F74EDC66000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "CC89B4C8CDCCD32BA610BC72784DC3B7E9BD9E19" }, \{ "b" : "7F74EDA50000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, \{ "b" : "7F74ED845000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "E0C522C589F775C324330BE09CE67DC83950A213" }, \{ "b" : "7F74ED642000", "path" : "/lib64/libkeyutils.so.1", 
"elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, \{ "b" : "7F74ED428000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "4786A2A5D30B121601958E84D643C70C13C4FBA5" }, \{ "b" : "7F74ED209000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "B4576BE308DDCF7BC31F7304E4734C3D846D0236" } ] }}
 mongod-3.4(_ZN5mongo15printStackTraceERSo+0x41) [0x562636fa4a51]
 mongod-3.4(+0x1578C69) [0x562636fa3c69]
 mongod-3.4(+0x15792D6) [0x562636fa42d6]
 libpthread.so.0(+0xF7E0) [0x7f74ee9647e0]
 mongod-3.4(_ZN8tcmalloc11ThreadCache21ReleaseToCentralCacheEPNS0_8FreeListEmi+0xE3) [0x56263716bca3]
 mongod-3.4(_ZN8tcmalloc11ThreadCache11ListTooLongEPNS0_8FreeListEm+0x1C) [0x56263716bd7c]
 mongod-3.4(_ZdlPvRKSt9nothrow_t+0x26A) [0x562637aa1d7a]
 mongod-3.4(__wt_split_stash_discard+0xC6) [0x5626378e0466]
 mongod-3.4(+0x1EB55EA) [0x5626378e05ea]
 mongod-3.4(+0x1EB5C08) [0x5626378e0c08]
 mongod-3.4(__wt_split_reverse+0x83) [0x5626378e3ac3]
 mongod-3.4(+0x1F09294) [0x562637934294]
 mongod-3.4(__wt_evict+0xAC7) [0x562637934e47]
 mongod-3.4(+0x1F04813) [0x56263792f813]
 mongod-3.4(+0x1F04BA7) [0x56263792fba7]
 mongod-3.4(__wt_evict_thread_run+0xD3) [0x5626379316a3]
 mongod-3.4(__wt_thread_run+0x16) [0x56263799b906]
 libpthread.so.0(+0x7AA1) [0x7f74ee95caa1]
 libc.so.6(clone+0x6D) [0x7f74ee6a9c4d]
----- END BACKTRACE -----

 

Please note that we are in the process of generating a core dump and will attach here when the mongod process crashes again.



 Comments   
Comment by Alexander Gorrod [ 29/May/19 ]

Hi manan@indeed.com, sorry for the slow reply here - we had been hoping to find a potential root cause, but have been unable to identify anything definite. The information you have uploaded has been useful, but it isn't enough to allow us to identify the root cause.

What would be most useful is a description of steps we can follow to reproduce the issue internally. We've been trying based on the information received so far and haven't been able to reproduce it.

Failing that, if you see the issue again it would be helpful if you could upload data from the incident on a current version of 3.6 or 4.0. We've extended the system health information collected in those most recent versions, so there might be a hint in the data. Specifically, it would be useful if you could upload the following (a command sketch follows the list):

  • The mongod.log file
  • The content of the diagnostic.data directory
  • Syslog messages
  • The results of running "ls -l" on the database home directory
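For reference, a minimal shell sketch of one way to gather the items above; the paths are placeholders for your actual dbpath and log locations, not a prescribed layout:

# placeholder paths - adjust to the actual dbpath and mongod log locations
DBPATH=/var/lib/mongo
tar czf diagnostic.data.tar.gz -C "$DBPATH" diagnostic.data    # the diagnostic.data directory
cp /var/log/mongodb/mongod.log .                               # the mongod.log file
cp /var/log/messages syslog-extract.txt                        # or whatever syslog extract covers the crash window
ls -l "$DBPATH" > dbpath-listing.txt                           # "ls -l" on the database home directory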

I know you've uploaded that information before and we haven't been able to find a root cause from it, and it may well not lead to one in the future either - but absent a reproducer it's the best avenue we have for exploring at the moment.

You can use the same upload portal.

Comment by Jackie Chu [ 03/May/19 ]

There are other regression issues in 3.6 that prevent us from upgrading at the moment.

From the discussion in the thread, the root cause was never found. All of v3.2, v3.4, and v3.6 can crash with the same segfault signal (the only difference is how often it occurs), so I don't think this is mixing issues.

Comment by Manan Shah [ 03/May/19 ]

@Jackie Chu, you should upgrade to 3.6. This ticket is now about resolving the crash problem on 3.6. Please do not mix your issues in here.

Comment by Jackie Chu [ 03/May/19 ]

We are seeing a similar issue (a [mongodb 3.4] instance crashed after running for ~5 days).

 

2019-05-03 07:37:16 production-mongo-0-0   pattern not match: "----- END BACKTRACE -----"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " libc.so.6(clone+0x6D) [0x7f4cd3b9941d]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " libpthread.so.0(+0x76BA) [0x7f4cd3e636ba]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(+0x14EB101) [0x56492025c101]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(+0x8904CD) [0x56491f6014cd]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE+0x1FD) [0x56491f600b9d]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x746) [0x56491fa020f6]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(+0xC8F0F2) [0x56491fa000f2]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE+0x23B) [0x56491fe00beb]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE+0xF81) [0x56491f7e1be1]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE+0x935) [0x56491f7e08f5]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(+0xB33FAC) [0x56491f8a4fac]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(ZN5mongo9CmdUpdate7runImplEPNS_16OperationContextERKNSt7_cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_7BSONObjERNS_14BSONObjBuilderE+0x4F) [0x56491f8a965f]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo14performUpdatesEPNS_16OperationContextERKNS_8UpdateOpE+0x716) [0x56491fa87e56]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo12PlanExecutor11executePlanEv+0x6D) [0x56491fc2a03d]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x4B) [0x56491fc29f0b]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo12PlanExecutor11getNextImplEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0x19A) [0x56491fc295ea]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo9PlanStage4workEPm+0x63) [0x56491f91e2b3]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo11UpdateStage6doWorkEPm+0x547) [0x56491f952417]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo11UpdateStage18transformAndUpdateERKNS_11SnapshottedINS_7BSONObjEEERNS_8RecordIdE+0xDC2) [0x56491f951972]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo22WiredTigerRecoveryUnit7_commitEv+0x92) [0x56491fff79f2]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo22WiredTigerRecoveryUnit9_txnCloseEb+0xB6) [0x56491fff76c6]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(+0x1F757F9) [0x564920ce67f9]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(__wt_txn_commit+0x23D) [0x564920cf62dd]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(__wt_log_write+0x907) [0x564920c94cb7]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(__wt_log_release+0x135) [0x564920c92b85]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(__wt_checkpoint_signal+0x34) [0x564920c5f064]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(__wt_cond_signal+0x17) [0x564920cb3e97]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " libpthread.so.0(+0x11390) [0x7f4cd3e6d390]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(+0x158BB66) [0x5649202fcb66]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(+0x158B4F9) [0x5649202fc4f9]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x5649202fd2e1]"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: "----- BEGIN BACKTRACE -----"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: " 0x5649202fd2e1 0x5649202fc4f9 0x5649202fcb66 0x7f4cd3e6d390 0x564920cb3e97 0x564920c5f064 0x564920c92b85 0x564920c94cb7 0x564920cf62dd 0x564920ce67f9 0x56491fff76c6 0x56491fff79f2 0x56491f951972 0x56491f952417 0x56491f91e2b3 0x56491fc295ea 0x56491fc29f0b 0x56491fc2a03d 0x56491fa87e56 0x56491f8a965f 0x56491f8a4fac 0x56491f7e08f5 0x56491f7e1be1 0x56491fe00beb 0x56491fa000f2 0x56491fa020f6 0x56491f600b9d 0x56491f6014cd 0x56492025c101 0x7f4cd3e636ba 0x7f4cd3b9941d"
2019-05-03 07:37:16 production-mongo-0-0   pattern not match: ""
2019-05-03 07:37:16 production-mongo-0-0 FATAL Invalid access at address: 0
2019-05-03 07:37:16 production-mongo-0-0 FATAL Got signal: 11 (Segmentation fault).
2019-05-03 07:37:16 production-mongo-0-0 INFO end connection 127.0.0.1:49206 (89 connections now open)

 

Comment by Manan Shah [ 11/Mar/19 ]

Hi, the instance crashed after running for nearly 24 days. It was gracefully restarted on 2/13. The other 3.6 instance has been running for over 47 days now. My hunch is that it will crash once after a graceful restart, as it also happened the first time within 7 days after we upgraded to 3.6.

Here's the backtrace for the most recent crash:

2019-03-09T21:22:00.476-0600 F -        [thread637278] Invalid access at address: 0
2019-03-09T21:22:00.506-0600 F -        [thread637278] Got signal: 11 (Segmentation fault). 0x556198c53ca1 0x556198c52eb9 0x556198c53526 0x7fb1d60665d0 0x556198d40a03 0x556198d40adc 0x556198de32da 0x5561974f40b6 0x5561974f4136 0x5561974f4780 0x55619750d80c 0x5561974885a4 0x556197480be4 0x556197481153 0x556197483fc7 0x5561974db249 0x7fb1d605edd5 0x7fb1d5d87ead
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"5561969F5000","o":"225ECA1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"5561969F5000","o":"225DEB9"},{"b":"5561969F5000","o":"225E526"},{"b":"7FB1D6057000","o":"F5D0"},{"b":"5561969F5000","o":"234BA03","s":"_ZN8tcmalloc11ThreadCache21ReleaseToCentralCacheEPNS0_8FreeListEmi"},{"b":"5561969F5000","o":"234BADC","s":"_ZN8tcmalloc11ThreadCache11ListTooLongEPNS0_8FreeListEm"},{"b":"5561969F5000","o":"23EE2DA","s":"_ZdlPvRKSt9nothrow_t"},{"b":"5561969F5000","o":"AFF0B6"},{"b":"5561969F5000","o":"AFF136"},{"b":"5561969F5000","o":"AFF780","s":"__wt_page_out"},{"b":"5561969F5000","o":"B1880C","s":"__wt_split_multi"},{"b":"5561969F5000","o":"A935A4","s":"__wt_evict"},{"b":"5561969F5000","o":"A8BBE4"},{"b":"5561969F5000","o":"A8C153"},{"b":"5561969F5000","o":"A8EFC7","s":"__wt_evict_thread_run"},{"b":"5561969F5000","o":"AE6249"},{"b":"7FB1D6057000","o":"7DD5"},{"b":"7FB1D5C8A000","o":"FDEAD","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.9", "gitVersion" : "167861a164723168adfaaa866f310cb94010428f", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.10.0-957.1.3.el7.x86_64", "version" : "#1 SMP Thu Nov 29 14:49:43 UTC 2018", "machine" : "x86_64" }, "somap" : [ { "b" : "5561969F5000", "elfType" : 3, "buildId" : "AB104B86C600DDB6EA0E1D9D0CE5544A34FF48A2" }, { "b" : "7FFD03491000", "elfType" : 3, "buildId" : "DF8F6BF69E976BF1266E476EA2E37CEE06F10C1D" }, { "b" : "7FB1D726A000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "4C488F6E7044BB966162C1F7081ABBA6EBB2B485" }, { "b" : "7FB1D6E09000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "8BD89856B64DD5189BF075EF574EDF203F93D44A" }, { "b" : "7FB1D6B97000", "path" : "/lib64/libssl.so.10", "elfType" : 3, "buildId" : "AEF5E6F2240B55F90E9DF76CFBB8B9D9F5286583" }, { "b" : "7FB1D6993000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "67AD3498AC7DE3EAB952A243094DF5C12A21CD7D" }, { "b" : "7FB1D678B000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "EFDE2029C9A4A20BE5B8D8AE7E6551FF9B5755D2" }, { "b" : "7FB1D6489000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "918D3696BF321AA8D32950AB2AB8D0F1B21AC907" }, { "b" : "7FB1D6273000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "6B4F3D896CD0F06FCB3DEF0245F204ECE3220D7E" }, { "b" : "7FB1D6057000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "3D9441083D079DC2977F1BD50C8068D11767232D" }, { "b" : "7FB1D5C8A000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "3C61131D1DAC9DA79B73188E7702BEF786C2AD54" }, { "b" : "7FB1D7483000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5DA2D47925497B2F5875A7D8D1799A1227E2FDE4" }, { "b" : "7FB1D5A74000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "B9D5F73428BD6AD68C96986B57BEA3B7CEDB9745" }, { "b" : "7FB1D5827000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "B5C83BDE7ED7026835B779FA0F957FCCCD599F40" }, { "b" : "7FB1D553E000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "8B63976509135BA73A12153D6FDF7B3B9E5D2A54" }, { "b" : "7FB1D533A000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "B4BE1023D9606A88169DF411BF94AF417D7BA1A0" }, { "b" : "7FB1D511F000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "6183129B5F29CA14580E517DF94EF317761FA6C9" }, { "b" : "7FB1D4F10000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "98F619035053EF68358099CE7CF1AA528B3B229D" }, { "b" : "7FB1D4D0C000", "path" : 
"/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "2E01D5AC08C1280D013AAB96B292AC58BC30A263" }, { "b" : "7FB1D4AE5000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "D2DD4DA3FDE1477D25BFFF80F3A25FDB541A8179" }, { "b" : "7FB1D4883000", "path" : "/lib64/libpcre.so.1", "elfType" : 3, "buildId" : "9CA3D11F018BEEB719CDB34BE800BF1641350D0A" } ] }}
 mongod-3.6(_ZN5mongo15printStackTraceERSo+0x41) [0x556198c53ca1]
 mongod-3.6(+0x225DEB9) [0x556198c52eb9]
 mongod-3.6(+0x225E526) [0x556198c53526]
 libpthread.so.0(+0xF5D0) [0x7fb1d60665d0]
 mongod-3.6(_ZN8tcmalloc11ThreadCache21ReleaseToCentralCacheEPNS0_8FreeListEmi+0xE3) [0x556198d40a03]
 mongod-3.6(_ZN8tcmalloc11ThreadCache11ListTooLongEPNS0_8FreeListEm+0x1C) [0x556198d40adc]
 mongod-3.6(_ZdlPvRKSt9nothrow_t+0x26A) [0x556198de32da]
 mongod-3.6(+0xAFF0B6) [0x5561974f40b6]
 mongod-3.6(+0xAFF136) [0x5561974f4136]
 mongod-3.6(__wt_page_out+0x590) [0x5561974f4780]
 mongod-3.6(__wt_split_multi+0x49C) [0x55619750d80c]
 mongod-3.6(__wt_evict+0x10F4) [0x5561974885a4]
 mongod-3.6(+0xA8BBE4) [0x556197480be4]
 mongod-3.6(+0xA8C153) [0x556197481153]
 mongod-3.6(__wt_evict_thread_run+0x77) [0x556197483fc7]
 mongod-3.6(+0xAE6249) [0x5561974db249]
 libpthread.so.0(+0x7DD5) [0x7fb1d605edd5]
 libc.so.6(clone+0x6D) [0x7fb1d5d87ead]
-----  END BACKTRACE  ----- 

Comment by Alexander Gorrod [ 19/Feb/19 ]

Thanks for providing all the information and logs, manan@indeed.com - sorry we weren't able to isolate a more definite root cause.

Please re-open if you start seeing the failure again.

Comment by Manan Shah [ 08/Feb/19 ]

After the initial crash following the 3.6 upgrade, the problem has not occurred since. The problem always occurred on the two identical hosts, one of which is the host I sent the diagnostic info for, as well as the earlier core dumps from when this same host was running 3.4. It has not occurred even once on the PRIMARY of this same replica set (running version 3.4.14), which is on CentOS 6 rather than CentOS 7. The problem was first seen after an initial sync on 3.6, so I'm reluctant to do another initial sync of TBs of this data.

We have seen this problem randomly on some of our newer servers running CentOS 7, across a few different replica sets. It has never occurred on CentOS 6. Feel free to put this in the backlog for now. If it happens again after we fail over to make this host the PRIMARY, I will let you know.

Thank you very much.

Comment by Danny Hatcher (Inactive) [ 08/Feb/19 ]

Hello Manan,

In a previous comment, you mentioned that after the initial crash post-3.6 upgrade you have not encountered the issue again. Can you please confirm whether or not your servers have been stable since that time?

Additionally, has this problem ever occurred on different nodes or is it always constrained to the same one? If you initial sync the problem node, do the crashes stop occurring?

Unfortunately, after multiple engineers have spent time looking into this problem, we do not have an answer as to why the server kept crashing. The code path being accessed is very common, and we have no reports of other customers experiencing the same issue, which may indicate that something relatively unique to your environment is triggering the problem. Further investigation on our side would require significant developer time, which is not available right now. I'm afraid that we will have to put this ticket into our backlog unless further crashes reveal a smoking gun.

Thank you,

Danny

Comment by Danny Hatcher (Inactive) [ 07/Feb/19 ]

Hello manan@indeed.com,

I apologize for the delay in responding. We are actively looking into it but have not found anything conclusive as to a root cause. The data you've provided thus far has been very helpful, but I will let you know as soon as possible if we need anything further to continue investigating.

Thank you,

Danny

Comment by Manan Shah [ 06/Feb/19 ]

Danny / Vamshi, I would really appreciate an update from either of you. Was any information missing for a proper investigation here?

Comment by Manan Shah [ 30/Jan/19 ]

Hi, is there any update from the investigating team, or do you need more information? We are blocked from utilizing these servers for this database only because the instance could crash, which makes it unreliable for production. We would like to fail over the primary to this host if we can confidently say it won't crash.

Comment by Manan Shah [ 28/Jan/19 ]

Hi Danny, I also wanted to update that the instances have been stable after the initial crash on 3.6.9 - for over 9 days on server 1 and for 6 days on server 2. The instances never previously had a continuous run of more than 6 days on either of those servers on versions 3.4 and 3.2, so it is worth noting. I'll update you if this crashes again. But I'm still curious to learn why they crashed the first time, and whether these instances are ready to serve very critical production traffic.

Comment by Manan Shah [ 24/Jan/19 ]

Ok, do let me know if you need more files. I also uploaded the system log around the crash.

Comment by Danny Hatcher (Inactive) [ 24/Jan/19 ]

Manan,

Yes, I can confirm that we now have the files.

Danny

Comment by Manan Shah [ 24/Jan/19 ]

I was missing the "@" earlier, sorry about that. Can you please check now?

Comment by Danny Hatcher (Inactive) [ 24/Jan/19 ]

Hello Manan,

I do see some files now, but they simply contain the ASCII text of other filenames, so I believe the upload was incorrect. You should be able to copy/paste the curl command example from the portal, substituting the actual file name on your system for file.name in the final -F argument. For example,

-F "file=@file.name" \

becomes

-F "file=@mongod.log" \

Thank you,

Danny

Comment by Manan Shah [ 24/Jan/19 ]

I think there was an issue with the curl command in the earlier upload. I believe I rectified the error, and as a test I uploaded 3 files just now. Can you verify whether you received them?

Comment by Danny Hatcher (Inactive) [ 24/Jan/19 ]

Hello Manan,

I see no files uploaded in the folder after November 28th. I just uploaded a test to confirm that the folder should work. Could you please re-upload everything you uploaded recently?

Thank you,

Danny

Comment by Manan Shah [ 24/Jan/19 ]

Danny, are you referring to the mongo log or the diagnostic data file? The same mongo log contains the crash info from both version 3.4 and 3.6. The 3.6 crash in the log is at timestamp 2019-01-18T17:53:45. And I think the file metrics.2019-01-18T23-54-11Z-00000 or metrics.2019-01-18T20-35-46Z-00000 should also contain the data for the 3.6 crash.

Let me know if you didn't get these logs. I'll re-upload them.

Comment by Danny Hatcher (Inactive) [ 24/Jan/19 ]

Hello Manan,

I see logs from back in November when you were having the problems on 3.4.14, but I do not see any information from the latest crashes on 3.6.9. Would it be possible to upload the 3.6.9 info so we can directly compare the two versions? If you have already tried to upload the 3.6.9 files, could you try again?

Thank you for your patience and assistance,

Danny

Comment by Manan Shah [ 23/Jan/19 ]

Danny, these have been uploaded. Let me know if you can't find any of them, or which ones don't have the data you are expecting to see.

Comment by Danny Hatcher (Inactive) [ 23/Jan/19 ]

Hello Manan,

Could you also provide the mongod logs, system logs, and "diagnostic.data" folder from the node that went down? You can use the same upload portal you used before.

Thank you,

Danny

Comment by Danny Hatcher (Inactive) [ 22/Jan/19 ]

Hello Manan,

Thank you for keeping us updated and I am sorry that you are still experiencing instability within your system. I am working with our Developers to see what the next steps should be. We will reach out to you as soon as possible.

Thank you for your patience,

Danny

Comment by Manan Shah [ 22/Jan/19 ]

The second instance also crashed with the same error, Got signal: 11 (Segmentation fault), about 3 days after it was started on version 3.6.9. Do you have any further remedy? This is affecting our production environment.

Comment by Manan Shah [ 18/Jan/19 ]

The instance just crashed even on 3.6, after 3 consecutive days of uptime. Here's the trace:

2019-01-18T15:14:07.704-0600 F -        [thread79802] Invalid access at address: 0
2019-01-18T15:14:07.724-0600 F -        [thread79802] Got signal: 11 (Segmentation fault). 0x55da590b5ca1 0x55da590b4eb9 0x55da590b5526 0x7f078e37d5d0 0x55da591a2a03 0x55da591a2adc 0x55da592452da 0x55da579565ac 0x55da578ea4ab 0x55da578e2e79 0x55da578e3153 0x55da578e5fc7 0x55da5793d249 0x7f078e375dd5 0x7f078e09eead
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"55DA56E57000","o":"225ECA1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55DA56E57000","o":"225DEB9"},{"b":"55DA56E57000","o":"225E526"},{"b":"7F078E36E000","o":"F5D0"},{"b":"55DA56E57000","o":"234BA03","s":"_ZN8tcmalloc11ThreadCache21ReleaseToCentralCacheEPNS0_8FreeListEmi"},{"b":"55DA56E57000","o":"234BADC","s":"_ZN8tcmalloc11ThreadCache11ListTooLongEPNS0_8FreeListEm"},{"b":"55DA56E57000","o":"23EE2DA","s":"_ZdlPvRKSt9nothrow_t"},{"b":"55DA56E57000","o":"AFF5AC","s":"__wt_page_out"},{"b":"55DA56E57000","o":"A934AB","s":"__wt_evict"},{"b":"55DA56E57000","o":"A8BE79"},{"b":"55DA56E57000","o":"A8C153"},{"b":"55DA56E57000","o":"A8EFC7","s":"__wt_evict_thread_run"},{"b":"55DA56E57000","o":"AE6249"},{"b":"7F078E36E000","o":"7DD5"},{"b":"7F078DFA1000","o":"FDEAD","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.9", "gitVersion" : "167861a164723168adfaaa866f310cb94010428f", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.10.0-957.1.3.el7.x86_64", "version" : "#1 SMP Thu Nov 29 14:49:43 UTC 2018", "machine" : "x86_64" }, "somap" : [ { "b" : "55DA56E57000", "elfType" : 3, "buildId" : "AB104B86C600DDB6EA0E1D9D0CE5544A34FF48A2" }, { "b" : "7FFFED0DF000", "elfType" : 3, "buildId" : "DF8F6BF69E976BF1266E476EA2E37CEE06F10C1D" }, { "b" : "7F078F581000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "4C488F6E7044BB966162C1F7081ABBA6EBB2B485" }, { "b" : "7F078F120000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "8BD89856B64DD5189BF075EF574EDF203F93D44A" }, { "b" : "7F078EEAE000", "path" : "/lib64/libssl.so.10", "elfType" : 3, "buildId" : "AEF5E6F2240B55F90E9DF76CFBB8B9D9F5286583" }, { "b" : "7F078ECAA000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "67AD3498AC7DE3EAB952A243094DF5C12A21CD7D" }, { "b" : "7F078EAA2000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "EFDE2029C9A4A20BE5B8D8AE7E6551FF9B5755D2" }, { "b" : "7F078E7A0000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "918D3696BF321AA8D32950AB2AB8D0F1B21AC907" }, { "b" : "7F078E58A000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "6B4F3D896CD0F06FCB3DEF0245F204ECE3220D7E" }, { "b" : "7F078E36E000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "3D9441083D079DC2977F1BD50C8068D11767232D" }, { "b" : "7F078DFA1000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "3C61131D1DAC9DA79B73188E7702BEF786C2AD54" }, { "b" : "7F078F79A000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5DA2D47925497B2F5875A7D8D1799A1227E2FDE4" }, { "b" : "7F078DD8B000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "B9D5F73428BD6AD68C96986B57BEA3B7CEDB9745" }, { "b" : "7F078DB3E000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "B5C83BDE7ED7026835B779FA0F957FCCCD599F40" }, { "b" : "7F078D855000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "8B63976509135BA73A12153D6FDF7B3B9E5D2A54" }, { "b" : "7F078D651000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "B4BE1023D9606A88169DF411BF94AF417D7BA1A0" }, { "b" : "7F078D436000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "6183129B5F29CA14580E517DF94EF317761FA6C9" }, { "b" : "7F078D227000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "98F619035053EF68358099CE7CF1AA528B3B229D" }, { "b" : "7F078D023000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "2E01D5AC08C1280D013AAB96B292AC58BC30A263" }, { "b" : "7F078CDFC000", "path" : 
"/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "D2DD4DA3FDE1477D25BFFF80F3A25FDB541A8179" }, { "b" : "7F078CB9A000", "path" : "/lib64/libpcre.so.1", "elfType" : 3, "buildId" : "9CA3D11F018BEEB719CDB34BE800BF1641350D0A" } ] }}
 mongod-3.6(_ZN5mongo15printStackTraceERSo+0x41) [0x55da590b5ca1]
 mongod-3.6(+0x225DEB9) [0x55da590b4eb9]
 mongod-3.6(+0x225E526) [0x55da590b5526]
 libpthread.so.0(+0xF5D0) [0x7f078e37d5d0]
 mongod-3.6(_ZN8tcmalloc11ThreadCache21ReleaseToCentralCacheEPNS0_8FreeListEmi+0xE3) [0x55da591a2a03]
 mongod-3.6(_ZN8tcmalloc11ThreadCache11ListTooLongEPNS0_8FreeListEm+0x1C) [0x55da591a2adc]
 mongod-3.6(_ZdlPvRKSt9nothrow_t+0x26A) [0x55da592452da]
 mongod-3.6(__wt_page_out+0x3BC) [0x55da579565ac]
 mongod-3.6(__wt_evict+0xFFB) [0x55da578ea4ab]
 mongod-3.6(+0xA8BE79) [0x55da578e2e79]
 mongod-3.6(+0xA8C153) [0x55da578e3153]
 mongod-3.6(__wt_evict_thread_run+0x77) [0x55da578e5fc7]
 mongod-3.6(+0xAE6249) [0x55da5793d249]
 libpthread.so.0(+0x7DD5) [0x7f078e375dd5]
 libc.so.6(clone+0x6D) [0x7f078e09eead]
-----  END BACKTRACE  -----
 

 

So what do you now suggest as a solution?

Comment by Manan Shah [ 15/Jan/19 ]

So we finally started one of the two instances on version 3.6.9. Here's the log extract (some info x'ed out):

2019-01-15T14:28:22.849-0600 I CONTROL [initandlisten] MongoDB starting : pid=16452 port=xx dbpath=/path 64-bit host=xx
2019-01-15T14:28:22.849-0600 I CONTROL [initandlisten] db version v3.6.9
2019-01-15T14:28:22.849-0600 I CONTROL [initandlisten] git version: xx
2019-01-15T14:28:22.849-0600 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2019-01-15T14:28:22.849-0600 I CONTROL [initandlisten] allocator: tcmalloc
2019-01-15T14:28:22.849-0600 I CONTROL [initandlisten] modules: none
2019-01-15T14:28:22.849-0600 I CONTROL [initandlisten] build environment:
2019-01-15T14:28:22.849-0600 I CONTROL [initandlisten] distmod: rhel70
2019-01-15T14:28:22.849-0600 I CONTROL [initandlisten] distarch: x86_64
2019-01-15T14:28:22.849-0600 I CONTROL [initandlisten] target_arch: x86_64 

We shall see if this crashes within the next week or so; I will update.

Comment by Danny Hatcher (Inactive) [ 11/Jan/19 ]

Hello,

No, I am not sure what the issue is with the log. However, it may be related to the issue causing the crashes, in which case the upgrade would solve that as well. If not, we can investigate it separately.

Thank you,

Danny

Comment by Manan Shah [ 11/Jan/19 ]

Yes Danny, we are currently in the process of upgrading. QA is done and its featureCompatibilityVersion has been set to 3.4; prod will follow soon.

Certain steps in your documentation (https://docs.mongodb.com/v3.6/release-notes/3.6-upgrade-replica-set/) will slow the progress, you know, but I'm targeting the production upgrade of the crashing instance (it's a hidden secondary for now, so as not to affect the application) for next week.
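
For reference, a minimal sketch of the featureCompatibilityVersion check and the eventual bump described in that upgrade procedure - the port is a placeholder and these are the generic documented commands, not anything specific to this deployment:

# check the current featureCompatibilityVersion (placeholder port)
mongo --port 27017 --quiet --eval 'printjson(db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 }))'
# only after every replica set member is running 3.6 binaries, raise the FCV on the primary
mongo --port 27017 --quiet --eval 'printjson(db.adminCommand({ setFeatureCompatibilityVersion: "3.6" }))'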

Are you implying that the trace lines not being logged is also due to some other bug in 3.4, and that it will also be resolved in 3.6?

Comment by Danny Hatcher (Inactive) [ 11/Jan/19 ]

Hello,

Would it be feasible to try upgrading to 3.6 sooner rather than later? I know that there is overhead to that process, but we think there is a strong chance that you will no longer encounter the issue in that version.

Thank you,

Danny

Comment by Manan Shah [ 07/Jan/19 ]

Hi Danny,

It is random behavior in the sense that it logs the trace sometimes and omits it other times. Here's an example of the last lines in mongod.log for the crashed instance; after the "Invalid access at address: 0" line, nothing was logged.

2019-01-03T10:53:20.662-0600 I NETWORK  [thread1] connection accepted from 127.0.0.1:38206 #900 (14 connections now open)
2019-01-03T10:53:20.664-0600 I NETWORK  [conn900] received client metadata from 127.0.0.1:38206 conn900: { driver: { name: "PyMongo", version: "3.5.1" }, os: { type: "Linux", name: "CentOS Linux 7.6.1810 Core", architecture: "x86_64", version: "3.10.0-957.1.3.el7.x86_64" }, platform: "CPython 2.7.15.final.0" }
2019-01-03T10:53:20.665-0600 I -        [conn900] end connection 127.0.0.1:38206 (14 connections now open)
2019-01-03T10:53:20.850-0600 F -        [thread2] Invalid access at address: 0
 
 

Comment by Danny Hatcher (Inactive) [ 02/Jan/19 ]

Hello Manan,

Do you mean that it no longer includes the "backtrace" section after the segfault? Could you please provide a log file showing this?

Thank you,

Danny

Comment by Manan Shah [ 21/Dec/18 ]

Thanks, that's what we are hoping to try next, but since our Java driver is at 3.4, it will take significant testing to upgrade it to the latest version first, before we upgrade the server to 3.6. As this is a business-critical production database server running Mongo, it might take a few weeks to get there, but I'll be sure to report whatever the outcome is.

Also note that the 3.4.18 instance now crashes frequently, about once a day, with the invalid memory access at address 0 error. And now it does not even log the stack trace in mongod.log.

Comment by Danny Hatcher (Inactive) [ 21/Dec/18 ]

Hello Manan,

We believe that WT-3076 has the best chance to fix your system. When you are able, please upgrade to 3.6 and let us know if you still experience crashes.

Thank you,

Danny

Comment by Manan Shah [ 19/Dec/18 ]

Well, I called it too soon.

It did crash early today after almost 6 days of uptime. Here's the stack trace.

2018-12-19T03:00:47.996-0600 F -        [thread2] Invalid access at address: 0
2018-12-19T03:00:48.011-0600 F -        [thread2] Got signal: 11 (Segmentation fault). 0x55b9b9ab0cb1 0x55b9b9aafec9 0x55b9b9ab0536 0x7f86a09f35d0 0x55b9ba5b0830 0x55b9ba4621fb 0x55b9ba3ed4ae 0x55b9ba3f0843 0x55b9ba441014 0x55b9ba441bc7 0x55b9ba43c593 0x55b9ba43c927 0x55b9ba43e423 0x55b9ba4a8536 0x7f86a09ebdd5 0x7f86a0714ead
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"55B9B8522000","o":"158ECB1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55B9B8522000","o":"158DEC9"},{"b":"55B9B8522000","o":"158E536"},{"b":"7F86A09E4000","o":"F5D0"},{"b":"55B9B8522000","o":"208E830","s":"tc_calloc"},{"b":"55B9B8522000","o":"1F401FB","s":"__wt_calloc"},{"b":"55B9B8522000","o":"1ECB4AE"},{"b":"55B9B8522000","o":"1ECE843","s":"__wt_split_reverse"},{"b":"55B9B8522000","o":"1F1F014"},{"b":"55B9B8522000","o":"1F1FBC7","s":"__wt_evict"},{"b":"55B9B8522000","o":"1F1A593"},{"b":"55B9B8522000","o":"1F1A927"},{"b":"55B9B8522000","o":"1F1C423","s":"__wt_evict_thread_run"},{"b":"55B9B8522000","o":"1F86536","s":"__wt_thread_run"},{"b":"7F86A09E4000","o":"7DD5"},{"b":"7F86A0617000","o":"FDEAD","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.18", "gitVersion" : "4410706bef6463369ea2f42399e9843903b31923", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.10.0-957.1.3.el7.x86_64", "version" : "#1 SMP Thu Nov 29 14:49:43 UTC 2018", "machine" : "x86_64" }, "somap" : [ { "b" : "55B9B8522000", "elfType" : 3, "buildId" : "2346B88195258DF708C923592FA0F13714798067" }, { "b" : "7FFDFC6F5000", "elfType" : 3, "buildId" : "DF8F6BF69E976BF1266E476EA2E37CEE06F10C1D" }, { "b" : "7F86A1985000", "path" : "/lib64/libssl.so.10", "elfType" : 3, "buildId" : "AEF5E6F2240B55F90E9DF76CFBB8B9D9F5286583" }, { "b" : "7F86A1524000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "8BD89856B64DD5189BF075EF574EDF203F93D44A" }, { "b" : "7F86A131C000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "EFDE2029C9A4A20BE5B8D8AE7E6551FF9B5755D2" }, { "b" : "7F86A1118000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "67AD3498AC7DE3EAB952A243094DF5C12A21CD7D" }, { "b" : "7F86A0E16000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "918D3696BF321AA8D32950AB2AB8D0F1B21AC907" }, { "b" : "7F86A0C00000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "6B4F3D896CD0F06FCB3DEF0245F204ECE3220D7E" }, { "b" : "7F86A09E4000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "3D9441083D079DC2977F1BD50C8068D11767232D" }, { "b" : "7F86A0617000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "3C61131D1DAC9DA79B73188E7702BEF786C2AD54" }, { "b" : "7F86A1BF7000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5DA2D47925497B2F5875A7D8D1799A1227E2FDE4" }, { "b" : "7F86A03CA000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "B5C83BDE7ED7026835B779FA0F957FCCCD599F40" }, { "b" : "7F86A00E1000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "8B63976509135BA73A12153D6FDF7B3B9E5D2A54" }, { "b" : "7F869FEDD000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "B4BE1023D9606A88169DF411BF94AF417D7BA1A0" }, { "b" : "7F869FCC2000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "6183129B5F29CA14580E517DF94EF317761FA6C9" }, { "b" : "7F869FAAC000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "B9D5F73428BD6AD68C96986B57BEA3B7CEDB9745" }, { "b" : "7F869F89D000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "98F619035053EF68358099CE7CF1AA528B3B229D" }, { "b" : "7F869F699000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "2E01D5AC08C1280D013AAB96B292AC58BC30A263" }, { "b" : "7F869F480000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "4C488F6E7044BB966162C1F7081ABBA6EBB2B485" }, { "b" : "7F869F259000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : 
"D2DD4DA3FDE1477D25BFFF80F3A25FDB541A8179" }, { "b" : "7F869EFF7000", "path" : "/lib64/libpcre.so.1", "elfType" : 3, "buildId" : "9CA3D11F018BEEB719CDB34BE800BF1641350D0A" } ] }}
 mongod-3.4(_ZN5mongo15printStackTraceERSo+0x41) [0x55b9b9ab0cb1]
 mongod-3.4(+0x158DEC9) [0x55b9b9aafec9]
 mongod-3.4(+0x158E536) [0x55b9b9ab0536]
 libpthread.so.0(+0xF5D0) [0x7f86a09f35d0]
 mongod-3.4(tc_calloc+0xA0) [0x55b9ba5b0830]
 mongod-3.4(__wt_calloc+0x3B) [0x55b9ba4621fb]
 mongod-3.4(+0x1ECB4AE) [0x55b9ba3ed4ae]
 mongod-3.4(__wt_split_reverse+0x83) [0x55b9ba3f0843]
 mongod-3.4(+0x1F1F014) [0x55b9ba441014]
 mongod-3.4(__wt_evict+0xAC7) [0x55b9ba441bc7]
 mongod-3.4(+0x1F1A593) [0x55b9ba43c593]
 mongod-3.4(+0x1F1A927) [0x55b9ba43c927]
 mongod-3.4(__wt_evict_thread_run+0xD3) [0x55b9ba43e423]
 mongod-3.4(__wt_thread_run+0x16) [0x55b9ba4a8536]
 libpthread.so.0(+0x7DD5) [0x7f86a09ebdd5]
 libc.so.6(clone+0x6D) [0x7f86a0714ead]
-----  END BACKTRACE  -----
 

Comment by Manan Shah [ 18/Dec/18 ]

Just wanted to update that we upgraded the OS on the system to CentOS Linux release 7.6.1810 (from CentOS 6.9) and restarted the crashing instance on mongod version 3.4.18. It has been stable for the past six days. I'm hoping this has resolved the issue.

We aren't ready to upgrade to 3.6 yet, but hopefully that will also work on the new CentOS 7 system.

Comment by Alexander Gorrod [ 06/Dec/18 ]

Hi manan@indeed.com, sorry for the slow response - this appears to be a memory corruption issue, and those are notoriously difficult to isolate and fix without a fast reproducer.

There have been a few WiredTiger changes that fixed related code: WT-4037 was fixed in 3.4.16 - it's possible, but fairly unlikely, that it fixed the issue you are seeing. WT-3076 was fixed in the 3.6 release - it's more likely to have fixed the problem you are seeing. Could you upgrade to the current 3.6 release of MongoDB and let us know if the issue is resolved?

Comment by Manan Shah [ 05/Dec/18 ]

Can you provide an update on whether it is even possible to fix this with the current MongoDB version? The crash occurs about once every 2 days on this version of the kernel and OS.

Comment by Manan Shah [ 28/Nov/18 ]

The core dump part uploads are on their way. Let me know if you do not get them.

Comment by Ramon Fernandez Marina [ 28/Nov/18 ]

Correction: while SERVER-32170 has the same ultimate failure as this ticket, keith.bostic points out that it is triggered from a different code path (__wt_split_stash_discard).

Comment by Kelsey Schubert [ 28/Nov/18 ]

Hi manan@indeed.com,

Thank you for providing these files; we'll investigate. The core dump may help us identify the root cause of this behavior, and we'd appreciate it if you could provide it as well. Unfortunately, the portal has a 5GB limit which cannot be increased. However, there's an easy workaround, which is to use the split command as follows:

split -d -b 5300000000 filename.tgz part.

This will produce a series of part.XX where XX is a number; you can then upload these files via the S3 portal and we'll stitch them back together.

Since these are large files, it'd be helpful if you could provide their checksums (e.g. the output of cksum on each part) as well so we can confirm their integrity after downloading.
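
For example, a minimal sketch assuming the compressed dump was archived as coredump.tgz (a placeholder name):

split -d -b 5300000000 coredump.tgz part.   # produces part.00, part.01, ...
cksum part.*                                # include these checksum lines with the upload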

Thanks again for your help,
Kelsey

Comment by Manan Shah [ 28/Nov/18 ]

Apart from the files I uploaded through the secure portal, I now have a core dump from the most recent crash on the host. Should I upload that as well? It is 36G in size.

Comment by Kelsey Schubert [ 28/Nov/18 ]

Hi manan@indeed.com,

I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Thanks,
Kelsey

Comment by Manan Shah [ 28/Nov/18 ]

This happens 2-3 times a day with mongod version 3.2.19, and it has occurred once in the past 24 hours on mongod 3.4.14 since the process was gracefully restarted a day ago (this is the first instance of it crashing on version 3.4.14). Here's some of the info you requested.

local_dmesg.txt

The diagnostic.data zip file is 194M. Should I upload it here?

Nothing was found in the system log around the time of the crash (but here's the extract):

local_system_log_Nov28.log

Cannot attach the mongod.log due to security reasons.

Comment by Ramon Fernandez Marina [ 28/Nov/18 ]

Thanks for opening a new ticket, manan@indeed.com. As previously mentioned in SERVER-33176, this is unrelated to WT-4037.

We have seen this stack trace before in SERVER-32170, but unfortunately we were not able to reproduce the problem at the time. Can you please provide the following (a short command sketch for the last two items follows the list):

  • Some more information about when this happens (e.g.: only once, every day, etc.)
  • The contents of the diagnostic.data directory
  • The mongod.log file for this node showing the crash
  • The contents of your system log (for at least a time window that overlaps with the problem)
  • The output of the dmesg command on this node
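
A minimal sketch for capturing the last two items, assuming a syslog setup that writes to /var/log/messages (the path and date pattern are placeholders):

dmesg > local_dmesg.txt                                        # kernel ring buffer around the crash
grep "Nov 28" /var/log/messages > local_system_log_Nov28.log   # syslog lines overlapping the crash window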

Thanks,
Ramón.
