[SERVER-43674] MongoDB crash with "Got signal: 11 (Segmentation fault)." Created: 27/Sep/19  Updated: 11/Dec/19  Resolved: 11/Dec/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jiri Sula Assignee: Sulabh Mahajan
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Mongo 3.6.7, git version: 2628472127e9f1826e02c665c1d93880a204075e
Debian Stretch x86_64, 4.4.112 kernel, xfs file system


Attachments: File mongod.27022.log-20190926.bz2    
Issue Links:
Related
related to SERVER-43672 Invariant failure session->cursorOut(... Closed
Operating System: ALL
Sprint: Storage Engines 2019-11-18, Storage Engines 2019-12-02, Storage Engines 2019-12-16
Participants:
Story Points: 5

 Description   

Good day!

MongoDB node crashed with "Got signal: 11 (Segmentation fault)."

Nothing in the traces points at WireShark (although the direction in https://jira.mongodb.org/browse/SERVER-31236 looks very familiar).

Thank you!

Sep 26 11:08:08 opmc3 mongod.27022[133152]: [conn22428213] killcursors: found 0 of 1
Sep 26 11:08:08 opmc3 mongod.27022[133152]: [conn22428213] killcursors oddsportal.system.profile numYields:0 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 1104368 } }, Collection: { acquireCount: { r: 1 } } } 1104ms
Sep 26 11:08:15 opmc3 mongod.27022[133152]: [conn22428213] Invalid access at address: 0x88
Sep 26 11:08:15 opmc3 mongod.27022[133152]: [conn22428213] Got signal: 11 (Segmentation fault).
0x55caa3f91ff1 0x55caa3f91209 0x55caa3f91876 0x55caa3076cf1 0x7fbfb29a80c0 0x55caa28e7666 0x55caa28906c1 0x55caa27b3d67 0x55caa27afb05 0x55caa2d388f0 0x55caa2d389b1 0x55caa2d435c7 0x55caa2d43967 0x55caa2d17748 0x55caa2d19e7c 0x55caa2d1a9ed 0x55caa2d1ad98 0x55caa2980fc5 0x55caa298bfaa 0x55caa2987967 0x55caa298ada1 0x55caa388d7e2 0x55caa29867cf 0x55caa2988d15 0x55caa298960b 0x55caa29879ed 0x55caa298ada1 0x55caa388dd45 0x55caa3e4b5a4 0x7fbfb299e494 0x7fbfb26e0acf
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"55CAA1D5A000","o":"2237FF1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55CAA1D5A000","o":"2237209"},{"b":"55CAA1D5A000","o":"2237876"},{"b":"55CAA1D5A000","o":"131CCF1"},{"b":"7FBFB2997000","o":"110C0"},{"b":"55CAA1D5A000","o":"B8D666","s":"__wt_btcur_reset"},{"b":"55CAA1D5A000","o":"B366C1"},{"b":"55CAA1D5A000","o":"A59D67","s":"_ZN5mongo17WiredTigerSession13releaseCursorEmP11__wt_cursor"},{"b":"55CAA1D5A000","o":"A55B05","s":"_ZN5mongo35WiredTigerRecordStoreStandardCursorD0Ev"},{"b":"55CAA1D5A000","o":"FDE8F0","s":"_ZN5mongo14CollectionScanD1Ev"},{"b":"55CAA1D5A000","o":"FDE9B1","s":"_ZN5mongo14CollectionScanD0Ev"},{"b":"55CAA1D5A000","o":"FE95C7","s":"_ZN5mongo12PlanExecutorD1Ev"},{"b":"55CAA1D5A000","o":"FE9967","s":"_ZN5mongo12PlanExecutor7DeleterclEPS0_"},{"b":"55CAA1D5A000","o":"FBD748","s":"_ZN5mongo12ClientCursorD1Ev"},{"b":"55CAA1D5A000","o":"FBFE7C","s":"_ZN5mongo13CursorManager11eraseCursorEPNS_16OperationContextExb"},{"b":"55CAA1D5A000","o":"FC09ED"},{"b":"55CAA1D5A000","o":"FC0D98","s":"_ZN5mongo13CursorManager29eraseCursorGlobalIfAuthorizedEPNS_16OperationContextEiPKc"},{"b":"55CAA1D5A000","o":"C26FC5","s":"_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE"},{"b":"55CAA1D5A000","o":"C31FAA","s":"_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE"},{"b":"55CAA1D5A000","o":"C2D967","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"55CAA1D5A000","o":"C30DA1"},{"b":"55CAA1D5A000","o":"1B337E2","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE"},{"b":"55CAA1D5A000","o":"C2C7CF","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE"},{"b":"55CAA1D5A000","o":"C2ED15","s":"_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE"},{"b":"55CAA1D5A000","o":"C2F60B","s":"_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE"},{"b":"55CAA1D5A000","o":"C2D9ED","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"55CAA1D5A000","o":"C30DA1"},{"b":"55CAA1D5A000","o":"1B33D45"},{"b":"55CAA1D5A000","o":"20F15A4"},{"b":"7FBFB2997000","o":"7494"},{"b":"7FBFB25F8000","o":"E8ACF","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.7", "gitVersion" : "2628472127e9f1826e02c665c1d93880a204075e", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.112-ls-3", "version" : "#3 SMP Mon Feb 12 14:56:37 CET 2018", "machine" : "x86_64" }, "somap" : [ { "b" : "55CAA1D5A000", "elfType" : 3, "buildId" : "AA6EC882CEF443287FB9C7873C7DEC2643ED4BA9" }, { "b" : "7FFD85D79000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "DEBF60FC8D64715B7432B531F79998366AB94138" }, { "b" : "7FBFB3BDA000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "713D47D5F599289C0A91ADE8F0122B2B4AA78B2E" }, { "b" : "7FBFB3747000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.1", "elfType" : 3, "buildId" : "2CFE882A331D7857E9CE1B5DE3255E6DA76EF899" }, { "b" : "7FBFB34DB000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.1", "elfType" : 3, "buildId" : "E2AA3B39763D943F56B3BD05C8E36E639BA95E12" }, { "b" : "7FBFB32D7000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "B895F0831F623C5F23603401D4069F9F94C24761" }, { "b" : "7FBFB30CF000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, 
"buildId" : "5D83E0642E645026DBB11F89F7DF7106BD821495" }, { "b" : "7FBFB2DCB000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1B95E3A8B8788B07E4F59EE69B1877F9DEB42033" }, { "b" : "7FBFB2BB4000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "51AD5FD294CD6C813BED40717347A53434B80B7A" }, { "b" : "7FBFB2997000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "4285CD3158DDE596765C747AE210AB6CBD258B22" }, { "b" : "7FBFB25F8000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "AA889E26A70F98FA8D230D088F7CC5BF43573163" }, { "b" : "7FBFB3DF1000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "263F909DBE11A66F7C6233E3FF0521148D9F8370" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x55caa3f91ff1]
 mongod(+0x2237209) [0x55caa3f91209]
 mongod(+0x2237876) [0x55caa3f91876]
 mongod(+0x131CCF1) [0x55caa3076cf1]
 libpthread.so.0(+0x110C0) [0x7fbfb29a80c0]
 mongod(__wt_btcur_reset+0x126) [0x55caa28e7666]
 mongod(+0xB366C1) [0x55caa28906c1]
 mongod(_ZN5mongo17WiredTigerSession13releaseCursorEmP11__wt_cursor+0x47) [0x55caa27b3d67]
 mongod(_ZN5mongo35WiredTigerRecordStoreStandardCursorD0Ev+0x25) [0x55caa27afb05]
 mongod(_ZN5mongo14CollectionScanD1Ev+0x30) [0x55caa2d388f0]
 mongod(_ZN5mongo14CollectionScanD0Ev+0x11) [0x55caa2d389b1]
 mongod(_ZN5mongo12PlanExecutorD1Ev+0x1A7) [0x55caa2d435c7]
 mongod(_ZN5mongo12PlanExecutor7DeleterclEPS0_+0x27) [0x55caa2d43967]
 mongod(_ZN5mongo12ClientCursorD1Ev+0x58) [0x55caa2d17748]
 mongod(_ZN5mongo13CursorManager11eraseCursorEPNS_16OperationContextExb+0x12C) [0x55caa2d19e7c]
 mongod(+0xFC09ED) [0x55caa2d1a9ed]
 mongod(_ZN5mongo13CursorManager29eraseCursorGlobalIfAuthorizedEPNS_16OperationContextEiPKc+0x48) [0x55caa2d1ad98]
 mongod(_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE+0x1805) [0x55caa2980fc5]
 mongod(_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE+0xBA) [0x55caa298bfaa]
 mongod(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0x97) [0x55caa2987967]
 mongod(+0xC30DA1) [0x55caa298ada1]
 mongod(_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE+0x1A2) [0x55caa388d7e2]
 mongod(_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE+0x15F) [0x55caa29867cf]
 mongod(_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE+0xAF5) [0x55caa2988d15]
 mongod(_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE+0x23B) [0x55caa298960b]
 mongod(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0x11D) [0x55caa29879ed]
 mongod(+0xC30DA1) [0x55caa298ada1]
 mongod(+0x1B33D45) [0x55caa388dd45]
 mongod(+0x20F15A4) [0x55caa3e4b5a4]
 libpthread.so.0(+0x7494) [0x7fbfb299e494]
 libc.so.6(clone+0x3F) [0x7fbfb26e0acf]
-----  END BACKTRACE  -----




 Comments   
Comment by Sulabh Mahajan [ 11/Dec/19 ]

I had a look at this bug report. There seems to be a possibility that a swept dhandle was somehow accessed incorrectly. There is no additional information in the ticket to debug this issue; the log file is unfortunately not useful in this case, and a core dump from the crash would have been helpful.
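
For illustration, here is a minimal C toy of the "swept dhandle" pattern suspected here. Every name and the sweep behavior are invented stand-ins, not WiredTiger code:

#include <stdio.h>
#include <stdlib.h>

typedef struct { void *btree; } toy_dhandle;          /* stand-in for a data handle */
typedef struct { toy_dhandle *dhandle; } toy_cursor;  /* cached cursor keeps a reference */

/* The "sweep": frees an idle handle even though a cached cursor still points at it. */
static void sweep(toy_dhandle **slot) {
    free(*slot);
    *slot = NULL;
}

int main(void) {
    toy_dhandle *dh = calloc(1, sizeof *dh);
    toy_cursor cached = { .dhandle = dh };  /* cursor cached for later reuse */
    sweep(&dh);                             /* background sweep frees the handle */
    /* A later cursor reset that read cached.dhandle->btree would be a
     * use-after-free; in the real crash the analogous read faulted near NULL. */
    printf("cached cursor still holds stale dhandle %p\n", (void *)cached.dhandle);
    return 0;
}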

We have not seen this issue reported elsewhere. Given how rarely this bug has occurred and the lack of information to debug it, I am inclined to close this one out for now. We will keep an eye on internal testing and provide an update if the issue reappears.

Please feel free to re-open the ticket if this crash happens again or you get more information on the original crash.

Comment by Louis Williams [ 17/Oct/19 ]

The failure is in the S2BT macro, which expands to:

#define S2BT(session) ((WT_BTREE *)(session)->dhandle->handle)

This was part of a killCursors command. 

I did the pointer arithmetic, and the 0x88 offset is too low to be the offset of the "dhandle" pointer within the "session", so I believe the invalid pointer is the "handle" pointer on the "dhandle". I wonder if a dhandle sweep invalidated the handle on this session and resetting the cursor failed for that reason.
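
To make that arithmetic concrete, here is a self-contained C sketch. The struct layouts and padding sizes are invented placeholders standing in for WT_SESSION_IMPL and WT_DHANDLE, not the real WiredTiger definitions; only the relative offsets matter for the argument:

#include <stddef.h>
#include <stdio.h>

typedef struct {
    char pad[0x88];  /* assumption: 0x88 bytes of fields precede "handle" */
    void *handle;    /* what S2BT ultimately reads */
} fake_dhandle;

typedef struct {
    char pad[0x200];       /* assumption: "dhandle" sits well past offset 0x88 */
    fake_dhandle *dhandle;
} fake_session;

int main(void) {
    /* If dhandle is NULL (e.g. cleared by a sweep), reading dhandle->handle
     * faults at 0x0 + offsetof(fake_dhandle, handle), i.e. exactly 0x88 --
     * matching "Invalid access at address: 0x88" in the log. A NULL session
     * would instead fault at offsetof(fake_session, dhandle), well above 0x88. */
    printf("fault if dhandle is NULL: %#zx\n", offsetof(fake_dhandle, handle));
    printf("fault if session is NULL: %#zx\n", offsetof(fake_session, dhandle));
    return 0;
}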

I'm going to send this to Storage Engines to take a look.

 

Comment by Louis Williams [ 17/Oct/19 ]

daniel.hatcher, the failures both seem related to our cursor caching mechanism, yet I can't say that they're the same problem.

Comment by Vaclav Bilek [ 03/Oct/19 ]

It's on different physical HW; these are primary replicas of a sharded cluster.

We don't have logs from startup; the uploaded one is from the day it crashed.

Comment by Danny Hatcher (Inactive) [ 30/Sep/19 ]

jiri.sula@livesport.eu, as I mentioned on SERVER-43672, is this the same mongod that crashed or are they different? Is this a different process on the same underlying hardware? Please upload the full mongod logs from startup (or at least from a few days before the crash) up through the crash for each ticket so we can take a closer look.

Comment by Jiri Sula [ 27/Sep/19 ]

I can't believe I mentioned WireShark; all those wired animals... the weekend is coming. So there are no messages blaming the storage engine yet, which is why I didn't fill in the Component field. Thanks!
