[SERVER-29155] "Random" Segmentation faults Created: 12/May/17  Updated: 29/Jan/18  Resolved: 10/Jun/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Ilja Assignee: Keith Bostic (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Sadly it seems pretty sporadic, so I can't provide steps to reproduce.

Participants:

 Description   

Our MongoDB crashes sporadicly in production, currently it happens every 2-3 weeks so we can't reliably reproduce the issue. After a database restart the issue is gone for the next week(s). This database is being accessed by potentially 5 other dedicated servers this are it's mongostats:

insert query update delete getmore command dirty used flushes vsize   res qrw arw net_in net_out conn                time
    *0   176  11486     *0       0    16|0  2.6% 5.0%       0 2.42G 1.63G 0|0 0|1  1.83m   17.9m  484 May 12 16:50:44.252
     2   105  16974     *0       0    18|0  2.6% 5.0%       0 2.42G 1.63G 0|0 0|0  2.66m   1.44m  484 May 12 16:50:45.252
    *0   144  18391     *0       0    12|0  2.6% 5.0%       0 2.42G 1.63G 0|0 1|0  2.89m   1.55m  485 May 12 16:50:46.253
    *0    35  21393     *0       0    12|0  2.6% 5.0%       0 2.42G 1.63G 0|0 0|0  3.34m   1.71m  485 May 12 16:50:47.252
    *0    23  17219     *0       1    13|0  2.6% 5.0%       0 2.42G 1.63G 0|0 1|0  2.69m   1.42m  485 May 12 16:50:48.252
     1    47  14004     *0       0    17|0  2.6% 5.0%       0 2.42G 1.63G 0|0 2|0  2.20m   17.9m  486 May 12 16:50:49.252
    *0    56  17955     *0       0    24|0  2.6% 5.0%       0 2.42G 1.63G 0|0 0|0  2.83m   1.47m  486 May 12 16:50:50.253

2017-05-12T15:50:33.727+0200 F -        [conn36561] Invalid access at address: 0x4000013a7
2017-05-12T15:50:33.735+0200 F -        [conn36561] Got signal: 11 (Segmentation fault).
 
 0x5654d14eeee1 0x5654d14edfd9 0x5654d14ee646 0x7ff8f8773370 0x5654d1e36c74 0x5654d1e02459 0x5654d1e52b88 0x5654d11ef44e 0x5654d0b4aa8b 0x5654d0af0e76 0x5654d0b12c63 0x5654d0ae4538 0x5654d0e199c2 0x5654d0e1bd68 0x5654d0e1ca1c 0x5654d0dd5ca2 0x5654d0dd685b 0x5654d09fdad8 0x5654d09d4867 0x5654d09d5c00 0x5654d0ff688d 0x5654d0bf3302 0x5654d0bf5276 0x5654d07f1f1d 0x5654d07f285d 0x5654d1454722 0x7ff8f876bdc5 0x7ff8f849a73d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"5654CFF71000","o":"157DEE1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"5654CFF71000","o":"157CFD9"},{"b":"5654CFF71000","o":"157D646"},{"b":"7FF8F8764000","o":"F370"},{"b":"5654CFF71000","o":"1EC5C74"},{"b":"5654CFF71000","o":"1E91459"},{"b":"5654CFF71000","o":"1EE1B88"},{"b":"5654CFF71000","o":"127E44E","s":"_ZN5mongo21WiredTigerRecordStore6Cursor9seekExactERKNS_8RecordIdE"},{"b":"5654CFF71000","o":"BD9A8B","s":"_ZN5mongo16WorkingSetCommon5fetchEPNS_16OperationContextEPNS_10WorkingSetEmNS_11unowned_ptrINS_20SeekableRecordCursorEEE"},{"b":"5654CFF71000","o":"B7FE76","s":"_ZN5mongo10FetchStage6doWorkEPm"},{"b":"5654CFF71000","o":"BA1C63","s":"_ZN5mongo9PlanStage4workEPm"},{"b":"5654CFF71000","o":"B73538","s":"_ZN5mongo15CachedPlanStage12pickBestPlanEPNS_15PlanYieldPolicyE"},{"b":"5654CFF71000","o":"EA89C2","s":"_ZN5mongo12PlanExecutor12pickBestPlanENS0_11YieldPolicyEPKNS_10CollectionE"},{"b":"5654CFF71000","o":"EAAD68","s":"_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_11YieldPolicyE"},{"b":"5654CFF71000","o":"EABA1C","s":"_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionENS0_11YieldPolicyE"},{"b":"5654CFF71000","o":"E64CA2","s":"_ZN5mongo11getExecutorEPNS_16OperationContextEPNS_10CollectionESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS5_EENS_12PlanExecutor11YieldPolicyEm"},{"b":"5654CFF71000","o":"E6585B","s":"_ZN5mongo15getExecutorFindEPNS_16OperationContextEPNS_10CollectionERKNS_15NamespaceStringESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS8_EENS_12PlanExecutor11YieldPolicyE"},{"b":"5654CFF71000","o":"A8CAD8","s":"_ZN5mongo7FindCmd3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_7BSONObjEiRS8_RNS_14BSONObjBuilderE"},{"b":"5654CFF71000","o":"A63867","s":"_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE"},{"b":"5654CFF71000","o":"A64C00","s":"_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE"},{"b":"5654CFF71000","o":"108588D","s":"_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE"},{"b":"5654CFF71000","o":"C82302"},{"b":"5654CFF71000","o":"C84276","s":"_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE"},{"b":"5654CFF71000","o":"880F1D","s":"_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE"},{"b":"5654CFF71000","o":"88185D"},{"b":"5654CFF71000","o":"14E3722"},{"b":"7FF8F8764000","o":"7DC5"},{"b":"7FF8F83A3000","o":"F773D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.4", "gitVersion" : "888390515874a9debd1b6c5d36559ca86b44babd", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.30-mod-std-ipv6-64", "version" : "#9 SMP Tue Nov 1 17:58:26 CET 2016", "machine" : "x86_64" }, "somap" : [ { "b" : "5654CFF71000", "elfType" : 3, "buildId" : "BAB2B823978D8F6F84A5C63BFA4F54F3F06A65A4" }, { "b" : "7FFF133FA000", "elfType" : 3, "buildId" : "963A363281907ED5AFD9334B132661E125712016" }, { "b" : "7FF8F968E000", "path" : "/lib64/libssl.so.10", "elfType" : 3, "buildId" : "90EAF65D9B0EEEB1424241281F7F197451D4317D" }, { "b" : "7FF8F92A4000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "7278C69EE161D98DDD0FA00F92B67AD78C7B7F40" }, { "b" : "7FF8F909C000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "82E77ADE22BC9FFF8D3458BD37331E7EDF174C28" }, { "b" : "7FF8F8E98000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "C5F560504E1AF52E29679C3B52FF11121015D6BB" }, { "b" : "7FF8F8B96000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "721C7CC9488EFA25F83B48AF713AB27DBE48EF3E" }, { "b" : "7FF8F8980000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "408B46E291B2D4C9612E27C0509D165D7E186D40" }, { "b" : "7FF8F8764000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "C3DEB1FA27CD0C1C3CC575B944ABACBA0698B0F2" }, { "b" : "7FF8F83A3000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "8B2C421716985B927AA0CAF2A05D0B1F452367F7" }, { "b" : "7FF8F98FC000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "8F3E366E2DB73C330A3791DEAE31AE9579099B44" }, { "b" : "7FF8F8155000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "A2499C359AA179EE23324ED949C0E508E4434F10" }, { "b" : "7FF8F7E6E000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "E09A34D9083DC6FEAF7018C09D55631DEEE2836D" }, { "b" : "7FF8F7C6A000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "BF54B7C8932E450769FBBB8B18864D1DD70BBC67" }, { "b" : "7FF8F7A38000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "BF8F00D7CB849ADB0B7A4703BC7B8D66AEE6A49C" }, { "b" : "7FF8F7822000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "EA8E45DC8E395CC5E26890470112D97A1F1E0B65" }, { "b" : "7FF8F7613000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "1E7A92FDD6FB3871DA97F4BCA2E147E72B6B6E1F" }, { "b" : "7FF8F740F000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "2E01D5AC08C1280D013AAB96B292AC58BC30A263" }, { "b" : "7FF8F71F5000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "FE7AE845A123A3DFC0FDC2408BCBC2BA8B61B158" }, { "b" : "7FF8F6FCE000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "76687CA31A406854DF3BCF8D03055656F56E6892" }, { "b" : "7FF8F6D6D000", "path" : "/lib64/libpcre.so.1", "elfType" : 3, "buildId" : "AE64AA461A26E01F60408013D361749D56DD0AE1" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x5654d14eeee1]
 mongod(+0x157CFD9) [0x5654d14edfd9]
 mongod(+0x157D646) [0x5654d14ee646]
 libpthread.so.0(+0xF370) [0x7ff8f8773370]
 mongod(+0x1EC5C74) [0x5654d1e36c74]
 mongod(+0x1E91459) [0x5654d1e02459]
 mongod(+0x1EE1B88) [0x5654d1e52b88]
 mongod(_ZN5mongo21WiredTigerRecordStore6Cursor9seekExactERKNS_8RecordIdE+0x4E) [0x5654d11ef44e]
 mongod(_ZN5mongo16WorkingSetCommon5fetchEPNS_16OperationContextEPNS_10WorkingSetEmNS_11unowned_ptrINS_20SeekableRecordCursorEEE+0xAB) [0x5654d0b4aa8b]
 mongod(_ZN5mongo10FetchStage6doWorkEPm+0x106) [0x5654d0af0e76]
 mongod(_ZN5mongo9PlanStage4workEPm+0x63) [0x5654d0b12c63]
 mongod(_ZN5mongo15CachedPlanStage12pickBestPlanEPNS_15PlanYieldPolicyE+0x198) [0x5654d0ae4538]
 mongod(_ZN5mongo12PlanExecutor12pickBestPlanENS0_11YieldPolicyEPKNS_10CollectionE+0xF2) [0x5654d0e199c2]
 mongod(_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_11YieldPolicyE+0x2D8) [0x5654d0e1bd68]
 mongod(_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionENS0_11YieldPolicyE+0xEC) [0x5654d0e1ca1c]
 mongod(_ZN5mongo11getExecutorEPNS_16OperationContextEPNS_10CollectionESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS5_EENS_12PlanExecutor11YieldPolicyEm+0x132) [0x5654d0dd5ca2]
 mongod(_ZN5mongo15getExecutorFindEPNS_16OperationContextEPNS_10CollectionERKNS_15NamespaceStringESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS8_EENS_12PlanExecutor11YieldPolicyE+0x8B) [0x5654d0dd685b]
 mongod(_ZN5mongo7FindCmd3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_7BSONObjEiRS8_RNS_14BSONObjBuilderE+0xC98) [0x5654d09fdad8]
 mongod(_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE+0x527) [0x5654d09d4867]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE+0x9E0) [0x5654d09d5c00]
 mongod(_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE+0x26D) [0x5654d0ff688d]
 mongod(+0xC82302) [0x5654d0bf3302]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x6D6) [0x5654d0bf5276]
 mongod(_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE+0x1ED) [0x5654d07f1f1d]
 mongod(+0x88185D) [0x5654d07f285d]
 mongod(+0x14E3722) [0x5654d1454722]
 libpthread.so.0(+0x7DC5) [0x7ff8f876bdc5]
 libc.so.6(clone+0x6D) [0x7ff8f849a73d]
-----  END BACKTRACE  -----



 Comments   
Comment by Keith Bostic (Inactive) [ 10/Jun/17 ]

That's great news irwks, thanks for letting us know!

Comment by Ilja [ 10/Jun/17 ]

Hey Keith,

sorry for not getting back to this issue, as suggested we migrated to another system with totally different hardware. Since this migration we haven't experienced this issue (running for one week now). Our further investigation resulted in a potential thermal problem at our hosting providers data center, since the first migration was performed on a hardware vice identical machine and in some logs we found messages about the cpus being throttled to a lower tick rate in reason of temperature It seems that the segmentation faults occurred right after this temperature incidents.

Comment by Keith Bostic (Inactive) [ 09/Jun/17 ]

Hi irwks, just wanted to make sure we don't lose track of this problem, are you still investigating on your side?

Comment by Keith Bostic (Inactive) [ 03/Jun/17 ]

Hi irwks, I'm sorry to see that you're still having problems.

I've been staring at this one today, and obviously, it's similar to the last failure you saw. Unfortunately, the cause looks the same: it's a segment fault in an extremely heavily-used code path (key lookup in the underlying Btree), and it's a unique failure for the MongoDB release, we haven't had any other customer or user experience segment faults in this path.

The SEGV address is 0x400000153, that is, the low bit is set, and here's the code where it appears we're failing:

    708 #define WT_IK_FLAG                      0x01
    709 #define WT_IK_ENCODE_KEY_LEN(v)         ((uintptr_t)(v) << 32)
    710 #define WT_IK_DECODE_KEY_LEN(v)         ((v) >> 32)
    711 #define WT_IK_ENCODE_KEY_OFFSET(v)      ((uintptr_t)(v) << 1)
    712 #define WT_IK_DECODE_KEY_OFFSET(v)      (((v) & 0xFFFFFFFF) >> 1)
    713         v = (uintptr_t)ref->ref_ikey;
    714         if (v & WT_IK_FLAG) {
    715                 *(void **)keyp =
    716                     WT_PAGE_REF_OFFSET(page, WT_IK_DECODE_KEY_OFFSET(v));
    717                 *sizep = WT_IK_DECODE_KEY_LEN(v);
    718         } else {
    719                 *(void **)keyp = WT_IKEY_DATA(ref->ref_ikey);
    720                 *sizep = ((WT_IKEY *)ref->ref_ikey)->size;                 <<<<<<<< reported stack failure.
    721         }

In this line, *sizep is points to an address in the cursor that we're using all the time, so I'm expecting ref or ref->ref_ikey are the problems. We indirected through ref at line 713, and we checked for a low-order bit set in ref->ref_ikey at line 714, which means we shouldn't be arrive at line 720 and attempt to indirect through 0x400000153.

I'll have someone else review my analysis to make sure I'm not missing something, but this sure looks like memory corruption to me. It's potentially something else (perhaps a wild pointer from some other part of the code, which might be workload based), but absent a reproducible test case, we're going to struggle to debug that.

Have you seen any more failures since this report?

Is there any chance your provider migrated you back to the previous system?

Is there anything the two failing systems have in common that might be relevant? It's unlikely a disk subsystem failure would lead to this kind of failure.

Comment by Ilja [ 27/May/17 ]

Hello again,

as we all know, weekend days are the famous "failure-in-production-days", so sadly I have to update this issue. Our MongoDB instance failed today again on the new machine after running for eight days without any issues. We uploaded the diagnostic data and the mongod.log on the given portal.

Thank you,
Ilja

Comment by Kelsey Schubert [ 22/May/17 ]

Hi irwks,

Thanks for confirming that changing the hardware resolved the issue. Please feel free to reference this ticket. Besides memory corruption, another possibility is that the filesystem had a bad block.

Thanks again,
Thomas

Comment by Ilja [ 21/May/17 ]

Hello Thomas,

thank you very much for your work. We moved our database to another hardware instance since then everything kept running smoothly, even under a more heavy load which our network is currently experiencing since the one dedicated root server ist missing. If this is fine for you, I will refer to this issue when talking to our hosting provider since the RAM-check on the mentioned server did not return any errors. I will also keep this issue updated if something happens in the case.

Thank you!

Comment by Kelsey Schubert [ 19/May/17 ]

Hi irwks,

We've analyzed the provided stack traces. This behavior appears to be the result of fault hardware, specifically memory corruption would most likely explain these segmentation faults. Would you move this mongod instance to a host with new hardware and confirm the that issue is resolved?

Thank you,
Thomas

Comment by Ilja [ 17/May/17 ]

Brief Update: After running smoothly for the last 3 days our system is now failing in production right after restart and and running a couple of queries. I just uploaded the current mongod.log file, the diagnostic data and the matching syslog snippet.

Comment by Ilja [ 15/May/17 ]

Alright, @keith.bostic. To clear out any potential OS installation issues I reinstalled the host completely from scratch and only restored a database backup. If it happens again, I will upload all the needed diagnostic data.

Comment by Keith Bostic (Inactive) [ 15/May/17 ]

Thank you, ramon.fernandez.

irwks, I haven't given up yet, but so far nothing here is pointing to the problem.

Since you're seeing repeated failures, would you upload the same information for your next failure, as you did for this one? Thank you!

Comment by Ramon Fernandez Marina [ 15/May/17 ]

Unfortunately the uploaded logs don't have any additional information that can help us here. We'll let you know if there's other information you may collect that can give us a clue here – thanks in advance for your continued patience.

Regards,
Ramón.

Comment by Ilja [ 12/May/17 ]

Of course, @Keith Bostic, I just uploaded the relevant part of the messages.log file via the given upload portal.

Comment by Keith Bostic (Inactive) [ 12/May/17 ]

irwks, if more failures occur, would you please get us copies of those logs, too? And, could we please have copies of the system diagnostic data (/var/log/messages) covering the time period of this failure, and, of course, any future failures.

Thank you!

Comment by Ilja [ 12/May/17 ]

Hello Thomas,

thank you for your fast response. I have uploaded a zip archive of all requested data, beware that the files are pretty big so the interesting part of mongod.log starts at line 1265526.

Thanks in advance,

Ilja

Comment by Kelsey Schubert [ 12/May/17 ]

Hi irwks,

Would you please upload an archive of the diagnostic.data directory as well as the complete log files of the affected mongod?

I've created a secure upload portal for you to provide these files. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Thank you,
Thomas

Generated at Thu Feb 08 04:20:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.