[SERVER-37655] Segmentation Fault on 3.6.6 Created: 17/Oct/18 Updated: 29/Jul/21 Resolved: 17/Jan/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Devon Yang | Assignee: | Kelsey Schubert |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
We hit a segmentation fault on version 3.6.6 today on our primary. I'll give some additional context, which may or may not be relevant to the error. Recently we have been running up against the connection limit due to an OS limit we haven't lifted yet (~32000), which was causing new connections to fail. While this was happening, we made a change to reduce the number of connections to the primary (~18000); that was working, and then the primary segfaulted. We have three replicas in this cluster, and one replica was already down for another issue (certificate expiration). I assume this prevented re-election, leaving our cluster unavailable until we manually cycled that replica by restarting its process. This issue is about the segfault error itself, but I am also curious why the process did not crash outright (it was left hanging with a single core pegged; not sure what it was doing), since a full crash would have been auto-restarted in our system.

2018-10-17T18:55:43.278+0000 F - [listener] Got signal: 11 (Segmentation fault).
0x56036294d8b1 0x56036294cac9 0x56036294d136 0x7f9764421390 0x7f9764417e8f 0x5603628016db 0x56036223c9ff 0x56036132709f 0x56036132789a 0x5603613256f1 0x5603624853b2 0x5603624915c9 0x560362491811 0x56036249ba5e 0x56036248377e 0x560362a5ce80 0x7f97644176ba 0x7f976414d41d
,{"b":"5603606FA000","o":"2252AC9"},{"b":"5603606FA000","o":"2253136"},{"b":"7F9764410000","o":"11390"},{"b":"7F9764410000","o":"7E8F","s":"pthread_create"},{"b":"5603606FA000","o":"21076DB","s":"_ZN5mongo25launchServiceWorkerThreadESt8functionIFvvEE"},{"b":"5603606FA000","o":"1B429FF","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE"},{"b":"5603606FA000","o":"C2D09F","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE"},{"b":"5603606FA000","o":"C2D89A","s":"_ZN5mongo19ServiceStateMachine5startENS0_9OwnershipE"},{"b":"5603606FA000","o":"C2B6F1","s":"_ZN5mongo21ServiceEntryPointImpl12startSessionESt10shared_ptrINS_9transport7SessionEE"},{"b":"5603606FA000","o":"1D8B3B2"},{"b":"5603606FA000","o":"1D975C9","s":"_ZN4asio6detail9scheduler10do_run_oneERNS0_27conditionally_enabled_mutex11scoped_lockERNS0_21scheduler_thread_infoERKSt10error_code"},{"b":"5603606FA000","o":"1D97811","s":"_ZN4asio6detail9scheduler3runERSt10error_code"},{"b":"5603606FA000","o":"1DA1A5E","s":"_ZN4asio10io_context3runEv"},{"b":"5603606FA000","o":"1D8977E"},{"b":"5603606FA000","o":"2362E80"},{"b":"7F9764410000","o":"76BA"},{"b":"7F9764046000","o":"10741D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.6", "gitVersion" : "6405d65b1d6432e138b44c13085d0c2fe235d6bd", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.0-1049-aws", "version" : "#58-Ubuntu SMP Fri Jan 12 23:17:09 UTC 2018", "machine" : "x86_64" }, "somap" : [ { "b" : "5603606FA000", "elfType" : 3, "buildId" : "F63278FD698B5843222FE7A6C8FF17D6AEFBBE38" }, { "b" : "7FFF7935A000", "elfType" : 3, "buildId" : "3A8AFEDA6CA80FBF2589D7E5803A58BA8F13FE62" }, { "b" : "7F9765605000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "6EF73266978476EF9F2FD2CF31E57F4597CB74F8" }, { "b" : "7F97651C1000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "250E875F74377DFC74DE48BF80CCB237BB4EFF1D" }, { "b" : "7F9764F58000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" :
"513282AC7EB386E2C0133FD9E1B6B8A0F38B047D" }, { "b" : "7F9764D54000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "8CC8D0D119B142D839800BFF71FB71E73AEA7BD4" }, { "b" : "7F9764B4C000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "89C34D7A182387D76D5CDA1F7718F5D58824DFB3" }, { "b" : "7F9764843000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "DFB85DE42DAFFD09640C8FE377D572DE3E168920" }, { "b" : "7F976462D000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7F9764410000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "CE17E023542265FC11D9BC8F534BB4F070493D30" }, { "b" : "7F9764046000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "B5381A457906D279073822A5CEB24C4BFEF94DDB" }, { "b" : "7F9765820000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5D7B6259552275A3C17BD4C3FD05F5A6BF40CAA5" } ] }} |
| Comments |
| Comment by Kelsey Schubert [ 17/Jan/19 ] | |
|
Hi devony, Using this information I was able to parse out the stacks in libpthread. It appears that libpthread was in the process of propagating an ENOMEM error back to mongod when it attempted to access uninitialized memory, resulting in the observed segfault. In this case the failure occurs beneath mongod, and there is little that mongod can do to handle it more gracefully. The solution, as you've found, is to set appropriate kernel/user limits. Since there's nothing more for us to do under this ticket, I'm going to resolve it. Thank you for your help tracking down the root cause of the issue. Kind regards, | |
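For anyone landing on this ticket with the same failure mode, a minimal sketch of the kind of limit changes involved is below; the file names and values are illustrative assumptions, not settings taken from this ticket.

# Raise the memory-map ceiling persistently (file name and value are illustrative)
$ echo 'vm.max_map_count = 131072' | sudo tee /etc/sysctl.d/90-mongodb.conf
$ sudo sysctl -p /etc/sysctl.d/90-mongodb.conf

# Raise per-user process and file limits for the account running mongod
# (illustrative values), e.g. in /etc/security/limits.d/mongod.conf:
#   mongodb soft nproc 64000
#   mongodb hard nproc 64000
#   mongodb soft nofile 64000
#   mongodb hard nofile 64000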
| Comment by Devon Yang [ 06/Dec/18 ] | |
|
Hi Kelsey,
Sorry for the delay; here is the output of that command:
/sbin/ldconfig.real: Path `/lib/x86_64-linux-gnu' given more than once
libpthread.so.0 -> libpthread-2.23.so | |
| Comment by Kelsey Schubert [ 20/Nov/18 ] | |
|
Hi devony, We'd like to take a closer look at the calls and source code inside the shared library, libpthread.so.0. So that we can continue to investigate, would you please provide the version that mongod is linking against? I believe the output of the following command should be sufficient:
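The command itself was not captured in this export; judging from the ldconfig output in the 06/Dec/18 reply, it was most likely something along these lines (the exact flags and grep pattern are an assumption):

# List the libraries known to the dynamic linker and pick out libpthread;
# the "-> libpthread-2.23.so" mapping in the reply identifies the glibc version.
$ /sbin/ldconfig -v 2>&1 | grep libpthread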
Thank you, | |
| Comment by Devon Yang [ 23/Oct/18 ] | |
|
If there is anything specific in the mongod.log you want me to check for, I can help with that as well. Let me know if there's anything else you need. | |
| Comment by Devon Yang [ 20/Oct/18 ] | |
|
I've uploaded a zip with the contents of the directory. | |
| Comment by Devon Yang [ 19/Oct/18 ] | |
|
$ sysctl vm.max_map_count
So it looks like that was the limit we were hitting; thanks! I will get you the archive later today. | |
| Comment by Bruce Lucas (Inactive) [ 19/Oct/18 ] | |
|
devony, can you also show us the output of sysctl vm.max_map_count? This is another limit that affects the number of connections that can be handled. Each connection requires a thread and each thread requires two mapped memory segments for the stack, so this number needs to be at least twice the number of connections that you want to handle (plus some additional amount to account for memory segments mapped for other purposes, such as the memory allocator heap). I don't know if this is the cause of the segfault, but having that information will help us as we consider possible theories. Also, can you please archive and upload to the secure upload portal the content of $dbpath/diagnostic.data from the affected node? The information recorded in this directory is primarily serverStatus at 1-second intervals, and that will give us a clearer picture of resource utilization leading up to the segfault. There is some urgency to this as the data ages out after about a week. Thanks, | |
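As a rough worked example of the sizing rule described above (the ~32000 connection figure comes from the description; the target value here is an illustrative assumption, not a recommendation from this ticket):

# Roughly: 32000 connections -> 32000 threads -> ~64000 memory maps for thread
# stacks alone, which leaves essentially no headroom under the common Linux
# default of 65530 once the allocator heap and other mappings are counted.
# Check the current ceiling and raise it to something comfortably larger:
$ sysctl vm.max_map_count
$ sudo sysctl -w vm.max_map_count=131072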
| Comment by Devon Yang [ 18/Oct/18 ] | |
|
Here is the output of our current limits; they should be the same as what was previously running. Let me check with our security team whether those logs can be shared; I suspect not, though.
{{Limit Soft Limit Hard Limit Units }}
| |
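The column layout above matches what the kernel exposes per process; a sketch of collecting it, assuming a single mongod process on the host:

# Dump the effective soft/hard limits of the running mongod process
$ cat /proc/$(pidof mongod)/limits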
| Comment by Dmitry Agranat [ 18/Oct/18 ] | |
|
Hi devony, Thanks for your report. So that we can better understand what happened and answer your specific questions, please provide the following:
Thanks, |