[SERVER-37655] Segmentation Fault on 3.6.6 Created: 17/Oct/18  Updated: 29/Jul/21  Resolved: 17/Jan/19

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Devon Yang Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL

 Description   

We hit a segmentation fault on version 3.6.6 today on our primary. I'll give some additional context that may or may not be relevant to the error.

Recently we have been running up against the connection limit due to an OS limit we haven't lifted yet (~32000), which was causing new connections to fail. While this was happening we made a change to reduce the number of connections to the primary (~18000), which was working, and then the primary segfaulted.
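For reference, we haven't confirmed which OS limit we were hitting; on a Linux host the usual suspects for a ~32k ceiling can be inspected like this (commands are illustrative, not output from our incident):

$ sysctl kernel.pid_max        # defaults to 32768 on many kernels
$ sysctl kernel.threads-max    # system-wide thread cap
$ ulimit -u                    # per-user process/thread limit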

We have three replicas in this cluster, and one replica was already down for an unrelated issue (certificate expiration). I assume this prevented re-election, leaving our cluster unavailable until we manually cycled the node by restarting the process.

This ticket is about the segfault error itself, but I am also curious why the process did not exit after the crash (it was left hanging with a single core pegged, and I'm not sure what it was doing); a clean crash would have been auto-restarted by our system.

2018-10-17T18:55:43.278+0000 F - [listener] Got signal: 11 (Segmentation fault).

0x56036294d8b1 0x56036294cac9 0x56036294d136 0x7f9764421390 0x7f9764417e8f 0x5603628016db 0x56036223c9ff 0x56036132709f 0x56036132789a 0x5603613256f1 0x5603624853b2 0x5603624915c9 0x560362491811 0x56036249ba5e 0x56036248377e 0x560362a5ce80 0x7f97644176ba 0x7f976414d41d
----- BEGIN BACKTRACE -----

{"backtrace":[{"b":"5603606FA000","o":"22538B1","s":"_ZN5mongo15printStackTraceERSo"}

,{"b":"5603606FA000","o":"2252AC9"},{"b":"5603606FA000","o":"2253136"},{"b":"7F9764410000","o":"11390"},{"b":"7F9764410000","o":"7E8F","s":"pthread_create"},{"b":"5603606FA000","o":"21076DB","s":"_ZN5mongo25launchServiceWorkerThreadESt8functionIFvvEE"},{"b":"5603606FA000","o":"1B429FF","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE"},{"b":"5603606FA000","o":"C2D09F","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE"},{"b":"5603606FA000","o":"C2D89A","s":"_ZN5mongo19ServiceStateMachine5startENS0_9OwnershipE"},{"b":"5603606FA000","o":"C2B6F1","s":"_ZN5mongo21ServiceEntryPointImpl12startSessionESt10shared_ptrINS_9transport7SessionEE"},{"b":"5603606FA000","o":"1D8B3B2"},{"b":"5603606FA000","o":"1D975C9","s":"_ZN4asio6detail9scheduler10do_run_oneERNS0_27conditionally_enabled_mutex11scoped_lockERNS0_21scheduler_thread_infoERKSt10error_code"},{"b":"5603606FA000","o":"1D97811","s":"_ZN4asio6detail9scheduler3runERSt10error_code"},{"b":"5603606FA000","o":"1DA1A5E","s":"_ZN4asio10io_context3runEv"},{"b":"5603606FA000","o":"1D8977E"},{"b":"5603606FA000","o":"2362E80"},{"b":"7F9764410000","o":"76BA"},{"b":"7F9764046000","o":"10741D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.6", "gitVersion" : "6405d65b1d6432e138b44c13085d0c2fe235d6bd", "compiledModules" : [], "uname" :

{ "sysname" : "Linux", "release" : "4.4.0-1049-aws", "version" : "#58-Ubuntu SMP Fri Jan 12 23:17:09 UTC 2018", "machine" : "x86_64" }

, "somap" : [ { "b" : "5603606FA000", "elfType" : 3, "buildId" : "F63278FD698B5843222FE7A6C8FF17D6AEFBBE38" }, { "b" : "7FFF7935A000", "elfType" : 3, "buildId" : "3A8AFEDA6CA80FBF2589D7E5803A58BA8F13FE62" }, { "b" : "7F9765605000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "6EF73266978476EF9F2FD2CF31E57F4597CB74F8" }, { "b" : "7F97651C1000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "250E875F74377DFC74DE48BF80CCB237BB4EFF1D" }, { "b" : "7F9764F58000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "513282AC7EB386E2C0133FD9E1B6B8A0F38B047D" }, { "b" : "7F9764D54000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "8CC8D0D119B142D839800BFF71FB71E73AEA7BD4" }, { "b" : "7F9764B4C000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "89C34D7A182387D76D5CDA1F7718F5D58824DFB3" }, { "b" : "7F9764843000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "DFB85DE42DAFFD09640C8FE377D572DE3E168920" }, { "b" : "7F976462D000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7F9764410000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "CE17E023542265FC11D9BC8F534BB4F070493D30" }, { "b" : "7F9764046000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "B5381A457906D279073822A5CEB24C4BFEF94DDB" }, { "b" : "7F9765820000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5D7B6259552275A3C17BD4C3FD05F5A6BF40CAA5" } ] }}
mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x56036294d8b1]
mongod(+0x2252AC9) [0x56036294cac9]
mongod(+0x2253136) [0x56036294d136]
libpthread.so.0(+0x11390) [0x7f9764421390]
libpthread.so.0(pthread_create+0x4FF) [0x7f9764417e8f]
mongod(_ZN5mongo25launchServiceWorkerThreadESt8functionIFvvEE+0xDB) [0x5603628016db]
mongod(_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE+0x2FF) [0x56036223c9ff]
mongod(_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE+0x15F) [0x56036132709f]
mongod(_ZN5mongo19ServiceStateMachine5startENS0_9OwnershipE+0x13A) [0x56036132789a]
mongod(_ZN5mongo21ServiceEntryPointImpl12startSessionESt10shared_ptrINS_9transport7SessionEE+0x881) [0x5603613256f1]
mongod(+0x1D8B3B2) [0x5603624853b2]
mongod(_ZN4asio6detail9scheduler10do_run_oneERNS0_27conditionally_enabled_mutex11scoped_lockERNS0_21scheduler_thread_infoERKSt10error_code+0x389) [0x5603624915c9]
mongod(_ZN4asio6detail9scheduler3runERSt10error_code+0xD1) [0x560362491811]
mongod(_ZN4asio10io_context3runEv+0x3E) [0x56036249ba5e]
mongod(+0x1D8977E) [0x56036248377e]
mongod(+0x2362E80) [0x560362a5ce80]
libpthread.so.0(+0x76BA) [0x7f97644176ba]
libc.so.6(clone+0x6D) [0x7f976414d41d]
----- END BACKTRACE -----



 Comments   
Comment by Kelsey Schubert [ 17/Jan/19 ]

Hi devony,

Using this information I was able to parse out the stacks in libpthread. It appears that pthread_create failed with ENOMEM, and libpthread was in the process of propagating that error back to mongod when it accessed uninitialized memory, resulting in the observed segfault. In this case the failure occurs beneath mongod, and there is little that mongod can do to handle it more gracefully. The solution, as you've found, is to set appropriate kernel/user limits.
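For anyone else who lands here, the relevant limits can be raised along these lines (the values are illustrative examples only, not recommendations from this investigation):

$ sudo sysctl -w kernel.pid_max=262144     # example value
$ sudo sysctl -w vm.max_map_count=131060   # example: double the 65530 default
$ # plus raising the mongod user's nproc/nofile, e.g. in /etc/security/limits.conf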

Since there's nothing more for us to do under this ticket, I'm going to resolve it. Thank you for your help tracking down the root cause of the issue.

Kind regards,
Kelsey

Comment by Devon Yang [ 06/Dec/18 ]

Hi Kelsey,

Sorry for the delay; here is the output of that command:

/sbin/ldconfig.real: Path `/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: Path `/usr/lib/x86_64-linux-gnu' given more than once
/sbin/ldconfig.real: /lib/x86_64-linux-gnu/ld-2.23.so is the dynamic linker, ignoring

libpthread.so.0 -> libpthread-2.23.so

Comment by Kelsey Schubert [ 20/Nov/18 ]

Hi devony,

We'd like to take a closer look at the calls and source code inside the shared library libpthread.so.0. So that we can continue to investigate, would you please provide the version that mongod is linking against? I believe the output of the following command should be sufficient:

ldconfig -v | grep libpthread

Thank you,
Kelsey

Comment by Devon Yang [ 23/Oct/18 ]

If there is anything specific in the mongod.log you want me to check for, I can help with that as well. Let me know if there's anything else you need.

Comment by Devon Yang [ 20/Oct/18 ]

I've uploaded a zip with the contents of the directory.

Comment by Devon Yang [ 19/Oct/18 ]

$ sysctl vm.max_map_count
vm.max_map_count = 65530

So it looks like that was the culprit behind our limit. Thanks! I will get you the archive later today.

Comment by Bruce Lucas (Inactive) [ 19/Oct/18 ]

devony, can you also show us the output of sysctl vm.max_map_count? This is another limit that affects the number of connections that can be handled. Each connection requires a thread and each thread requires two mapped memory segments for the stack, so this number needs to be at least twice the number of connections that you want to handle (plus some additional amount to account for memory segments mapped for other purposes, such as the memory allocator heap). I don't know if this is the cause of the segfault, but having that information will help us as we consider possible theories.
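As a rough worked example with the numbers from this ticket (a sizing sketch, not a tuning recommendation): at ~32000 connections, two maps per connection thread already account for most of the default map budget, leaving almost no headroom for the allocator and other mappings.

$ sysctl vm.max_map_count    # default: vm.max_map_count = 65530
$ echo $((2 * 32000))        # = 64000 maps for thread stacks alone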

Also, can you please archive and upload to the secure upload portal the content of $dbpath/diagnostic.data from the affected node? The information recorded in this directory is primarily serverStatus at 1-second intervals, and that will give us a clearer picture of resource utilization leading up to the segfault. There is some urgency to this as the data ages out after about a week.
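For example, something along these lines should work (substitute the actual dbpath of your deployment; /var/lib/mongodb here is just a placeholder):

$ tar -czf diagnostic.data.tar.gz -C /var/lib/mongodb diagnostic.data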

Thanks,
Bruce

Comment by Devon Yang [ 18/Oct/18 ]

Here is the output of our current limits; it should be the same as what was previously running. Let me check with our security team whether those logs can be shared; I suspect not, though.

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             256000               256000               processes
Max open files            256000               256000               files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       491450               491450               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us

Comment by Dmitry Agranat [ 18/Oct/18 ]

Hi devony,

Thanks for your report. So that we can better understand what happened and answer your specific questions, please provide the following:

  • mongod log covering about 10 minutes before the reported segmentation fault and until the process was restarted. Since this is a public project, you can upload this information to our secure upload portal if you'd prefer. Information that you share there is only visible to MongoDB employees, and it is automatically removed after a period of time.
  • the output of cat /proc/PID/limits (replacing "PID" with the pid of the mongod process) for the instance on which this error occurred, so that we can confirm all of the ulimits (an example follows below).
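
For example, assuming a single mongod process on the host (pidof availability varies by distro):

$ cat /proc/$(pidof mongod)/limits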

Thanks,
Dima
