[SERVER-33392] Server terminates when unable to create worker thread Created: 20/Feb/18  Updated: 27/Oct/23  Resolved: 03/Jan/19

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 3.2.0
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Kevin Pulo Assignee: ADAM Martin (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Sprint: Service Arch 2018-11-05, Service Arch 2018-11-19, Service Arch 2018-12-03, Service Arch 2018-12-17, Service Arch 2018-12-31, Service Arch 2019-01-14
Participants:
Case:

 Description   

When the server is unable to create a thread, the outcome depends on the context:

  • Thread for incoming connection: server logs the failure, then closes the connection and continues

    2018-02-16T10:39:32.289-0800 I NETWORK  [initandlisten] connection accepted from 10.x.x.x:xxxxx #94318774 (31890 connections now open)
    2018-02-16T10:39:32.289-0800 I NETWORK  [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
    2018-02-16T10:39:32.291-0800 I NETWORK  [initandlisten] failed to create thread after accepting new connection, closing connection
    

  • Worker thread: server logs the failure, then terminates

    2018-02-16T10:40:17.302-0800 F -        [NetworkInterfaceASIO-BGSync-0] std::exception::what(): Resource temporarily unavailable
    Actual exception type: std::system_error
     
     0x1351ff2 0x1351b42 0x1b37646 0x1b37673 0x12df774 0x12dfcc8 0x112011e 0x11209fe 0x112114c 0x111423f 0x11096bd 0x110a2da 0x110a8d8 0x1108070 0x10db4e0 0x10e99bc 0x10e9e78 0x136e5f1 0x136e811 0x1101d3f 0x1b7f610 0x7f7e80f8d184 0x7f7e80cba03d
    ----- BEGIN BACKTRACE -----
    {"backtrace":[{"b":"400000","o":"F51FF2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"F51B42"},{"b":"400000","o":"1737646","s":"_ZN10__cxxabiv111__terminateEPFvvE"},{"b":"400000","o":"1737673"},{"b":"400000","o":"EDF774","s":"_ZN5mongo10ThreadPool25_startWorkerThread_inlockEv"},{"b":"400000","o":"EDFCC8","s":"_ZN5mongo10ThreadPool8scheduleESt8functionIFvvEE"},{"b":"400000","o":"D2011E","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPSt4listISt10shared_ptrINS1_13CallbackStateEESaIS5_EERKSt14_List_iteratorIS5_ESC_St11unique_lockISt5mutexE"},{"b":"400000","o":"D209FE","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPSt4listISt10shared_ptrINS1_13CallbackStateEESaIS5_EERKSt14_List_iteratorIS5_ESt11unique_lockISt5mutexE"},{"b":"400000","o":"D2114C"},{"b":"400000","o":"D1423F","s":"_ZN5mongo8executor20NetworkInterfaceASIO7AsyncOp6finishERKNS_10StatusWithINS0_21RemoteCommandResponseEEE"},{"b":"400000","o":"D096BD","s":"_ZN5mongo8executor20NetworkInterfaceASIO18_completeOperationEPNS1_7AsyncOpERKNS_10StatusWithINS0_21RemoteCommandResponseEEE"},{"b":"400000","o":"D0A2DA","s":"_ZN5mongo8executor20NetworkInterfaceASIO20_completedOpCallbackEPNS1_7AsyncOpE"},{"b":"400000","o":"D0A8D8"},{"b":"400000","o":"D08070"},{"b":"400000","o":"CDB4E0","s":"_ZN4asio6detail14strand_service8dispatchINS0_7binder2IRSt8functionIFvSt10error_codemEES5_mEEEEvRPNS1_11strand_implERT_"},{"b":"400000","o":"CE99BC","s":"_ZN4asio6detail14strand_service8dispatchINS0_17rewrapped_handlerINS0_7binder2INS0_7read_opINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceIS8_EEEENS_17mutable_buffers_1ENS0_14transfer_all_tENS0_15wrapped_handlerINS_10io_service6strandESt8functionIFvSt10error_codemEENS0_26is_continuation_if_runningEEEEESI_mEESK_EEEEvRPNS1_11strand_implERT_"},{"b":"400000","o":"CE9E78","s":"_ZN4asio6detail23reactive_socket_recv_opINS_17mutable_buffers_1ENS0_7read_opINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_service
IS6_EEEES2_NS0_14transfer_all_tENS0_15wrapped_handlerINS_10io_service6strandESt8functionIFvSt10error_codemEENS0_26is_continuation_if_runningEEEEEE11do_completeEPvPNS0_19scheduler_operationERKSF_m"},{"b":"400000","o":"F6E5F1","s":"_ZN4asio6detail9scheduler10do_run_oneERNS0_11scoped_lockINS0_11posix_mutexEEERNS0_21scheduler_thread_infoERKSt10error_code"},{"b":"400000","o":"F6E811","s":"_ZN4asio6detail9scheduler3runERSt10error_code"},{"b":"400000","o":"D01D3F"},{"b":"400000","o":"177F610","s":"execute_native_thread_routine"},{"b":"7F7E80F85000","o":"8184"},{"b":"7F7E80BBC000","o":"FE03D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.16", "gitVersion" : "056bf45128114e44c5358c7a8776fb582363e094", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.0-79-generic", "version" : "#100~14.04.1-Ubuntu SMP Fri May 19 18:36:51 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "08B7F4039582A49C92A2B9D92929EF6B690B4F4A" }, { "b" : "7FFFEF77E000", "elfType" : 3, "buildId" : "3449FF93C74CB63856A9BE01B606A0BB1DE26BE3" }, { "b" : "7F7E81EA7000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "1287BAA0C3440FDF4F9A5AB267445129A9DBD14E" }, { "b" : "7F7E81ACB000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "3F882E7949FA0CB52422985A88CDD7E6182CBD70" }, { "b" : "7F7E818C3000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "4F930712D3609C93E380E5BE5DF73E7AD273531C" }, { "b" : "7F7E816BF000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "034D6A4EE9DCAB4A34ABD644345CBBB42DC63088" }, { "b" : "7F7E813B9000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "300C7884CDEB5667BEA2357D2B8E7A76397562D6" }, { "b" : "7F7E811A3000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "36311B4457710AE5578C4BF00791DED7359DBB92" }, { "b" : "7F7E80F85000", "path" : 
"/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "F64B8AD471FBA1B7A3A64EFB01551E694975E1F7" }, { "b" : "7F7E80BBC000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "D9A10B8EF90300628DD0A3A535106967714D7328" }, { "b" : "7F7E82106000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "2CA513EDC89C7BC06EC183D1A3A03CC0F606319C" } ] }}
     mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x1351ff2]
     mongod(+0xF51B42) [0x1351b42]
     mongod(_ZN10__cxxabiv111__terminateEPFvvE+0x6) [0x1b37646]
     mongod(+0x1737673) [0x1b37673]
     mongod(_ZN5mongo10ThreadPool25_startWorkerThread_inlockEv+0xA34) [0x12df774]
     mongod(_ZN5mongo10ThreadPool8scheduleESt8functionIFvvEE+0x348) [0x12dfcc8]
     mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPSt4listISt10shared_ptrINS1_13CallbackStateEESaIS5_EERKSt14_List_iteratorIS5_ESC_St11unique_lockISt5mutexE+0x2AE) [0x112011e]
     mongod(_ZN5mongo8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPSt4listISt10shared_ptrINS1_13CallbackStateEESaIS5_EERKSt14_List_iteratorIS5_ESt11unique_lockISt5mutexE+0x3E) [0x11209fe]
     mongod(+0xD2114C) [0x112114c]
     mongod(_ZN5mongo8executor20NetworkInterfaceASIO7AsyncOp6finishERKNS_10StatusWithINS0_21RemoteCommandResponseEEE+0x14F) [0x111423f]
     mongod(_ZN5mongo8executor20NetworkInterfaceASIO18_completeOperationEPNS1_7AsyncOpERKNS_10StatusWithINS0_21RemoteCommandResponseEEE+0x35D) [0x11096bd]
     mongod(_ZN5mongo8executor20NetworkInterfaceASIO20_completedOpCallbackEPNS1_7AsyncOpE+0x6A) [0x110a2da]
     mongod(+0xD0A8D8) [0x110a8d8]
     mongod(+0xD08070) [0x1108070]
     mongod(_ZN4asio6detail14strand_service8dispatchINS0_7binder2IRSt8functionIFvSt10error_codemEES5_mEEEEvRPNS1_11strand_implERT_+0x70) [0x10db4e0]
     mongod(_ZN4asio6detail14strand_service8dispatchINS0_17rewrapped_handlerINS0_7binder2INS0_7read_opINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceIS8_EEEENS_17mutable_buffers_1ENS0_14transfer_all_tENS0_15wrapped_handlerINS_10io_service6strandESt8functionIFvSt10error_codemEENS0_26is_continuation_if_runningEEEEESI_mEESK_EEEEvRPNS1_11strand_implERT_+0x89C) [0x10e99bc]
     mongod(_ZN4asio6detail23reactive_socket_recv_opINS_17mutable_buffers_1ENS0_7read_opINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceIS6_EEEES2_NS0_14transfer_all_tENS0_15wrapped_handlerINS_10io_service6strandESt8functionIFvSt10error_codemEENS0_26is_continuation_if_runningEEEEEE11do_completeEPvPNS0_19scheduler_operationERKSF_m+0x228) [0x10e9e78]
     mongod(_ZN4asio6detail9scheduler10do_run_oneERNS0_11scoped_lockINS0_11posix_mutexEEERNS0_21scheduler_thread_infoERKSt10error_code+0x2F1) [0x136e5f1]
     mongod(_ZN4asio6detail9scheduler3runERSt10error_code+0xC1) [0x136e811]
     mongod(+0xD01D3F) [0x1101d3f]
     mongod(execute_native_thread_routine+0x20) [0x1b7f610]
     libpthread.so.0(+0x8184) [0x7f7e80f8d184]
     libc.so.6(clone+0x6D) [0x7f7e80cba03d]
    -----  END BACKTRACE  -----
    

Failure to create a thread is often the result of a temporary condition, i.e. EAGAIN ("Resource temporarily unavailable"). In this case, terminating the server is an overly drastic response. It would be much better if the server handled this situation more gracefully, e.g. by failing the operation that triggered the worker thread creation (perhaps with a message informing the requesting application or user of the temporary failure and advising them to try again).
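
As a rough illustration of the "fail the operation instead of terminating" idea, here is a minimal sketch in standard C++. It is not the server's ThreadPool code: StartResult, ThreadFactory and tryStartWorker are hypothetical names invented for this example, and the thread factory is passed in only so that a creation failure can be simulated. The one library fact it relies on is that the std::thread constructor throws std::system_error when the thread cannot be started.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <system_error>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical result type for this sketch; not a type from the server.
struct StartResult {
    bool ok;              // true if the worker thread was started
    bool retryable;       // true if the failure looks temporary (EAGAIN)
    std::string message;  // what() of the caught exception, if any
};

// Hypothetical indirection so callers (and tests) can supply the code that
// actually constructs the thread.
using ThreadFactory = std::function<std::thread(std::function<void()>)>;

// Tries to start one worker. On failure the error is reported to the caller
// instead of escaping and terminating the process.
StartResult tryStartWorker(const ThreadFactory& makeThread,
                           std::vector<std::thread>& workers,
                           std::function<void()> work) {
    assert(work != nullptr);  // precondition: there is something to run
    try {
        // Reserve first so push_back cannot throw after the thread exists.
        workers.reserve(workers.size() + 1);
        workers.push_back(makeThread(std::move(work)));
        return {true, false, ""};
    } catch (const std::system_error& e) {
        // EAGAIN maps to std::errc::resource_unavailable_try_again.
        const bool retryable =
            (e.code() == std::errc::resource_unavailable_try_again);
        return {false, retryable, e.what()};
    }
}
```

A caller could pass a factory such as `[](std::function<void()> w) { return std::thread(std::move(w)); }` and, when `ok` is false and `retryable` is true, fail only the requesting operation with a "try again" error. The caller remains responsible for joining the threads stored in `workers`.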

If the operation that triggered the worker thread creation is server startup, then it would be acceptable to terminate the server (since server startup itself has failed). Failure to create a worker thread that is spawned after startup but is absolutely essential could also terminate the server, but presumably not all worker threads fall into this category (e.g. threads for ASIO egress).



 Comments   
Comment by ADAM Martin (Inactive) [ 03/Jan/19 ]

Presently, all failure-to-create-thread events are fatal, for the reasons discussed in my previous comment. We have no plans to change this at this time, since every server-critical thread would have to be addressed.

Comment by ADAM Martin (Inactive) [ 03/Jan/19 ]

This is not something we can fix at this time. The server spawns numerous background threads as needed, so out-of-resource situations (not enough threads, for example) might be made recoverable for those threads. However, the server also has numerous required threads that are spawned on demand. Until every one of those required threads is launched at startup (or at least provisioned in a way that prevents fatal circumstances), insufficient resources for new threads will always be an issue. Basically, even if we made this particular failure-to-create-thread event non-fatal, when running near the wall (as this user probably is), some other failure-to-create-thread event would be fatal anyway.
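
To make "launched at startup" concrete, the following is a minimal sketch of a pool that creates all of its threads in the constructor, so a thread-creation failure can only surface during startup and scheduling work later never calls into thread creation. PreProvisionedPool and its members are hypothetical names for illustration only; this is generic standard C++, not how the server's thread pools are implemented.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size pool: every thread is created up front.
class PreProvisionedPool {
public:
    explicit PreProvisionedPool(std::size_t numThreads) {
        _threads.reserve(numThreads);
        try {
            for (std::size_t i = 0; i < numThreads; ++i) {
                _threads.emplace_back([this] { _run(); });
            }
        } catch (...) {
            // Thread creation failed during startup: stop the threads that
            // did start, then report the failure to the caller.
            _shutdown();
            throw;
        }
    }

    ~PreProvisionedPool() { _shutdown(); }

    PreProvisionedPool(const PreProvisionedPool&) = delete;
    PreProvisionedPool& operator=(const PreProvisionedPool&) = delete;

    // Queues a task. No thread is created here, so this cannot fail with
    // "Resource temporarily unavailable".
    void schedule(std::function<void()> task) {
        assert(task != nullptr);
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _tasks.push_back(std::move(task));
        }
        _cv.notify_one();
    }

private:
    void _run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(_mutex);
                _cv.wait(lk, [this] { return _stopping || !_tasks.empty(); });
                if (_tasks.empty()) return;  // stopping and queue drained
                task = std::move(_tasks.front());
                _tasks.pop_front();
            }
            task();
        }
    }

    void _shutdown() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _stopping = true;
        }
        _cv.notify_all();
        for (auto& t : _threads) {
            if (t.joinable()) t.join();
        }
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<std::function<void()>> _tasks;
    bool _stopping = false;
    std::vector<std::thread> _threads;
};
```

The trade-off in this sketch is that the thread count is fixed at construction: threads are held even when idle, and tasks wait in the queue when all threads are busy.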

Comment by Andrew Morrow (Inactive) [ 15/Oct/18 ]

adam.martin@mongodb.com - Queuing this up for you for next sprint to investigate whether it has already been fixed.
