[SERVER-48509] "failed to create service entry worker thread" discards root cause exception message Created: 30/May/20  Updated: 29/Oct/23  Resolved: 07/Jul/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.3.6
Fix Version/s: 4.7.0

Type: Bug Priority: Major - P3
Reporter: Oleg Pudeyev (Inactive) Assignee: Andrew Chen (Inactive)
Resolution: Fixed Votes: 1
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-49455 Improve error reporting in launchServ... Closed
Problem/Incident
Related
related to SERVER-47075 Clean up log lines in mongo/transport... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Service arch 2020-06-29, Service arch 2020-07-13
Participants:
Linked BF Score: 23

 Description   

I created a client in Ruby configured to establish 20,000 connections to local replica set servers (PSSA), as follows:

Mongo::Client.new(['localhost:34420'],min_pool_size:20000,max_pool_size:100000)

At about 6,000 connections to each server, the servers start closing connections. In the server log I see:

{"t":{"$date":"2020-05-29T21:49:15.723-04:00"},"s":"W", "c":"EXECUTOR","id":22993,  "ctx":"conn5604","msg":"Terminating session due to error: {status}","attr":{"status":{"code":1,"codeName":"InternalError","errmsg":"failed to create service entry worker thread"}}}

This message appears to be produced in src/mongo/transport/service_entry_point_utils.cpp:

    } catch (...) {
        return {ErrorCodes::InternalError, "failed to create service entry worker thread"};
    }

The above code appears to discard the root cause of the error, making further troubleshooting impossible.

As a MongoDB user, I would like the server to provide error messages that indicate the cause of a problem, so that I can troubleshoot it.



 Comments   
Comment by G F [ 03/Oct/21 ]

Excuse me, may I ask a question: is it conceivable that launchServiceWorkerThread spawns a new thread for each command received from a client within a single continuous session, at least when compiled for Windows rather than Linux? Surely I have misunderstood?

Comment by Githook User [ 07/Jul/20 ]

Author:

{'name': 'Andrew Chen', 'email': 'a.chen@mongodb.com', 'username': 'AndrooTheChen'}

Message: SERVER-48509 fixed uassert condition
Branch: master
https://github.com/mongodb/mongo/commit/f39585101d93f47c216ea8c30e276ac0410c30a2

Comment by Githook User [ 07/Jul/20 ]

Author:

{'name': 'Andrew Chen', 'email': 'a.chen@mongodb.com', 'username': 'AndrooTheChen'}

Message: SERVER-48509 More revisions to exception logging
Branch: master
https://github.com/mongodb/mongo/commit/3974f1e8dd738579177245d1f9d0238cb84ac81f

Comment by Githook User [ 07/Jul/20 ]

Author:

{'name': 'Andrew Chen', 'email': 'a.chen@mongodb.com', 'username': 'AndrooTheChen'}

Message: SERVER-48509 Added uassert and modified catch block
Branch: master
https://github.com/mongodb/mongo/commit/64fa847bcb999ceb5d01e4e39b266fbacdbb00cb

Comment by Githook User [ 07/Jul/20 ]

Author:

{'name': 'Andrew Chen', 'email': 'a.chen@mongodb.com', 'username': 'AndrooTheChen'}

Message: SERVER-48509 Catch and log exceptions when creating threads fail
Branch: master
https://github.com/mongodb/mongo/commit/0230f90f1b8e3bda4af59f4b32b5f265220f420a

Comment by Benjamin Caimano (Inactive) [ 01/Jun/20 ]

oleg.pudeyev, there should also be this log statement which does specify the reason behind the failure. That said, the try-catch behavior does have a small possibility for us to swallow exceptions if they weren't thrown in those lines. I think this is separate from SERVER-47075 but maybe worth doing a touch of work: we can throw with the error statement we expect, catch a std::exception, and return a status with `e.what()` attached. We should also probably mark launchServiceWorkerThread noexcept.
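
The suggestion above (catch a std::exception and return a status with `e.what()` attached) might look roughly like the following. This is only an illustrative sketch, not the server's actual code: `SimpleStatus` and `launchWorker` are hypothetical stand-ins for the real status type and launchServiceWorkerThread, and thread creation is abstracted as a callable so the failure path can be exercised. The function is not marked noexcept here because building the message string can itself throw.

```cpp
#include <exception>
#include <functional>
#include <string>
#include <system_error>

// Hypothetical stand-in for the server's status type (not the real mongo::Status).
struct SimpleStatus {
    bool ok;
    std::string reason;
};

// Hypothetical launcher: runs `spawn` (which would create the worker thread)
// and converts any exception into a status that keeps the original message.
SimpleStatus launchWorker(const std::function<void()>& spawn) {
    const std::string prefix = "failed to create service entry worker thread";
    try {
        spawn();
        return {true, ""};
    } catch (const std::exception& e) {
        // Preserve the root cause instead of discarding it.
        return {false, prefix + ": " + e.what()};
    } catch (...) {
        // Non-standard exceptions carry no message; say so explicitly.
        return {false, prefix + ": unknown non-standard exception"};
    }
}
```

If `spawn` throws a std::system_error (as std::thread does when the system cannot create a thread), the returned reason contains that exception's message after the fixed prefix, so the log line would show why thread creation failed.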

Comment by Bruce Lucas (Inactive) [ 30/May/20 ]

Looks like we'll be touching this log line soon as part of SERVER-47075; maybe we can fix this then.

Comment by Oleg Pudeyev (Inactive) [ 30/May/20 ]

Test code: https://github.com/p-mongo/tests/blob/master/connect-limit/test.rb

Not relevant to this ticket but connections for each server are established sequentially. The three servers are being connected to in parallel.

Output: https://gist.github.com/p-mongo/aaf3a9351e46c2b06bf25f6d3b5c4ee1

It seems that when the total number of connections in the system reaches about 10,000, the server fails in the manner indicated. The connections could be split evenly or unevenly across the server processes. Sometimes a particular server does not fail when the total number of connections is 10,000 (possibly because it is not being connected to at that moment, and by the time its next connection happens the number of connections has already dropped).

Generated at Thu Feb 08 05:17:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.