Type: Question
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Admin
Environment: 3.2.5 WT
Hi Mongo,
We had a bit of a problem this weekend with one of our primaries. The primary was unable to create new threads to handle requests, which effectively took it down. However, it kept responding to the rest of its replica set (presumably on an older, long-lived thread), so no automated failover took place. During the failure, the primary's log was filled with these two lines, repeated ad infinitum:
2016-06-25T23:49:29.455+0000 I NETWORK [initandlisten] failed to create thread after accepting new connection, closing connection
2016-06-25T23:49:29.457+0000 I NETWORK [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
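To make sure I'm reading the error right: here's a minimal C sketch (purely illustrative, not anything we run) that just spawns idle pthreads until pthread_create refuses. My understanding is that once a per-user or system-wide thread limit is hit, it should fail with the same EAGAIN (errno 11) the mongod log shows above.

/* thread_probe.c -- illustrative only: spawn idle pthreads until
 * pthread_create() fails, then report how many we got and why.
 * Build with: cc thread_probe.c -o thread_probe -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void *idle(void *arg) {
    (void)arg;
    pause();                     /* block forever so the thread stays alive */
    return NULL;
}

int main(void) {
    unsigned long count = 0;
    for (;;) {
        pthread_t t;
        int rc = pthread_create(&t, NULL, idle, NULL);
        if (rc != 0) {
            /* EAGAIN (11) is the "Resource temporarily unavailable"
             * case from the mongod log above. */
            fprintf(stderr, "pthread_create failed after %lu threads: %s (%d)\n",
                    count, strerror(rc), rc);
            return 1;
        }
        count++;
    }
}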
My guess is that this problem is usually operator error caused by misconfigured resource limits. I think ours were fine, though, so I'm a bit puzzled.
Here's the output from ulimit -a on this system:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515188
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 515188
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
And /proc/sys/kernel/threads-max is set to 1030376. We restarted the primary once we detected the problem, and the mongod seems to hover between 1K and 5K threads (as measured by ps -eLf | grep mongo | wc -l) under our usual load patterns. I'm not sure how we could have exceeded the limits I'm seeing here, so I think I might be misunderstanding something.
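For what it's worth, something like the following sketch (hypothetical, in the same spirit as the ps one-liner above) is how I've been reading the numbers: it prints the mongod process's current thread count next to the limits I believe govern pthread_create on Linux.

/* limit_check.c -- illustrative only: print a process's current thread
 * count next to the limits I believe apply to pthread_create on Linux.
 * Usage: ./limit_check <mongod-pid>
 * Build with: cc limit_check.c -o limit_check
 */
#include <stdio.h>
#include <sys/resource.h>

static long read_long(const char *path) {
    long v = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &v) != 1) v = -1;
        fclose(f);
    }
    return v;
}

static long thread_count(const char *pid) {
    char path[64], line[256];
    long threads = -1;
    snprintf(path, sizeof(path), "/proc/%s/status", pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "Threads: %ld", &threads) == 1) break;
    fclose(f);
    return threads;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <mongod-pid>\n", argv[0]);
        return 2;
    }

    /* Note: this reads the *calling* process's limit; mongod's own values
     * live in /proc/<pid>/limits ("Max processes"). -1 means unlimited. */
    struct rlimit nproc;
    getrlimit(RLIMIT_NPROC, &nproc);

    printf("threads in pid %-8s : %ld\n", argv[1], thread_count(argv[1]));
    printf("kernel.threads-max    : %ld\n", read_long("/proc/sys/kernel/threads-max"));
    printf("RLIMIT_NPROC (soft)   : %ld\n", (long)nproc.rlim_cur);
    return 0;
}

(My understanding is that RLIMIT_NPROC is counted per user across all of that user's processes, not per process, so anything else running under the same account would also count against that 515188.)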
Two questions for you:
1. Which limits could we be exceeding to cause these pthread_create errors? Perhaps I'm misreading the ulimit output.
2. Any suggestions for system metrics we might monitor to detect and prevent this sort of problem going forward?
Thanks much,
Travis