Type: Question
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Admin
Environment: 3.2.5 WT
Hi Mongo,
We had a bit of a problem this weekend with one of our primaries. The primary was unable to create new threads to handle requests, which effectively took it down. However, it kept responding to the rest of its replica set (presumably on an older, long-lived thread), so no automated failover took place. During the failure, the primary's log was filled with these two lines, repeated ad infinitum:
2016-06-25T23:49:29.455+0000 I NETWORK [initandlisten] failed to create thread after accepting new connection, closing connection
2016-06-25T23:49:29.457+0000 I NETWORK [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
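To make sure I'm reading the error right: here's a minimal C sketch (purely illustrative, not anything we run) that just spawns idle pthreads until pthread_create refuses. My understanding is that once a per-user or system-wide thread limit is hit, it should fail with the same EAGAIN (errno 11) the mongod log shows above.

/* thread_probe.c -- illustrative only: spawn idle pthreads until
 * pthread_create() fails, then report how many we got and why.
 * Build with: cc thread_probe.c -o thread_probe -lpthread
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void *idle(void *arg) {
    (void)arg;
    pause();                     /* block forever so the thread stays alive */
    return NULL;
}

int main(void) {
    unsigned long count = 0;
    for (;;) {
        pthread_t t;
        int rc = pthread_create(&t, NULL, idle, NULL);
        if (rc != 0) {
            /* EAGAIN (11) is the "Resource temporarily unavailable"
             * case from the mongod log above. */
            fprintf(stderr, "pthread_create failed after %lu threads: %s (%d)\n",
                    count, strerror(rc), rc);
            return 1;
        }
        count++;
    }
}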
My guess is that this problem is usually operator error caused by misconfigured resource limits. I think ours were fine, though, so I'm a bit puzzled.
Here's the output from ulimit -a on this system:
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515188
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 100000
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 515188
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
And /proc/sys/kernel/threads-max is set to 1030376. We restarted the primary once we detected the problem, and the mongod seems to hover between 1K and 5K threads (as measured by ps -eLf | grep mongo | wc -l) under our usual load patterns. I'm not sure how we could have exceeded the limits I'm seeing here, so I think I might be misunderstanding something.
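For what it's worth, something like the following sketch (hypothetical, in the same spirit as the ps one-liner above) is how I've been reading the numbers: it prints the mongod process's current thread count next to the limits I believe govern pthread_create on Linux.

/* limit_check.c -- illustrative only: print a process's current thread
 * count next to the limits I believe apply to pthread_create on Linux.
 * Usage: ./limit_check <mongod-pid>
 * Build with: cc limit_check.c -o limit_check
 */
#include <stdio.h>
#include <sys/resource.h>

static long read_long(const char *path) {
    long v = -1;
    FILE *f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &v) != 1) v = -1;
        fclose(f);
    }
    return v;
}

static long thread_count(const char *pid) {
    char path[64], line[256];
    long threads = -1;
    snprintf(path, sizeof(path), "/proc/%s/status", pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "Threads: %ld", &threads) == 1) break;
    fclose(f);
    return threads;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <mongod-pid>\n", argv[0]);
        return 2;
    }

    /* Note: this reads the *calling* process's limit; mongod's own values
     * live in /proc/<pid>/limits ("Max processes"). -1 means unlimited. */
    struct rlimit nproc;
    getrlimit(RLIMIT_NPROC, &nproc);

    printf("threads in pid %-8s : %ld\n", argv[1], thread_count(argv[1]));
    printf("kernel.threads-max    : %ld\n", read_long("/proc/sys/kernel/threads-max"));
    printf("RLIMIT_NPROC (soft)   : %ld\n", (long)nproc.rlim_cur);
    return 0;
}

(My understanding is that RLIMIT_NPROC is counted per user across all of that user's processes, not per process, so anything else running under the same account would also count against that 515188.)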
Two questions for you:
1. Which limits could we be exceeding to cause these pthread_create errors? Perhaps I'm misreading the ulimit output.
2. Any suggestions for system metrics we might monitor to detect and prevent this sort of problem going forward?
Thanks much,
Travis