  1. Core Server
  2. SERVER-24812

Thread starvation even with proper ulimits

    • Type: Question
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Admin
    • Labels:
      None
    • Environment:
      3.2.5 WT

      Hi Mongo,

      We had a problem this weekend with one of our primaries. The primary was unable to create new threads to handle requests, which effectively took it down. However, it kept responding to the rest of its replica set (presumably on an older, long-lived thread), so no automated failover took place. During the failure, the primary's log was filled with these two lines, repeated ad infinitum:

      2016-06-25T23:49:29.455+0000 I NETWORK  [initandlisten] failed to create thread after accepting new connection, closing connection
      2016-06-25T23:49:29.457+0000 I NETWORK  [initandlisten] pthread_create failed: errno:11 Resource temporarily unavailable
      

      My guess is that this problem is usually operator error caused by misconfigured resource limits. I think ours were fine, though, so I'm a bit puzzled.

      Here's the output from ulimit -a on this system:

      core file size          (blocks, -c) 0
      data seg size           (kbytes, -d) unlimited
      scheduling priority             (-e) 0
      file size               (blocks, -f) unlimited
      pending signals                 (-i) 515188
      max locked memory       (kbytes, -l) 64
      max memory size         (kbytes, -m) unlimited
      open files                      (-n) 100000
      pipe size            (512 bytes, -p) 8
      POSIX message queues     (bytes, -q) 819200
      real-time priority              (-r) 0
      stack size              (kbytes, -s) 8192
      cpu time               (seconds, -t) unlimited
      max user processes              (-u) 515188
      virtual memory          (kbytes, -v) unlimited
      file locks                      (-x) unlimited
      

      And /proc/sys/kernel/threads-max is set to 1030376. We restarted the primary once we detected the problem, and the mongod seems to hover between 1K and 5K threads (as measured by ps -eLf | grep mongo | wc -l) under our usual load patterns. I'm not sure how we could have exceeded the limits I'm seeing here, so I think I might be misunderstanding something.
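One thing that may be worth checking: the values printed by ulimit -a describe the shell that ran the command, and a mongod started by an init script or service manager can end up with different limits. On Linux the limits of the running process are listed in /proc/<pid>/limits, and its current thread count is the Threads: field of /proc/<pid>/status. The following is a minimal sketch of reading both, not a definitive tool; the helper names (parse_thread_count, parse_soft_limit) are made up for illustration, and the pid is assumed to be passed on the command line.

```python
import re
import sys
from pathlib import Path


def parse_thread_count(status_text: str) -> int:
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:\s+(\d+)", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Threads:' field found")
    return int(match.group(1))


def parse_soft_limit(limits_text: str, name: str):
    """Return the soft limit for `name` from the text of /proc/<pid>/limits.

    Returns None when the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith(name):
            # Columns after the name are: soft limit, hard limit, units.
            soft = line[len(name):].split()[0]
            return None if soft == "unlimited" else int(soft)
    raise KeyError(name)


if __name__ == "__main__":
    pid = int(sys.argv[1])  # pid of the running mongod (supplied by the caller)
    proc = Path("/proc") / str(pid)
    threads = parse_thread_count((proc / "status").read_text())
    max_procs = parse_soft_limit((proc / "limits").read_text(), "Max processes")
    print(f"threads={threads} max_processes={max_procs}")
```

If the "Max processes" value reported for the mongod pid is much lower than the 515188 shown by ulimit -a, that would point to the process having been started with different limits than the interactive shell.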

      Two questions for you:

      1. Which limits might have been exceeded to cause these pthread_create errors? Perhaps I am interpreting the ulimit output incorrectly.
      2. Any suggestions for system metrics we might monitor to detect and prevent this sort of problem going forward?
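To make the second question concrete, here is a rough sketch of the kind of check I have in mind: compare current usage against each known limit and flag anything close to exhaustion. The function names, the 0.8 threshold, and the choice of limits are all assumptions for illustration. Which limit actually matters here is exactly what I am unsure about; other candidates on Linux might include kernel.pid_max and vm.max_map_count.

```python
def utilization(usage):
    """Map each limit name to current/limit.

    `usage` maps a limit name to a (current, limit) pair. Limits that are
    None (unlimited) or 0 are skipped because no ratio can be computed.
    """
    return {
        name: current / limit
        for name, (current, limit) in usage.items()
        if limit
    }


def limits_near_exhaustion(usage, threshold=0.8):
    """Return the sorted names of limits whose utilization is >= threshold."""
    ratios = utilization(usage)
    return sorted(name for name, ratio in ratios.items() if ratio >= threshold)


if __name__ == "__main__":
    # Numbers taken from this report: up to ~5K threads under normal load.
    usage = {
        "max user processes": (5000, 515188),
        "threads-max": (5000, 1030376),
    }
    print(limits_near_exhaustion(usage))
```

With the numbers above nothing comes close to the threshold, which is why the errors are puzzling. Note that threads-max is a system-wide ceiling, so comparing it to a single process's thread count only gives a lower bound on its utilization.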

      Thanks much,
      Travis

            Assignee:
            Unassigned
            Reporter:
            Travis Thieman (travis@gamechanger.io)
            Votes:
            0
            Watchers:
            5
