  1. Core Server
  2. SERVER-31081

Too many open file descriptors not handled gracefully

    Details

    • Type: Bug
    • Status: Investigating
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Internal Code
    • Labels:
      None
    • Operating System:
      ALL
    • Sprint:
      Service Arch 2019-11-04, Service Arch 2019-11-18, Service Arch 2019-12-02, Service Arch 2019-12-16, Service Arch 2019-12-30, Service Arch 2020-01-13, Service Arch 2020-01-27

      Description

      File descriptors are used for multiple purposes, largely socket connections and WiredTiger tables, but also journal files, FTDC files, log files, and so on. When mongod runs out of file descriptors by hitting RLIMIT_NOFILE, multiple internal failures can occur:

      2017-09-13T14:05:25.125-0400 I NETWORK  [thread1] Listener: accept() returns -1 Too many open files
      2017-09-13T14:05:25.125-0400 E NETWORK  [thread1] Out of file descriptors. Waiting one second before trying to accept more connections.
      ...
      2017-09-13T14:05:26.000-0400 I -        [ftdc] Assertion: 13538:couldn't open [/proc/12839/stat] Too many open files src/mongo/util/processinfo_linux.cpp 74
      2017-09-13T14:05:26.000-0400 W FTDC     [ftdc] Uncaught exception in 'Location13538: couldn't open [/proc/12839/stat] Too many open files' in full-time diagnostic data capture subsystem. Shutting down the full-time diagnostic data capture subsystem.
      ...
      2017-09-13T14:05:30.427-0400 E STORAGE  [conn5] WiredTiger error (24) [1505325930:427737][12839:0x7fc987931700], WT_SESSION.commit_transaction: /ssd/db/./r0/journal: directory-list: opendir: Too many open files
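
The "(24)" in the WiredTiger line matches the errno that Linux reports for "Too many open files" (EMFILE). As a rough way to reproduce the same failure outside mongod, the sketch below lowers the soft RLIMIT_NOFILE for the current process and opens /dev/null until the kernel refuses. The function name `exhaust_descriptors` and the limit of 64 are illustrative assumptions, not anything taken from the server code.

```python
import errno
import os
import resource


def exhaust_descriptors(soft_limit=64):
    """Lower the soft RLIMIT_NOFILE, open files until open() fails, return the errno.

    Hypothetical helper for illustration only; it restores the original limit.
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit may not exceed the hard limit, so clamp it if needed.
    if old_hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, old_hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, old_hard))
    held = []
    try:
        while True:
            # Each successful open consumes one descriptor.
            held.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        return exc.errno
    finally:
        # Release everything and put the original limit back.
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))


if __name__ == "__main__":
    code = exhaust_descriptors()
    print(code, errno.errorcode.get(code))
```

Once the limit is reached, every subsystem that needs a new descriptor fails in its own way, which is consistent with the mix of NETWORK, FTDC, and STORAGE errors in the log above.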
      

      After this occurs, the node is either not operational or operating in a degraded state, but no failover has occurred. We should either:

      • bound use of file descriptors for new connections and new tables by refusing new connections and failing new table creation, or
      • detect this condition reliably and abort to trigger failover
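
A minimal sketch of what the second option (detection) might look like, assuming a Linux host where /proc/self/fd lists the open descriptors of the current process. The names `fd_headroom` and `should_abort` and the reserve of 16 descriptors are hypothetical; they are not taken from the server code.

```python
import os
import resource


def fd_headroom(fd_dir="/proc/self/fd"):
    """Return (descriptors in use, soft RLIMIT_NOFILE) for this process.

    Assumption: Linux procfs is available. Listing the directory itself
    needs a descriptor, so the count includes that temporary one and the
    call can fail with EMFILE once the limit has already been reached.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    in_use = len(os.listdir(fd_dir))
    return in_use, soft


def should_abort(in_use, soft, reserve=16):
    """True when fewer than `reserve` descriptors remain under the soft limit."""
    if soft == resource.RLIM_INFINITY:
        return False  # no finite limit to run into
    return soft - in_use < reserve


if __name__ == "__main__":
    used, limit = fd_headroom()
    print(used, limit, should_abort(used, limit))
```

Because the check needs a descriptor of its own, a reliable detector would probably have to hold one in reserve ahead of time or treat an EMFILE from the check as the signal. The journal `opendir` failure in the log above is the same kind of problem.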

      Note: new connections are currently limited to 80% of RLIMIT_NOFILE, which only partially mitigates this issue because the use of the remaining file descriptors for WT tables is not bounded.
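
To make the arithmetic in the note concrete, the sketch below splits a given RLIMIT_NOFILE into the 80% connection cap and the remainder. The function name `connection_budget` and the integer rounding are assumptions for illustration; the exact rounding mongod applies is not stated in this ticket.

```python
def connection_budget(rlimit_nofile, percent=80):
    """Split a descriptor limit into (connection cap, remainder).

    Hypothetical helper: uses integer arithmetic and rounds the cap down.
    """
    max_conns = rlimit_nofile * percent // 100
    remainder = rlimit_nofile - max_conns
    return max_conns, remainder


def fits(rlimit_nofile, connections, other_descriptors):
    """True if connections plus all other descriptors stay within the limit."""
    return connections + other_descriptors <= rlimit_nofile


if __name__ == "__main__":
    cap, rest = connection_budget(1024)
    print(cap, rest)
    # Nothing caps `other_descriptors`, so the total can still exceed the limit
    # even when connections stay at or below the cap.
    print(fits(1024, cap, rest + 1))
```

With a limit of 1024 this gives a cap of 819 connections and 205 descriptors for everything else. Since table, journal, FTDC, and log file usage is not limited to that remainder, the process can still reach the limit while the connection count is at or below the cap.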

              People

              • Votes: 26
              • Watchers: 29
