WiredTiger / WT-13609

EBADF returned by closedir in __directory_list_worker

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • WT11.3.1
    • Affects Version/s: None
    • Component/s: Filesystem API
    • None
    • Storage Engines

      __directory_list_worker() is a helper function that iterates through a directory and returns the directory entries, optionally filtered by a prefix.
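
      For readers who don't have the source open, here is a minimal sketch of the open/read/close pattern being described. It is not the WiredTiger code: the function name count_entries_with_prefix and its signature are made up for illustration, and it only counts matching entries rather than returning them.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not WiredTiger code: count the entries in a directory,
 * optionally keeping only names that start with prefix. Returns 0 on success
 * or the errno value from the call that failed.
 */
int
count_entries_with_prefix(const char *dir, const char *prefix, size_t *countp)
{
    DIR *dirp;
    struct dirent *dp;
    size_t count, prefix_len;
    int ret;

    assert(dir != NULL && countp != NULL);
    *countp = 0;
    count = 0;
    prefix_len = prefix == NULL ? 0 : strlen(prefix);

    /* Check the open: a NULL stream must never reach readdir or closedir. */
    if ((dirp = opendir(dir)) == NULL)
        return (errno);

    ret = 0;
    for (;;) {
        /* readdir returns NULL for both end-of-directory and error. */
        errno = 0;
        if ((dp = readdir(dirp)) == NULL) {
            ret = errno;
            break;
        }
        if (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0)
            continue;
        if (prefix_len != 0 && strncmp(dp->d_name, prefix, prefix_len) != 0)
            continue;
        ++count;
    }

    /* The close is the last step; keep the first error seen. */
    if (closedir(dirp) != 0 && ret == 0)
        ret = errno;

    if (ret == 0)
        *countp = count;
    return (ret);
}
```

      The closedir() at the end corresponds to the call that reported EBADF in the HELP tickets; everything before it would have had to succeed for that error to be the one returned.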

      We have had a small pattern of errors (3 HELP tickets over 5 years) where the closedir() call at the end of that function has returned EBADF, "Bad file descriptor".

      The code is pretty simple, and I don't see any obvious errors in it. But the fact that we have had this repeated failure suggests there could be something going wrong in WT. If it were a more generic failure (a kernel bug, or memory corruption), I would not expect failures only in this one place where we close a file descriptor.

      The "bad file descriptor" error happens when a process tries to perform a file operation (closing a directory in this case) on a file descriptor and the kernel says, "I don't recognize that file descriptor". There are a few ways it might happen:

      1. The process tries to use a file descriptor that has already been closed. 
      2. Memory corruption in the process scrambles a valid file descriptor.
      3. The process opens a file without checking for errors, then tries to use the resulting (non-existent) file descriptor from the open.
      4. Bug in the kernel.
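
      Cases 1 and 3 are easy to reproduce in isolation. The following is a small sketch using plain file descriptors; the function names and the non-existent path are made up for illustration, and it assumes /dev/null can be opened.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Case 1: close a descriptor that has already been closed. */
int
second_close_errno(void)
{
    int fd;

    fd = open("/dev/null", O_RDONLY);
    assert(fd >= 0);
    if (close(fd) != 0)
        return (-1);

    /* The kernel no longer knows this descriptor number. */
    errno = 0;
    return (close(fd) == -1 ? errno : 0);
}

/* Case 3: use the result of an open without checking that it succeeded. */
int
close_after_unchecked_open_errno(void)
{
    int fd;

    /* Hypothetical path chosen so the open fails; the -1 result is deliberately unchecked. */
    fd = open("/nonexistent-dir/nonexistent-file", O_RDONLY);

    errno = 0;
    return (close(fd) == -1 ? errno : 0);
}
```

      Both functions should return EBADF. Note that in a multithreaded process case 1 does not always fail this cleanly: if another thread has opened a file in the meantime, the descriptor number may have been reused and the second close succeeds on the wrong file.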

      So possibly:

      1. Some other thread closes the file descriptor out from under us. But if so, why do we only fail at close, and not on the preceding readdir() calls?
      2. Some other thread steps on our (stack) memory. But if so, it is again surprising that we don't see other manifestations.
      3. We shouldn't have this failure mode: we return via WT_RET_MSG if the initial opendir() call fails. And the reported failures don't include that failure message.
      4. If there is some bug in the Linux kernel (possibly only triggered because MongoDB uses such large numbers of file descriptors), why does it only affect this one scenario?
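
      As a sanity check on the first possibility, here is a single-threaded sketch of what the failure would look like if the descriptor were closed after the last read of the directory. The close(dirfd(dirp)) call is a stand-in for the hypothetical other thread, and the function name is made up. It only shows that such a window produces the reported symptom, not that this is what happened in the HELP tickets.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical reproduction: read a directory to the end, close the
 * underlying descriptor behind the stream's back, then call closedir.
 * Returns the errno from closedir, 0 if it succeeded, -1 if opendir failed.
 */
int
closedir_after_stray_close(const char *dir, size_t *entriesp)
{
    DIR *dirp;
    size_t entries;

    assert(dir != NULL && entriesp != NULL);
    if ((dirp = opendir(dir)) == NULL)
        return (-1);

    /* Every read happens while the descriptor is still valid. */
    for (entries = 0; readdir(dirp) != NULL; ++entries)
        ;
    *entriesp = entries;

    /* Stand-in for another thread closing the descriptor. */
    (void)close(dirfd(dirp));

    /* The stream still holds the stale descriptor number. */
    errno = 0;
    if (closedir(dirp) == 0)
        return (0);
    return (errno);
}
```

      Calling closedir_after_stray_close("/", &n) should return EBADF even though every readdir() call succeeded, so a stray close that lands between the last read and the closedir() would match the reported symptom.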

      I would welcome other eyes looking at this. But I'm mostly creating this ticket so that if there are future occurrences we know they are part of a larger pattern and not one-off failures.

            Assignee:
            sue.loverso@mongodb.com Susan LoVerso
            Reporter:
            keith.smith@mongodb.com Keith Smith
