-
Type: Bug
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: Filesystem API
-
None
-
Storage Engines
__directory_list_worker() is a helper function that iterates through a directory and returns the directory entries, optionally filtered by a prefix.
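For readers without the source open, here is a minimal sketch of the general shape of such a helper, written against plain POSIX. It is not the WiredTiger implementation: the name list_dir_with_prefix, its signature, and its error handling are hypothetical and chosen only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * list_dir_with_prefix --
 *     Hypothetical stand-in for the helper described above (not WiredTiger code). Returns 0 on
 *     success, else an errno value. On success the caller owns *listp and its strings.
 */
static int
list_dir_with_prefix(const char *directory, const char *prefix, char ***listp, size_t *countp)
{
    DIR *dirp;
    struct dirent *dp;
    char **list = NULL, **tmp;
    size_t count = 0, i;
    int ret = 0;

    assert(directory != NULL && listp != NULL && countp != NULL);
    *listp = NULL;
    *countp = 0;

    /* Check the open: on failure there is no stream to read or close. */
    if ((dirp = opendir(directory)) == NULL)
        return (errno);

    for (;;) {
        errno = 0;
        if ((dp = readdir(dirp)) == NULL) {
            ret = errno; /* 0 at end of directory, otherwise a read error */
            break;
        }
        if (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0)
            continue;
        /* Optional prefix filter. */
        if (prefix != NULL && strncmp(dp->d_name, prefix, strlen(prefix)) != 0)
            continue;
        if ((tmp = realloc(list, (count + 1) * sizeof(*list))) == NULL) {
            ret = ENOMEM;
            break;
        }
        list = tmp;
        if ((list[count] = strdup(dp->d_name)) == NULL) {
            ret = ENOMEM;
            break;
        }
        ++count;
    }

    /* The equivalent of this call is where EBADF has been reported. */
    if (closedir(dirp) != 0 && ret == 0)
        ret = errno;

    if (ret != 0) {
        for (i = 0; i < count; ++i)
            free(list[i]);
        free(list);
        return (ret);
    }
    *listp = list;
    *countp = count;
    return (0);
}
```

In this sketch the directory stream is local to the function and is touched only by opendir(), readdir(), and closedir().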
We have had a small pattern of errors (3 HELP tickets over 5 years) where the closedir() call at the end of that function has returned EBADF, "Bad file descriptor".
The code is pretty simple, and I don't see any obvious errors in it. But the fact that we have had this repeated failure suggests there could be something going wrong in WT. If it were a more generic failure, such as a kernel bug or memory corruption, I would not expect failures only in this one place where we close a file descriptor.
The "bad file descriptor" error happens when a process tries to perform a file operation (closing a directory in this case) on a file descriptor and the kernel says, "I don't recognize that file descriptor". There are a few ways it might happen:
- The process tries to use a file descriptor that has already been closed.
- Memory corruption in the process scrambles a valid file descriptor.
- The process opens a file, doesn't check for errors, and then tries to use the resulting (non-existent) file descriptor from the open.
- A bug in the kernel.
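The first and third cases are easy to reproduce in isolation. The sketch below is purely illustrative: both function names are hypothetical, and the first one relies on closedir() passing the close() failure through to the caller, which is what glibc does but which POSIX lists only as an optional error.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * closedir_after_stray_close --
 *     Hypothetical demo: open a directory stream, close its underlying descriptor behind the
 *     stream's back, then return the errno reported by closedir() (0 if it succeeded).
 */
static int
closedir_after_stray_close(const char *directory)
{
    DIR *dirp;

    assert(directory != NULL);
    if ((dirp = opendir(directory)) == NULL)
        return (errno);

    /* Simulates some other code path closing the descriptor out from under us. */
    (void)close(dirfd(dirp));

    errno = 0;
    if (closedir(dirp) != 0)
        return (errno);
    return (0);
}

/*
 * close_unchecked_open --
 *     Hypothetical demo: open a path without checking the result, then close whatever came
 *     back. Returns the errno reported by close() (0 if it succeeded).
 */
static int
close_unchecked_open(const char *path)
{
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY); /* deliberately unchecked; -1 if the open failed */

    errno = 0;
    if (close(fd) != 0)
        return (errno);
    return (0);
}
```

On Linux with glibc, calling closedir_after_stray_close(".") or close_unchecked_open() on a path that does not exist should report EBADF.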
So possibly:
- Some other thread closes the file descriptor out from under us. But if so, why do we only fail at close, and not on the preceding readdir() calls?
- Some other thread steps on our (stack) memory. But if so, it is again surprising that we don't see other manifestations.
- An unchecked open. We shouldn't have this failure mode: we return via WT_RET_MSG if the initial opendir() call fails, and the reported failures don't include that failure message.
- If there is some bug in the Linux kernel, possibly only triggered because MongoDB uses such large numbers of file descriptors, why does it only affect this one scenario?
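If a future occurrence needs narrowing down, one possible diagnostic is to probe the descriptor just before closedir() and log the result. The sketch below is only a suggestion and not existing WiredTiger code; probe_dir_fd is a hypothetical name. It separates "descriptor already invalid" from "descriptor number now refers to something that is not a directory", which could help tell a stray close apart from descriptor reuse.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * probe_dir_fd --
 *     Hypothetical diagnostic: return 0 if the stream's descriptor is still open and still
 *     refers to a directory, EBADF if the descriptor is no longer valid, ENOTDIR if the
 *     descriptor number now refers to something else, or another errno value on failure.
 */
static int
probe_dir_fd(DIR *dirp)
{
    struct stat sb;
    int fd;

    assert(dirp != NULL);
    if ((fd = dirfd(dirp)) < 0)
        return (errno);

    /* F_GETFD has no side effects; it fails with EBADF if fd is not open. */
    if (fcntl(fd, F_GETFD) == -1)
        return (errno);

    /* An open descriptor that is not a directory means the number was reused. */
    if (fstat(fd, &sb) != 0)
        return (errno);
    if (!S_ISDIR(sb.st_mode))
        return (ENOTDIR);
    return (0);
}
```

A probe like this is racy by nature, since another thread can close or reuse the descriptor between the probe and the closedir() call, so a clean result would not rule out the first hypothesis.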
I would welcome other eyes looking at this. But I'm mostly creating this ticket so that if there are future occurrences we know they are part of a larger pattern and not one-off failures.
- related to
-
WT-13657 failed: unit-test-macos on macos-13-arm64 [wiredtiger @ 901d4334]
- Closed