Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: WT11.3.1
Affects Version/s: None
Component/s: Filesystem API
Labels: None
Assigned Teams: Storage Engines
Sprint: None
Story Points: None

__directory_list_worker() is a helper function that iterates through a directory and returns the directory entries, optionally filtered by a prefix.
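For readers unfamiliar with the pattern, here is a minimal sketch of an open/iterate/close directory scan with an optional prefix filter. It is not the WiredTiger code: count_entries() is a hypothetical name, and it only counts matching entries rather than returning them.

```c
#include <assert.h>
#include <dirent.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * count_entries --
 *     Hypothetical helper (not WiredTiger code). Count the entries in "dir"
 *     whose names start with "prefix"; a NULL prefix matches everything.
 *     Returns 0 on success or an errno value on failure.
 */
int
count_entries(const char *dir, const char *prefix, size_t *countp)
{
    DIR *dirp;
    struct dirent *dp;
    size_t prefix_len;
    int ret;

    assert(dir != NULL && countp != NULL);
    *countp = 0;
    prefix_len = prefix == NULL ? 0 : strlen(prefix);

    /* Check the open: a NULL handle must never reach readdir or closedir. */
    if ((dirp = opendir(dir)) == NULL)
        return (errno);

    ret = 0;
    for (;;) {
        /* readdir returns NULL for both end-of-directory and error. */
        errno = 0;
        if ((dp = readdir(dirp)) == NULL) {
            ret = errno;
            break;
        }
        if (strcmp(dp->d_name, ".") == 0 || strcmp(dp->d_name, "..") == 0)
            continue;
        if (prefix_len != 0 && strncmp(dp->d_name, prefix, prefix_len) != 0)
            continue;
        ++*countp;
    }

    /* This final close is the step that reported EBADF in the tickets. */
    if (closedir(dirp) != 0 && ret == 0)
        ret = errno;
    return (ret);
}
```

The sketch keeps the first error it sees, so a closedir failure is only reported when the scan itself succeeded.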

We have had a small pattern of errors (3 HELP tickets over 5 years) where the closedir() call at the end of that function has returned EBADF, "Bad file descriptor".

The code is pretty simple, and I don't see any obvious errors in it. But the fact that we have had this repeated failure suggests there could be something going wrong in WT. If it were a more generic failure (a kernel bug, or memory corruption), I would not expect failures only in this one place where we close a file descriptor.

The "bad file descriptor" error happens when a process tries to perform a file operation (closing a directory in this case) on a file descriptor and the kernel says, "I don't recognize that file descriptor". There are a few ways it might happen:

1. The process tries to use a file descriptor that has already been closed.
2. Memory corruption in the process scrambles a valid file descriptor.
3. The process opens a file without checking for errors, then tries to use the resulting (non-existent) file descriptor from the open.
4. A bug in the kernel.
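The first case can be reproduced deliberately. The sketch below is a standalone illustration, not WiredTiger code, and closedir_after_stray_close() is a hypothetical name. It opens a directory, closes the underlying descriptor behind the DIR handle's back, and then reports what closedir() returns. With common libc implementations (glibc, musl), closedir() simply closes the stored descriptor, so the expected result is EBADF.

```c
#define _POSIX_C_SOURCE 200809L /* for dirfd() under strict compiler modes */

#include <dirent.h>
#include <errno.h>
#include <unistd.h>

/*
 * closedir_after_stray_close --
 *     Hypothetical demonstration. Returns the errno reported by closedir()
 *     after the descriptor was closed underneath it, 0 if closedir()
 *     unexpectedly succeeded, or -1 if the setup itself failed.
 */
int
closedir_after_stray_close(const char *dir)
{
    DIR *dirp;

    if ((dirp = opendir(dir)) == NULL)
        return (-1);

    /* Simulate another code path closing our descriptor. */
    if (close(dirfd(dirp)) != 0) {
        (void)closedir(dirp);
        return (-1);
    }

    /* The handle still looks valid to us, but its descriptor is gone. */
    errno = 0;
    if (closedir(dirp) != 0)
        return (errno);
    return (0);
}
```

In a multithreaded process the stray close could come from any thread, and the descriptor number might even be reused by an unrelated open before closedir() runs, in which case no error would be reported at all.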

So possibly:

1. Some other thread closes the file descriptor out from under us. But if so, why do we only fail at close, and not on the preceding reader calls?
2. Some other thread steps on our (stack) memory. But if so, it is again surprising that we don't see other manifestations.
3. We shouldn't have this failure mode: we return via WT_RET_MSG if the initial opener call fails, and the reported failures don't include that failure message.
4. If there is some bug in the Linux kernel, possibly only triggered because MongoDB uses such large numbers of file descriptors, why does it only affect this one scenario?
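If this recurs, one way to narrow down when the descriptor goes bad would be to probe it at points during the scan. The sketch below is only a suggestion under that assumption; fd_is_open() and dir_fd_is_open() are hypothetical names, not existing WiredTiger functions. It relies on fcntl(fd, F_GETFD) failing with EBADF when the descriptor is not open.

```c
#define _POSIX_C_SOURCE 200809L /* for dirfd() under strict compiler modes */

#include <dirent.h>
#include <errno.h>
#include <fcntl.h>

/*
 * fd_is_open --
 *     Hypothetical diagnostic. Returns 1 if the kernel recognizes fd, 0 if
 *     it reports EBADF, and -1 on any other error.
 */
int
fd_is_open(int fd)
{
    if (fcntl(fd, F_GETFD) != -1)
        return (1);
    return (errno == EBADF ? 0 : -1);
}

/*
 * dir_fd_is_open --
 *     Hypothetical diagnostic. Apply the same probe to the descriptor
 *     behind an open directory handle.
 */
int
dir_fd_is_open(DIR *dirp)
{
    return (fd_is_open(dirfd(dirp)));
}
```

Logging the descriptor number and the probe result right after the open, after the last read, and just before the close would help distinguish a descriptor that was closed early from a handle whose stored descriptor value was overwritten.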

I would welcome other eyes looking at this. But I'm mostly creating this ticket so that if there are future occurrences we know they are part of a larger pattern and not one-off failures.

13609.diff
3 kB
Oct 08 2024 07:13:31 PM UTC

related to

WT-13657 failed: unit-test-macos on macos-13-arm64 [wiredtiger @ 901d4334]

Closed

Assignee: Susan LoVerso (Inactive)
Reporter: Keith Smith
Votes: 0
Watchers: 10

Created: Oct 03 2024 06:00:03 PM UTC
Updated: Nov 14 2024 12:15:40 AM UTC
Resolved: Nov 12 2024 05:16:41 PM UTC
