Type: Improvement
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Sprint: None
Story Points: 5

~~WT-4065~~ describes the results of some ad hoc testing to determine how WT currently behaves when it runs out of storage space.

To test WT more rigorously under out-of-space failures, we need to define the expected behavior in this scenario.

UPDATE: Here is my current proposal for the guarantees WT should provide when it encounters an out-of-space failure. My goal here is to define what we think the current system can/should do today. We can then make sure we have tests to ensure we are meeting these requirements. If we want to add new/different out-of-space behavior, that will be defined in additional tickets.

ENOSPC errors should always be reported.
   1. If an ENOSPC error affects the correctness of an API operation (e.g., causing a call to fail or WT to panic), the error should be reported to the application using existing mechanisms, such as error return codes or a handler for WT_PANIC.
   2. If an ENOSPC error does not affect correctness (e.g., a failure during preallocation of a log file), it should, at a minimum, be reported as a log message.
After resolving the out-of-space condition (e.g., by freeing some storage space, growing the file system, or copying the files to someplace with more capacity):
   1. WT should be able to recover and operate correctly.
   2. All operations and transactions that successfully completed before the out-of-space condition should be present as expected after recovery. I.e., if we would expect an update to be durable after a power loss, it should be durable after an out-of-space event at the same point. Testing for this will be addressed in WT-6651.
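
To make the reporting rule concrete, here is a minimal sketch in Python of the two cases (correctness-affecting vs. benign ENOSPC). It is not WiredTiger code: `run_storage_op`, `fail_with_enospc`, and the `critical` flag are hypothetical names invented for illustration, and a raised exception stands in for WT's error return codes or panic handler.

```python
import errno
import logging

log = logging.getLogger("enospc_sketch")  # hypothetical logger name


def run_storage_op(op, critical):
    """Run op() and report ENOSPC according to the proposed policy.

    critical=True:  the failure affects the correctness of the caller's
                    operation, so the error is propagated to the application.
    critical=False: the failure is benign (e.g., log file preallocation),
                    so it is reported as a log message and work continues.
    Returns True if op() succeeded, False if a benign ENOSPC was logged.
    """
    try:
        op()
        return True
    except OSError as e:
        if e.errno != errno.ENOSPC:
            raise  # other errors are outside the scope of this sketch
        if critical:
            raise  # report to the application; never fail silently
        log.warning("ENOSPC during non-critical operation: %s", e)
        return False


def fail_with_enospc():
    """Stand-in for a write that hits a full file system."""
    raise OSError(errno.ENOSPC, "No space left on device")
```

In both branches the ENOSPC is surfaced somewhere; the only difference is whether the caller's operation fails.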

For the record, here is my original proposal, which was discussed in the first 12 comments, below:

1. ENOSPC errors should always be reported to the application that is using WiredTiger; WiredTiger should not fail with no explanation.
2. After resolving the out-of-space condition (e.g., by freeing some storage space, growing the file system, or copying the files to someplace with more capacity), WT should be able to recover and operate correctly.
3. Without resolving the out-of-space condition, WT should be able to access all of its data in read-only mode.
4. All operations and transactions that successfully completed before the out-of-space condition should be present as expected after recovery. I.e., if we would expect an update to be durable after a power loss, it should be durable after an out-of-space event at the same point.

After some informal testing, we appear to meet #1 and #2 already, and #3 in most cases (see ~~WT-4065~~). #4 is untested.
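
Since the durability guarantee is the untested one, the shape of such a test might look like the following sketch. It uses a toy in-memory store with a byte capacity in place of WiredTiger and a real file system; `TinyStore`, `run_until_full`, and the capacity numbers are all assumptions made for illustration.

```python
import errno


class TinyStore:
    """Toy append-only store with a byte capacity (stands in for a full disk)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.durable = []  # records acknowledged as committed

    def used(self):
        return sum(len(r) for r in self.durable)

    def commit(self, record):
        # Fail the whole commit if it does not fit; nothing partial is kept.
        if self.used() + len(record) > self.capacity:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.durable.append(record)

    def grow(self, extra):
        """Model resolving the out-of-space condition."""
        self.capacity += extra


def run_until_full(store, records):
    """Commit records until ENOSPC; return those acknowledged before it."""
    acked = []
    for r in records:
        try:
            store.commit(r)
        except OSError as e:
            if e.errno == errno.ENOSPC:
                break
            raise
        acked.append(r)
    return acked


store = TinyStore(capacity=10)
acked = run_until_full(store, [b"aaaa", b"bbbb", b"cccc"])
store.grow(10)        # resolve the out-of-space condition
store.commit(b"cccc")  # the store operates correctly again
# Everything acknowledged before the failure must still be present.
assert store.durable[:len(acked)] == acked
```

A real test would replace the toy store with WT running against a size-limited or fault-injected file system, but the check at the end stays the same: the set of acknowledged commits is compared against what is readable after recovery.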

is related to
WT-6646 Implement ENOSPC fault injection for WiredTiger (Backlog)
WT-4065 Review behavior when running out of disk space (Closed)
WT-6987 Create test(s) to verify that ENOSPC errors are always reported (Backlog)

related to
WT-6651 Write test to verify ACID guarantees after ENOSPC failure (Backlog)

Assignee: Keith Smith
Reporter: Keith Smith
Votes: 0
Watchers: 10

Created: Sep 02 2020 09:00:05 PM UTC
Updated: Dec 04 2020 11:34:06 PM UTC
Resolved: Dec 04 2020 11:34:06 PM UTC
