WiredTiger / WT-6645

Ensure correct WT behavior when running out of storage space

    • Type: Improvement
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • 5

      WT-4065 describes the results of some ad hoc testing to determine how WT currently behaves when it runs out of storage space.

      To test WT more rigorously under out-of-space failures, we need to define the expected behavior in this scenario.

      UPDATE: Here is my current proposal for the guarantees WT should provide when it encounters an out-of-space failure. My goal is to define what we think the current system can and should do today, so that we can then make sure we have tests confirming that we meet these requirements. Any new or different out-of-space behavior will be defined in additional tickets.

      1. ENOSPC errors should always be reported.
        1. If an ENOSPC error affects the correctness of an API operation (e.g., causing a call to fail or WT to panic), the error should be reported to the application using existing mechanisms, such as error return codes or a handler for WT_PANIC.
        2. If an ENOSPC error does not affect correctness (e.g., a failure during preallocation of a log file), it should, at a minimum, be reported as a log message.
      2. After resolving the out-of-space condition (e.g., by freeing some storage space, growing the file system, or copying the files to someplace with more capacity):
        1. WT should be able to recover and operate correctly.
        2. All operations and transactions that successfully completed before the out-of-space condition should be present as expected after recovery. I.e., if we would expect an update to be durable after a power loss, it should be durable after an out-of-space event at the same point. Testing for this will be addressed in WT-6651.
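
The guarantees above can be sketched as a small, self-contained check. This is only an illustration of the intended properties, not WiredTiger code: `ToyStore`, `check_out_of_space`, the byte-count capacity model, and the logger name are all hypothetical, and a real test would drive WT against a size-limited file system instead.

```python
import errno
import logging

logger = logging.getLogger("toy_store")  # hypothetical logger name


class ToyStore:
    """Hypothetical stand-in for a storage engine on a size-limited volume."""

    def __init__(self, capacity):
        self.capacity = capacity  # bytes available on the simulated volume
        self.used = 0
        self.durable = []  # records that reached "disk", in commit order

    def commit(self, key, value):
        size = len(key) + len(value)
        if self.used + size > self.capacity:
            # 1.1: the failure affects the call, so it is surfaced to the caller.
            raise OSError(errno.ENOSPC, "No space left on device")
        self.durable.append((key, value))
        self.used += size

    def preallocate(self, size):
        # Optional optimization; nothing is actually reserved in this toy model.
        if self.used + size > self.capacity:
            # 1.2: correctness is unaffected, so a log message is the minimum.
            logger.warning("preallocation skipped: %s",
                           errno.errorcode[errno.ENOSPC])
            return False
        return True

    def grow(self, extra):
        # Models resolving the condition, e.g. by growing the file system.
        self.capacity += extra

    def recover(self):
        # Rebuild the visible state from the durable records only.
        return dict(self.durable)


def check_out_of_space(updates, capacity, extra=1024):
    """Commit updates until ENOSPC, resolve it, then recover."""
    store = ToyStore(capacity)
    acknowledged = {}
    reported = False
    for key, value in updates:
        try:
            store.commit(key, value)
        except OSError as exc:
            if exc.errno != errno.ENOSPC:
                raise
            reported = True  # guarantee 1: the error was reported
            break
        acknowledged[key] = value
    store.grow(extra)
    recovered = store.recover()  # guarantee 2.2: acknowledged updates survive
    store.commit("post", "ok")   # guarantee 2.1: the store operates again
    return reported, acknowledged, recovered
```

For example, `check_out_of_space([("a", "1111"), ("b", "2222"), ("c", "3333")], capacity=10)` fails on the third commit, and the recovered state holds exactly the two acknowledged updates.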

      For the record, here is my original proposal, which was discussed in the first 12 comments, below:

      1. ENOSPC errors should always be reported to the application that is using WiredTiger; WiredTiger should not fail with no explanation. 
      2. After resolving the out-of-space condition (e.g., by freeing some storage space, growing the file system, or copying the files to someplace with more capacity), WT should be able to recover and operate correctly.
      3. Without resolving the out-of-space condition, WT should be able to access all of its data in read-only mode.  
      4. All operations and transactions that successfully completed before the out-of-space condition should be present as expected after recovery.  I.e., if we would expect an update to be durable after a power loss, it should be durable after an out-of-space event at the same point.

      After some informal testing, we appear to meet #1 and #2 already, and #3 in most cases (see WT-4065).  #4 is untested.

            Assignee:
            keith.smith@mongodb.com Keith Smith
            Reporter:
            keith.smith@mongodb.com Keith Smith