Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Done
Priority: Critical - P2
Fix Version/s: 2.5.5
Affects Version/s: None
Component/s: Storage
Labels:
None

Backwards Compatibility:
Major Change
Operating System:
ALL
Case:
Confidence Status:
None
Work Order:
0
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

On Linux, storage errors, specifically failed writes, can result in unnecessary and (mostly) unsignaled data loss and database corruption. This is due to a combination of two factors:

The Linux kernel does not retry failed page writes, even on subsequent msyncs, unless the application re-dirties the pages (see below for details).
After a failed msync mongod clears the journal, logs a slightly obscure message, and continues operation with a now corrupted database.

The net result is that after a storage write error a customer may continue operating for some time with data loss and a corrupted database without being aware of it, until the data in question is next acccessed, at which point they may see an error, typically a bson object size error.

Possible mongod improvements:

Transient errors: on msync failure mongod could
- not clear the journal until a successful msync
- replay the journal to re-dirty the pages (and restore data possibly lost from mapped memory) which will force the kernel to retry the writes on a subsequent msync
Permanent errors: on msync failure mongod could
- not clear the journal
- issue a message warning of likely data loss and database corruption
- fassert to shut down, alerting the user to the issue and allowing them to take corrective action sooner rather than later
The error return from msync does not distinguish between transient and permanent failures, so a hybrid heuristic could be adopted: retry the msync as in the first option above for some (configurable) period of time or number of msync attempts, then treat the error as permanent and fassert as in the second option above.

KERNEL BEHAVIOR ON FAILED WRITES

The following behavior was observed by doing SCSI error injects using systemtap on Linux (CentOS 6.5, 2.6.32-431.el6.centos.plus.x86_64 kernel).

create and zero a 16 MB file using dd, sync it to disk
then in parallel do:
- thread 1
  - simple standalone C program:
    - open and mmap the file
    - write some portion of file with non-zero data via the mmap
    - msync the file, watch return code
    - msync file a second time, watch return code
  - then remount fs to make sure kernel buffers are all flushed to disk
- thread 2 - systemtap script that
  - prints SCSI write commands issued against the file
  - errors out exactly one of the commands
then hexdump file to observe content

Observed that

whether the SCSI error happens during the first msync or before it (page flushed due to memory pressure, triggered by increasing file size from 16 MB to 1 GB) the first msync reports i/o error, meaning we are guaranteed to see an error even if it occurred before the msync
the second msync however does does not reattempt the failed page write and does not report i/o error
in fact the failed page write is never reattempted at any time up through the remount
hexdump after the remount shows the data never made it to disk, confirming the data loss

On the other hand, if the C program is modified to re-dirty the pages before the second msync by rewriting them, the second msync triggers a new attempt to write the previously failed pages, and examining the data for the initially failed pages does make it to disk on the second attempt.

(Similar behavior was also reproduced by using mongod and two fsync db commands, observing msync error log message for first but not second fsync, no reattempt of failed page writes, and observing corrupted db file content by inspection.)

LOGGING

Mongod log message is as follows (can be on different threads depending on how msync is triggered):

[DataFileSync]   msync errno:5 Input/output error

Suggest improving, at least in non-recoverable case, to indicate likely data loss and database corruption due to storage errors, and need for corrective action.

Kernel logs messages such as the following (details depend on type of error) to syslog and dmesg:

sd 0:0:2:0: timing out command, waited 1080s
sd 0:0:2:0: SCSI error: return code = 0x06000008
sd 0:0:0:0: [sda] Unhandled error code
sd 0:0:0:0: [sda]  Result: hostbyte=DID_OK driverbyte=DRIVER_TIMEOUT
sd 0:0:0:0: [sda] CDB: Write(10): 2a 00 00 48 f2 50 00 00 10 00
end_request: I/O error, dev sdc, sector 6201448
Buffer I/O error on device dm-2, logical block 601268
lost page write due to I/O error on dm-2

Assignee:: Eliot Horowitz (Inactive)
Reporter:: Bruce Lucas (Inactive)
Participants:: Bruce Lucas, Eliot Horowitz, Eric Milkie, Githook User
Votes:: 0 Vote for this issue
Watchers:: 19 Start watching this issue

Created:: Jan 06 2014 02:47:11 PM UTC
Updated:: Sep 21 2017 04:52:40 PM UTC
Resolved:: Jan 09 2014 07:38:50 PM UTC

Details

Description

Attachments

Forms

Activity

People

Dates