  1. WiredTiger
  2. WT-9325

Use fsync() instead of fdatasync() on BSD

    • 3
    • StorEng - Defined Pipeline

      When flushing a file to disk on a POSIX system, WiredTiger calls either fdatasync() or fsync(), depending on whether the HAVE_FDATASYNC compile-time flag was set. The difference between these calls is that fsync() flushes both file content and file metadata, such as update time and file size, while fdatasync() flushes just the file data. On some systems, such as Linux, fdatasync() will also flush the file size if it has changed. But other systems, including some (all?) BSD variants, don't flush the file size with fdatasync().
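
      To make the compile-time choice concrete, here is a minimal sketch of how such a selection could look. The names sync_file_data() and write_and_sync() are hypothetical and are not WiredTiger's actual functions; only HAVE_FDATASYNC, fdatasync() and fsync() come from this ticket.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * sync_file_data --
 *     Hypothetical helper (not WiredTiger code): flush a file descriptor, picking the system call
 *     at compile time based on HAVE_FDATASYNC.
 */
static int
sync_file_data(int fd)
{
    assert(fd >= 0);
#ifdef HAVE_FDATASYNC
    /* Flushes file data; whether a changed file size is flushed too depends on the OS. */
    return (fdatasync(fd));
#else
    /* Flushes file data and file metadata, including size and timestamps. */
    return (fsync(fd));
#endif
}

/*
 * write_and_sync --
 *     Hypothetical usage: replace a file's contents, then flush. Returns 0 on success, -1 on error.
 */
static int
write_and_sync(const char *path, const void *buf, size_t len)
{
    int fd, ret;

    if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
        return (-1);
    ret = -1;
    if (write(fd, buf, len) == (ssize_t)len)
        ret = sync_file_data(fd);
    if (close(fd) != 0)
        ret = -1;
    return (ret);
}
```

      With HAVE_FDATASYNC defined, the write in write_and_sync() changes the file size, which is exactly the case where the two calls can behave differently across systems.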

      This means that if a developer configures WT with HAVE_FDATASYNC on a BSD system, we might update the turtle file and sync it, but fail to flush the new file size to disk, leaving us with a zero-size file after a poorly timed failure.

      I can't prove that this could actually happen. But given the differences in operating system and file system implementations, I can't convince myself that it isn't a risk. So it seems safest to always use fsync() when updating the turtle file.
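
      One possible shape for this proposal, as a sketch only: let the caller say whether the file size may have changed, and fall back to fsync() whenever it may. The function names and the size_may_change parameter are assumptions made for illustration, not an existing WiredTiger interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * sync_file --
 *     Hypothetical helper: use fdatasync() only when the build enables it and the caller knows the
 *     file size has not changed; otherwise use fsync() so the size is flushed as well.
 */
static int
sync_file(int fd, bool size_may_change)
{
#ifdef HAVE_FDATASYNC
    if (!size_may_change)
        return (fdatasync(fd));
#endif
    return (fsync(fd));
}

/*
 * rewrite_and_sync --
 *     Hypothetical usage for a small file that is rewritten in full, as the turtle file is
 *     described here. Returns the file size after the sync, or -1 on error.
 */
static off_t
rewrite_and_sync(const char *path, const void *buf, size_t len)
{
    struct stat sb;
    off_t size;
    int fd;

    if ((fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644)) < 0)
        return (-1);
    size = -1;
    /* The rewrite changes the file size, so request the metadata-flushing path. */
    if (write(fd, buf, len) == (ssize_t)len && sync_file(fd, true) == 0 && fstat(fd, &sb) == 0)
        size = sb.st_size;
    if (close(fd) != 0)
        size = -1;
    return (size);
}
```

      Callers that only overwrite data inside an already-sized file could pass false and keep the cheaper call where the build allows it.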

      Additional background.

      Here's what the Linux man page says about fdatasync:

      fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see inode(7)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush.

      Here's the corresponding information from FreeBSD:

      fdatasync does the same as fsync but only flushes user data, not the meta data like the mtime or atime.

      I only call out the turtle file here because we typically update other files (logs and tables) by using fallocate() to extend them, which should update the on-disk file size. After that, we don't care whether fdatasync() updates the file size, because the size shouldn't change.
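
      The extend-then-write pattern described above can be sketched as follows. This is an illustration under assumptions: it uses posix_fallocate() as a portable stand-in for fallocate(), and extend_then_write() and file_size() are hypothetical names, not WiredTiger functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * extend_then_write --
 *     Hypothetical helper: preallocate a file to its final size, then write inside the
 *     preallocated region and flush. Returns 0 on success, -1 on error.
 */
static int
extend_then_write(const char *path, off_t prealloc, const void *buf, size_t len)
{
    int fd, ret;

    /* The write must fit inside the preallocated region so it cannot grow the file. */
    assert(len <= (size_t)prealloc);

    if ((fd = open(path, O_RDWR | O_CREAT, 0644)) < 0)
        return (-1);
    ret = -1;

    /* Step 1: extend the file. posix_fallocate() returns an error number, not -1. */
    if (posix_fallocate(fd, 0, prealloc) != 0)
        goto done;

    /* Step 2: overwrite data within the existing size; the file size does not change. */
    if (pwrite(fd, buf, len, 0) != (ssize_t)len)
        goto done;

    /* Step 3: only data changed in step 2, so a data-only flush is the case the ticket tolerates. */
#ifdef HAVE_FDATASYNC
    ret = fdatasync(fd);
#else
    ret = fsync(fd);
#endif

done:
    (void)close(fd);
    return (ret);
}

/* file_size -- Hypothetical helper: return the size of a file in bytes, or -1 on error. */
static off_t
file_size(const char *path)
{
    struct stat sb;

    return (stat(path, &sb) == 0 ? sb.st_size : -1);
}
```

      In this sketch the file size is fixed by the preallocation step, so the later write and flush never depend on fdatasync() persisting a size change.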

      Note that macOS uses a different mechanism to implement a reliable data sync, so none of this applies there.

            monica.ng@mongodb.com Monica Ng
            keith.smith@mongodb.com Keith Smith