Core Server / SERVER-26131

MongoDB, XFS, and SSDs

    • Type: Question
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.6.12, 3.0.12, 3.2.9
    • Component/s: Performance, Storage
    • Labels: None

      We have run into an issue with XFS’s FITRIM ioctl implementation (see: https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_discard.c#L155), used by the fstrim command (see: https://github.com/karelzak/util-linux/blob/master/sys-utils/fstrim.c#L87), when running against local SSDs; it is severely impacting IO in general and MongoDB specifically.

      Essentially, XFS is iterating over every allocation group and issuing TRIMs for all free extents every time this ioctl is called. This, coupled with the fact that Linux’s interface to the TRIM command is both synchronous and does not support a vectorized list of ranges (see: https://github.com/torvalds/linux/blob/3fc9d690936fb2e20e180710965ba2cc3a0881f8/block/blk-lib.c#L112), leads to a large number of extraneous TRIM commands (each of which has been observed to be slow; see: http://oss.sgi.com/archives/xfs/2011-12/msg00311.html) being issued to the disk for ranges that both the filesystem and the disk already know to be free. In practice, we have seen IO disruptions of up to 2 minutes. I realize that the duration of these disruptions may be controller-dependent; unfortunately, when running on a platform like AWS, one does not have the luxury of choosing specific hardware.
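
      For context, the fstrim command boils down to a single FITRIM ioctl issued against the mountpoint with a range covering the whole filesystem; everything after that point (on XFS, the walk over every allocation group) happens inside the kernel. Below is a minimal sketch of that call using the standard struct fstrim_range / FITRIM interface from linux/fs.h; the /data mountpoint is just a placeholder and error handling is pared down.

      /* Minimal sketch of what fstrim(8) does: one FITRIM ioctl spanning the
       * whole filesystem. On XFS this single call walks every allocation group
       * and discards every free extent, regardless of what changed since the
       * last run. */
      #include <stdio.h>
      #include <stdint.h>
      #include <string.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <linux/fs.h>   /* FITRIM, struct fstrim_range */

      int main(int argc, char **argv)
      {
          const char *mountpoint = argc > 1 ? argv[1] : "/data";  /* placeholder */
          struct fstrim_range range;

          memset(&range, 0, sizeof(range));
          range.start  = 0;
          range.len    = UINT64_MAX;  /* cover the entire filesystem */
          range.minlen = 0;           /* no minimum extent size */

          int fd = open(mountpoint, O_RDONLY);
          if (fd < 0) {
              perror("open");
              return 1;
          }
          if (ioctl(fd, FITRIM, &range) < 0) {
              perror("ioctl(FITRIM)");
              close(fd);
              return 1;
          }
          /* On return the kernel updates range.len to the number of bytes trimmed. */
          printf("%llu bytes trimmed\n", (unsigned long long)range.len);
          close(fd);
          return 0;
      }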

      EXT4, on the other hand, tracks blocks that have been deleted since the previous FITRIM ioctl and targets subsequent TRIMs at only the appropriate block ranges (see: http://blog.taz.net.au/2012/01/07/fstrim-and-xfs/). In real-world tests this significantly reduces the impact of fstrim, to the point that it is unnoticeable to the database/application. We are currently switching back to EXT4 as a result.

      Alternatively, we could mount the filesystem with the discard option (as AWS suggests here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html); however, our confidence in this performing better is not high, given XFS developer comments on the subject (see: http://oss.sgi.com/archives/xfs/2014-08/msg00465.html):

      It was introduced into XFS as a checkbox feature. We resisted as
      long as we could, but too many people were shouting at us that we
      needed realtime discard because ext4 and btrfs had it. Of course,
      all those people shouting for it realised that we were right in that
      it sucked the moment they tried to use it and found that performance
      was woeful. Not to mention that SSD trim implementations were so bad
      that they caused random data corruption by trimming the wrong
      regions, drives would simply hang randomly and in a couple of cases
      too many trims too fast would brick them...

      So, yeah, it was implemented because lots of people demanded it, not
      because it was a good idea.

      I am aware that MongoDB strongly recommends using XFS (see: https://docs.mongodb.com/manual/administration/production-notes/#kernel-and-file-systems) and that this is because EXT4 journaling could impact WiredTiger checkpointing under heavy write load (https://groups.google.com/forum/#!msg/mongodb-user/diGdooN_2Sw/4H7t5JTDcpAJ). Can you elaborate on this? Is this the only concern that drove the strong recommendation to go with XFS, and, in MongoDB’s opinion, is it still valid given the performance issues with TRIM on Linux when running XFS on SSDs? We are currently running the MMAPv1 storage engine on MongoDB 2.6 and, as mentioned above, we have reverted to EXT4 without apparent consequence. Any more information that you could provide would really help us in weighing the pros and cons as we work toward WiredTiger.

      Also, any more general recommendations for mitigating the disruption incurred by running fstrim would be more than welcome.
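
      For reference, one mitigation along these lines would be to trim the filesystem in chunks rather than all at once: a series of range-limited FITRIM calls (the same start/len fields that fstrim exposes as --offset/--length), separated by a pause, so that no single synchronous ioctl can stall IO for minutes. A rough sketch follows; the chunk size, pause, and filesystem size are illustrative only, and how tightly this actually bounds the per-call stall depends on how the filesystem maps the requested range onto its allocation groups.

      /* Sketch: chunked trimming. Instead of one whole-filesystem FITRIM,
       * issue a series of range-limited FITRIM calls with a pause in between,
       * so each synchronous ioctl bounds the IO disruption. Equivalent to
       * looping fstrim --offset/--length from a script. Sizes are illustrative. */
      #include <stdio.h>
      #include <stdint.h>
      #include <string.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sys/ioctl.h>
      #include <linux/fs.h>   /* FITRIM, struct fstrim_range */

      int main(int argc, char **argv)
      {
          const char *mountpoint = argc > 1 ? argv[1] : "/data";  /* placeholder */
          const uint64_t fs_size = 800ULL << 30;  /* assumed filesystem size: 800 GiB */
          const uint64_t chunk   = 32ULL << 30;   /* trim 32 GiB per FITRIM call */

          int fd = open(mountpoint, O_RDONLY);
          if (fd < 0) {
              perror("open");
              return 1;
          }

          for (uint64_t off = 0; off < fs_size; off += chunk) {
              struct fstrim_range range;
              memset(&range, 0, sizeof(range));
              range.start  = off;
              range.len    = chunk;
              range.minlen = 0;

              if (ioctl(fd, FITRIM, &range) < 0) {
                  perror("ioctl(FITRIM)");
                  break;
              }
              printf("offset %llu GiB: %llu bytes trimmed\n",
                     (unsigned long long)(off >> 30),
                     (unsigned long long)range.len);
              sleep(5);  /* give the device and the database a chance to drain IO */
          }

          close(fd);
          return 0;
      }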

            Assignee:
            Unassigned
            Reporter:
            Gregory Banks (gregbanks)
            Votes:
            0
            Watchers:
            11
