[SERVER-26131] MongoDB, XFS, and SSDs Created: 15/Sep/16  Updated: 20/Sep/16  Resolved: 16/Sep/16

Status: Closed
Project: Core Server
Component/s: Performance, Storage
Affects Version/s: 2.6.12, 3.0.12, 3.2.9
Fix Version/s: None

Type: Question
Priority: Major - P3
Reporter: Gregory Banks
Assignee: Unassigned
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

We have run into an issue with XFS’s FITRIM ioctl implementation (see: https://github.com/torvalds/linux/blob/master/fs/xfs/xfs_discard.c#L155), which is what the fstrim command uses (see: https://github.com/karelzak/util-linux/blob/master/sys-utils/fstrim.c#L87). When run against local SSDs, it severely impacts IO in general and MongoDB specifically.
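For readers unfamiliar with the interface, here is a minimal Python sketch of the request that fstrim sends: a single `struct fstrim_range` (start, len, minlen) passed to the FITRIM ioctl. The helper names `build_fitrim_request` and `trim` are hypothetical, the ioctl number is derived from the generic Linux `_IOC` encoding (a few architectures use a different layout), and calling `trim` needs root and will actually discard free space, so treat it as illustration only.

```python
import fcntl
import os
import struct

# Generic _IOC bit layout from include/uapi/asm-generic/ioctl.h.
# Assumption: the target architecture uses this generic layout.
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = 8
_IOC_SIZESHIFT = 16
_IOC_DIRSHIFT = 30
_IOC_WRITE = 1
_IOC_READ = 2

# struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }
FSTRIM_RANGE = struct.Struct("=QQQ")


def _iowr(type_char, nr, size):
    """Equivalent of the kernel's _IOWR(type, nr, size) macro."""
    return (
        ((_IOC_READ | _IOC_WRITE) << _IOC_DIRSHIFT)
        | (size << _IOC_SIZESHIFT)
        | (ord(type_char) << _IOC_TYPESHIFT)
        | (nr << _IOC_NRSHIFT)
    )


# FITRIM is defined as _IOWR('X', 121, struct fstrim_range).
FITRIM = _iowr("X", 121, FSTRIM_RANGE.size)


def build_fitrim_request(start=0, length=2**64 - 1, minlen=0):
    """Pack one trim range; the defaults cover the whole filesystem."""
    return bytearray(FSTRIM_RANGE.pack(start, length, minlen))


def trim(mountpoint, start=0, length=2**64 - 1, minlen=0):
    """Issue FITRIM on a mounted filesystem and return the bytes trimmed.

    Hypothetical helper: requires root and really discards free blocks.
    """
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        buf = build_fitrim_request(start, length, minlen)
        # The kernel writes the number of bytes trimmed back into `len`.
        fcntl.ioctl(fd, FITRIM, buf)
        _, trimmed, _ = FSTRIM_RANGE.unpack(buf)
        return trimmed
    finally:
        os.close(fd)
```

The request carries exactly one range per call, so it is up to the filesystem to decide which free extents inside that range are actually discarded.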

Essentially, XFS iterates over every allocation group and issues TRIMs for all free extents every time this ioctl is called. On top of that, Linux’s interface to the TRIM command is synchronous and does not support a vectorized list of ranges (see: https://github.com/torvalds/linux/blob/3fc9d690936fb2e20e180710965ba2cc3a0881f8/block/blk-lib.c#L112). Together, these lead to a large number of extraneous TRIM commands being issued to the disk for ranges that both the filesystem and the disk know to be free, and each of these commands has been observed to be slow (see: http://oss.sgi.com/archives/xfs/2011-12/msg00311.html). In practice, we have seen IO disruptions of up to 2 minutes. I realize that the duration of these disruptions may be controller dependent. Unfortunately, when running on a platform like AWS, one does not have the luxury of choosing specific hardware.

EXT4, on the other hand, tracks blocks that have been deleted since the previous FITRIM ioctl and targets subsequent TRIMs to the appropriate block ranges (see: http://blog.taz.net.au/2012/01/07/fstrim-and-xfs/). In real-world tests this significantly reduces the impact of fstrim, to the point that it is unnoticeable to the database / application. We are currently switching back to EXT4 as a result.
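To make the difference concrete, here is a toy Python model of the two strategies as described above. It is not kernel code: `ToyFilesystem`, `simulate`, and the extent counts are hypothetical, extents are never reallocated, and each discarded extent is assumed to cost exactly one TRIM command.

```python
class ToyFilesystem:
    """Tracks free extents and which of them were freed since the last trim."""

    def __init__(self, initial_free):
        self.free = set(range(initial_free))
        # Nothing has been trimmed yet, so every free extent is pending.
        self.freed_since_trim = set(self.free)
        self._next_id = initial_free

    def delete_extent(self):
        """Model a file deletion that frees one new extent."""
        extent = self._next_id
        self._next_id += 1
        self.free.add(extent)
        self.freed_since_trim.add(extent)

    def trim_all_free(self):
        """XFS-like behaviour described above: discard every free extent."""
        return len(self.free)

    def trim_incremental(self):
        """EXT4-like behaviour described above: discard only new frees."""
        issued = len(self.freed_since_trim)
        self.freed_since_trim.clear()
        return issued


def simulate(rounds, initial_free, deletes_per_round):
    """Return total TRIM commands issued by each strategy over all rounds."""
    fs = ToyFilesystem(initial_free)
    totals = {"all_free": 0, "incremental": 0}
    for _ in range(rounds):
        for _ in range(deletes_per_round):
            fs.delete_extent()
        totals["all_free"] += fs.trim_all_free()
        totals["incremental"] += fs.trim_incremental()
    return totals


if __name__ == "__main__":
    print(simulate(rounds=3, initial_free=1000, deletes_per_round=10))
```

In this model the first trim costs the same for both strategies; after that, the all-free strategy keeps paying for the entire free space on every run, while the incremental one pays only for what was deleted in between.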

Alternatively, we could mount the filesystem with the discard option (as AWS suggests here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html). However, we are not confident that this would perform better, given XFS developer comments on the subject (see: http://oss.sgi.com/archives/xfs/2014-08/msg00465.html):

It was introduced into XFS as a checkbox feature. We resisted as
long as we could, but too many people were shouting at us that we
needed realtime discard because ext4 and btrfs had it. Of course,
all those people shouting for it realised that we were right in that
it sucked the moment they tried to use it and found that performance
was woeful. Not to mention that SSD trim implementations were so bad
that they caused random data corruption by trimming the wrong
regions, drives would simply hang randomly and in a couple of cases
too many trims too fast would brick them...

So, yeah, it was implement because lots of people demanded it, not
because it was a good idea.

I am aware that MongoDB strongly recommends using XFS (see: https://docs.mongodb.com/manual/administration/production-notes/#kernel-and-file-systems) and that this is because EXT4 journaling could impact WiredTiger checkpointing under heavy write load (https://groups.google.com/forum/#!msg/mongodb-user/diGdooN_2Sw/4H7t5JTDcpAJ). Can you elaborate on this? Is this the only concern that drove the strong recommendation to go with XFS? And, in MongoDB’s opinion, is the recommendation still valid given the performance issues with TRIM on Linux when running XFS on SSDs? We are currently running the MMAPv1 storage engine on MongoDB 2.6 and, as mentioned above, we have reverted to EXT4 without apparent consequence. Any more info that you could provide would really help us in weighing the pros and cons while we work toward WiredTiger.

Also, any more general recommendations for mitigating the disruption incurred by running fstrim would be more than welcome.



 Comments   
Comment by Gregory Banks [ 20/Sep/16 ]

https://groups.google.com/forum/#!topic/mongodb-user/Mj0x6m-02Ms

Comment by Gregory Banks [ 16/Sep/16 ]

Thanks Thomas. I'll move discussion over to the group.

Cheers,
Greg

Comment by Kelsey Schubert [ 16/Sep/16 ]

Hi gregbanks,

Thank you for the detailed question. We recommend XFS since we have observed long pauses related to EXT4. However, if you have tested your workload with WiredTiger on EXT4 and see better results, then I don't see a reason why you can't move forward with it.

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion, please post on the mongodb-users group or on Stack Overflow with the mongodb tag. A question like this, which involves more discussion, would be best posted on the mongodb-users group.

Kind regards,
Thomas
