[SERVER-7287] --tinyfiles (32M) option for effective GridFS datafile caching on large file server Created: 07/Oct/12  Updated: 15/Feb/13  Resolved: 10/Oct/12

Status: Closed
Project: Core Server
Component/s: GridFS, Performance, Storage
Affects Version/s: 2.2.0
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Alex Yam Assignee: Mathias Stearn
Resolution: Done Votes: 0
Labels: Cache, GridFS, ZFS, chunks, datafiles
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ZFS


Participants:

 Description   

Currently the --smallfiles option caps the datafiles that hold GridFS chunks at 512MB. This is not small enough for effective read caching on ZFS, and it is causing problems on our ZFS+GridFS file server:

1. Our GridFS server runs on ZFS, total data size is ~30TB, Ram is 64G.

2. With the --smallfiles option, GridFS stores the ~30TB (~12 million files) as chunks in ~60,000 x 512MB datafiles.

3. ZFS first uses RAM as a file read cache; when RAM is filled up, ZFS then uses the SSD as a second-level read cache (L2ARC).

4. Our application has ~30,000 hot files being fetched frequently at random hours.

5. GridFS spreads the chunks of these 30,000 files across 1,000+ 512MB datafiles.

6. With a 256GB SSD as the ZFS read cache, there is only enough space to cache ~460 of the 512MB GridFS datafiles.

7. Requests from the application to GridFS for chunks in datafiles not cached by ZFS cause excessive mechanical disk seeks.

8. Adding more RAM does almost nothing to help the situation. A bigger (512GB) SSD is not cost effective, and it would still cause disk seeks whenever a requested chunk is stored in the ~921st datafile or beyond.

9. If --tinyfiles enabled GridFS to save chunks into 32MB datafiles, a 256GB SSD could cache ~7,400 datafiles, which would contain all the hot chunks and reduce mechanical disk seeks to almost zero (see the back-of-envelope sketch after this list).
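
A rough back-of-envelope sketch of the sizing argument in points 6, 8 and 9 (Python, purely illustrative; the ~460 and ~7,400 figures above presumably account for some L2ARC overhead that the raw division below ignores):

MiB = 1024 ** 2
ssd_cache_bytes = 256 * 10 ** 9            # 256GB read-cache SSD (decimal GB)

for datafile_mb in (512, 32):              # --smallfiles today vs. the proposed --tinyfiles
    fits = ssd_cache_bytes // (datafile_mb * MiB)
    print(f"{datafile_mb}MB datafiles: ~{fits} fit on the 256GB SSD")

# 512MB datafiles: ~476 fit, but the hot chunks span 1,000+ datafiles (point 5),
#                  so a large share of hot reads miss the SSD and hit the spindles.
# 32MB datafiles:  ~7,629 fit, comfortably covering the hot working set.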

Bottom line: a tiny (32MB) datafile size may not make a difference for normal MongoDB data, but it is critical for large GridFS servers on ZFS.



 Comments   
Comment by Alex Yam [ 10/Oct/12 ]

After a few days of head scratching, we finally found the problem: the HDD seeks were caused by Nginx deleting old cache files and writing new ones (LRU replacement when the cache zone is full). We missed this setting when we merged the Nginx config files from different servers; as a result, the large files were sharing the same small cache zone as the avatars/images.

It took a while to find the problem because the blinking was inconsistent; there were 4 different caches in play, and some get flushed on reboot while others don't:
1. ram cache for MongoDB
2. ram cache for ZFS
3. SSD cache for ZFS
4. HDD cache for Nginx

The problem was solved by placing the images/avatars and the large files in different Nginx cache zones. After the change, there are no more disk seeks for images/avatars even after hammering the large-file server.
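
For anyone curious why the shared zone behaved so badly, here is a toy LRU cache in Python (purely illustrative; the file names, sizes and zone capacity are made up, not our real Nginx setup) showing how a handful of large downloads sharing one undersized zone with hot avatars keeps evicting the avatars, so every pass over the avatar list goes back to disk:

from collections import OrderedDict

class LruZone:
    """Toy stand-in for a cache zone with LRU eviction."""
    def __init__(self, capacity_mb):
        self.capacity = capacity_mb
        self.used = 0
        self.items = OrderedDict()              # key -> size in MB, oldest first

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)         # refresh LRU position
            return True                         # cache hit
        return False                            # miss -> upstream/disk read

    def put(self, key, size_mb):
        self.items[key] = size_mb
        self.used += size_mb
        while self.used > self.capacity:        # evict least recently used
            _, evicted_size = self.items.popitem(last=False)
            self.used -= evicted_size

zone = LruZone(capacity_mb=1000)                # one shared zone for everything
avatar_misses = 0
for round_no in range(3):
    for i in range(500):                        # 500 hot 1MB avatars
        if not zone.get(f"avatar_{i}.jpg"):
            avatar_misses += 1
            zone.put(f"avatar_{i}.jpg", 1)
    for i in range(10):                         # a few 300MB downloads per round
        key = f"bigfile_{round_no}_{i}"
        if not zone.get(key):
            zone.put(key, 300)

print("avatar cache misses across 3 rounds:", avatar_misses)
# With a dedicated avatar zone, misses after the first round drop to zero.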

Hope this ticket can help others who come across the same situation.

Comment by Alex Yam [ 07/Oct/12 ]

If the ZFS cache is block based, then we must have misconfigured something.

We recently merged multiple GridFS DBs onto a new ZFS server to improve reliability. When we stress tested the GridFS server, we discovered that the HDD lights on the hot-swap bays were blinking nonstop.

Stress tests were done using http_load and 2 URL lists:
1st list consists of small files, with URLs equivalent to http://image_server/gridfs_fetch.php?filename=avatar_{$userid}.jpg
2nd list consists of larger/huge files, with URLs equivalent to http://download_server/gridfs_fetch.php?filename={$file_name}_{$file_id}.{$file_ext}
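
In case it helps someone reproduce this kind of test, here is a hypothetical sketch (Python) of how URL lists matching the two patterns above could be generated for http_load; the user IDs, file names and hosts are placeholders, not our real data:

avatar_user_ids = range(1, 1001)
large_files = [("backup", "507f1f77bcf86cd799439011", "zip"),
               ("video", "507f191e810c19729de860ea", "mp4")]

with open("urls_small.txt", "w") as out:        # 1st list: avatars / images
    for uid in avatar_user_ids:
        out.write(f"http://image_server/gridfs_fetch.php?filename=avatar_{uid}.jpg\n")

with open("urls_large.txt", "w") as out:        # 2nd list: larger / huge files
    for name, file_id, ext in large_files:
        out.write(f"http://download_server/gridfs_fetch.php?filename={name}_{file_id}.{ext}\n")

# The lists can then be fed to http_load, e.g. "http_load -parallel 10 -seconds 60 urls_small.txt"
# (flags from memory; check http_load's documentation).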

The OS is FreeBSD 9, the HTTP stack is Nginx + PHP-FPM, and the ZFS zpool contains a raidz3 vdev of 11 x 3TB disks, 8 of which are connected to an HBA and 3 to onboard SATA ports.

Our test procedures:
1. Reboot to flush all caches
2. Run stress test with 1st URL list multiple times (avatar and images URLs) — HDD lights on the hot swap bays blink in the first run, but not in the second/third run.
3. Run stress test with 2nd URL list (download URLs) — HDD lights blink for as long as it runs.
4. Run stress test with 1st URL list again — HDD lights blink again, but no blinking at second/third run.

Running "zpool iostat -v" from the shell shows the read cache SSD only has 16M free space left so the cache is working.

We are new to ZFS, so when the HDD lights blinked at step 4, we assumed it was caused by ZFS caching entire GridFS datafiles at step 3, which evicted what had been cached during step 2.

The blinking HDD lights tell us that if the system goes live, the random avatar/image fetches will wear out our disks more than necessary; this is where we are stuck at the moment.

Are there tools we can use to pinpoint exactly which files are being accessed?

Comment by Eliot Horowitz (Inactive) [ 07/Oct/12 ]

Filesystems (ZFS included) don't have to cache an entire file; they can cache only the pages they need.
Do you see evidence of something else here?
