[SERVER-22819] WiredTiger collection file read in 4k blocks Created: 23/Feb/16  Updated: 28/Mar/16  Resolved: 28/Mar/16

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.2.3
Fix Version/s: None

Type: Question Priority: Minor - P4
Reporter: Vlad Galu Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

FreeBSD 10.2, ZFS


Participants:

 Description   

We are running an idle replica set to which we have just added a new secondary. Initial synchronization takes a long time, and we have narrowed it down to the primary reading the collection file from the filesystem in 4k chunks. Is this by design, or are we doing something wrong? We could surely speed this up by using a larger block size. The 4k chunk size obviously makes sense on a write-busy primary, but we were wondering whether any autotuning is available to make it read the data set more aggressively while idle.
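For illustration, the effect we suspect can be checked with a trivial pread() benchmark like the sketch below (the file name is a placeholder, and the numbers only mean something if the filesystem caches are in a comparable state between runs):

/* Minimal sketch: sequential pread() throughput at two block sizes.
   The path is a placeholder; run each case against a cold cache. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

static double read_sequentially(const char *path, size_t block_size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }

    char *buf = malloc(block_size);
    off_t offset = 0;
    ssize_t n;
    struct timeval start, end;

    gettimeofday(&start, NULL);
    while ((n = pread(fd, buf, block_size, offset)) > 0)
        offset += n;
    gettimeofday(&end, NULL);

    free(buf);
    close(fd);
    return (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
}

int main(void)
{
    printf("4KB reads:   %.2fs\n", read_sequentially("collection-0.wt", 4096));
    printf("128KB reads: %.2fs\n", read_sequentially("collection-0.wt", 131072));
    return 0;
}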



 Comments   
Comment by Ramon Fernandez Marina [ 28/Mar/16 ]

vgalu, have you had a chance to test Michael's suggestions above? Since there's no bug in the server, and the SERVER project is for reporting bugs and feature requests against the MongoDB server, I'm going to close this ticket for the time being.

If some of your tests show that a different setting in WiredTiger provides performance improvements we can always reopen this ticket and repurpose it as an improvement request.

Thanks,
Ramón.

Comment by Michael Cahill (Inactive) [ 14/Mar/16 ]

vgalu, from your description, it sounds like prefetch / readahead would help in this case.

In some previous benchmarks with WiredTiger, we found that readahead caused more I/O and lower throughput for some workloads, so we currently use posix_fadvise with the POSIX_FADV_RANDOM flag to hint that readahead should be disabled.
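For context, the hint is roughly equivalent to the following (an illustrative sketch with a placeholder file name, not the actual WiredTiger code path):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    int fd = open("collection-0.wt", O_RDONLY);   /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the kernel to expect random access on this descriptor; on
       filesystems that honor the hint, this effectively disables
       readahead for subsequent reads. */
    int err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
    if (err != 0)
        fprintf(stderr, "posix_fadvise: %s\n", strerror(err));

    /* ... reads on fd now proceed without kernel readahead ... */
    return 0;
}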

Are you able to run tests with a local build of MongoDB? If so, try disabling HAVE_POSIX_FADVISE in src/third_party/wiredtiger/build_freebsd/wiredtiger_config.h and rebuilding to see whether that improves performance. If it does, we can discuss whether changing the default behavior is reasonable.
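Concretely, the change would look something like this (the exact line in the generated header may differ on your build):

/* src/third_party/wiredtiger/build_freebsd/wiredtiger_config.h */

/* Before: the POSIX_FADV_RANDOM hint is compiled in. */
#define HAVE_POSIX_FADVISE 1

/* After: comment the define out (or #undef it) and rebuild mongod, so
   WiredTiger stops hinting random access and normal readahead applies. */
/* #define HAVE_POSIX_FADVISE 1 */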

Comment by Vlad Galu [ 26/Feb/16 ]

Hi ~michael.cahill, thanks for looking into this.

Our application uses its own _id fields, which are very high-cardinality 16-byte arrays. For all intents and purposes, they can be considered random. The insertion process is indeed aggressive, but it does not use the bulk feature, writing one document at a time with

{ w: majority }

instead.
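A stripped-down illustration of that write pattern (connection string, database, collection, and payload are placeholders, and it is shown with the MongoDB C driver purely for illustration) would be:

#include <mongoc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    mongoc_init();

    /* Connection string, database, and collection names are placeholders. */
    mongoc_client_t *client = mongoc_client_new("mongodb://primary.example:27017");
    mongoc_collection_t *coll = mongoc_client_get_collection(client, "db", "coll");

    /* w: majority, as in our ingestion process. */
    mongoc_write_concern_t *wc = mongoc_write_concern_new();
    mongoc_write_concern_set_wmajority(wc, 10000 /* wtimeout, ms */);

    /* Random 16-byte _id, standing in for our high-cardinality keys. */
    uint8_t id[16];
    arc4random_buf(id, sizeof(id));

    bson_t *doc = bson_new();
    bson_append_binary(doc, "_id", -1, BSON_SUBTYPE_BINARY, id, sizeof(id));
    bson_append_utf8(doc, "payload", -1, "example", -1);

    bson_error_t error;
    /* One document at a time, no bulk API. */
    if (!mongoc_collection_insert(coll, MONGOC_INSERT_NONE, doc, wc, &error))
        fprintf(stderr, "insert failed: %s\n", error.message);

    bson_destroy(doc);
    mongoc_write_concern_destroy(wc);
    mongoc_collection_destroy(coll);
    mongoc_client_destroy(client);
    mongoc_cleanup();
    return 0;
}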

The hybrid ZFS pool uses 4k blocks on the spinning drives and otherwise bog-standard settings, except for the ARC, which is capped at 8GB (the rest of the RAM, up to 32GB, is the WiredTiger cache). The L2ARC is sized at 400GB, of which about 10GB was used during these tests. zlib is the compression algorithm for both the journal and collection files.

When we looked at the truss output, we noticed that the offsets passed to pread() calls on the collection file were sequential rather than random, hinting at a programmatically imposed constraint.

Hope this helps,
Vlad

Comment by Michael Cahill (Inactive) [ 26/Feb/16 ]

vgalu, sorry to hear that you are having performance problems with MongoDB and WiredTiger.

WiredTiger usually lays out files sequentially and reads them in the same order, and it uses variable-sized blocks internally (in multiples of 4KB). With default settings, we try to create blocks that are close to 32KB in memory, then compress them to whatever size snappy compression produces.
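For reference, expressed through the standalone WiredTiger C API, sizing along those lines would look roughly like this (home directory and table name are placeholders; standalone builds also need the snappy compressor extension loaded at wiredtiger_open):

#include <stdio.h>
#include <wiredtiger.h>

int main(void)
{
    WT_CONNECTION *conn;
    WT_SESSION *session;

    /* "WT_HOME" is a placeholder directory. */
    if (wiredtiger_open("WT_HOME", NULL, "create", &conn) != 0 ||
        conn->open_session(conn, NULL, NULL, &session) != 0) {
        fprintf(stderr, "failed to open WiredTiger\n");
        return 1;
    }

    /* Block sizing roughly as described above: on-disk blocks allocated
       in 4KB multiples, in-memory leaf pages targeting ~32KB, blocks
       compressed with snappy before being written out. */
    session->create(session, "table:example",
        "key_format=S,value_format=S,"
        "allocation_size=4KB,leaf_page_max=32KB,block_compressor=snappy");

    conn->close(conn, NULL);
    return 0;
}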

Can you describe in more detail how the data on your primary was created? Do you just insert documents, or do you have an update-heavy workload? Do you use default _id fields or set your own, and if the latter, how are they generated? This matters because many operations read documents via the _id index, so random keys can lead to random rather than sequential I/O patterns.

In terms of things that can improve performance, how have you configured readahead/prefetch for the filesystem? Are the pools backed by SSDs or spinning disks, and if the latter, do you have an SSD cache?
