[SERVER-37590] Investigate performance implications of using WiredTiger read-once cursors Created: 12/Oct/18  Updated: 08/Apr/20  Resolved: 07/Nov/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Louis Williams Assignee: Louis Williams
Resolution: Done Votes: 0
Labels: nyc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-37269 Use read_once cursors to do foregroun... Closed
is depended on by SERVER-36068 Expose a user-accessible cursor optio... Closed
Related
is related to SERVER-47118 Investigate the behavior of read_once... Backlog
Sprint: Storage NYC 2018-11-05, Storage NYC 2018-11-19
Participants:

 Description   

Using read_once=true cursors has not yet been shown to provide any performance benefit, at least in the context of background index builds. Investigate why that is the case, and decide when they should or should not be used.
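
For context, read_once is a cursor-open configuration option in WiredTiger. Below is a minimal sketch of what opening such a cursor looks like through the WiredTiger Python bindings; the table name, schema, cache size, and data are placeholders and not taken from this investigation:

# Minimal sketch of opening a read_once cursor through the WiredTiger Python
# bindings. The table name, schema, cache size, and data are placeholders for
# illustration; the "read_once=true" open_cursor configuration is the
# relevant part.
from wiredtiger import wiredtiger_open

conn = wiredtiger_open("WT_HOME", "create,cache_size=1GB")
session = conn.open_session()
session.create("table:cold", "key_format=S,value_format=S")

# Populate a placeholder record so the scan below has something to read.
insert = session.open_cursor("table:cold", None)
insert["key-1"] = "value-1"
insert.close()

# A read_once cursor marks the pages it reads in as preferred eviction
# candidates, so a one-time scan (e.g. an index build's collection scan)
# should displace less of the "hot" data already in the cache.
scan = session.open_cursor("table:cold", None, "read_once=true")
for key, value in scan:
    pass  # e.g. feed each record to an external sorter for the index build
scan.close()

conn.close()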



 Comments   
Comment by Tess Avitabile (Inactive) [ 07/Nov/18 ]

Thanks, louis.williams! And thanks for the detailed explanation of your perf investigation.

Comment by Louis Williams [ 07/Nov/18 ]

Closing because we are going to proceed with SERVER-37269 to use read_once cursors for index builds.

tess.avitabile/matthew.russotto SERVER-36068 should be unblocked now

 

Comment by Louis Williams [ 06/Nov/18 ]

I've done a significant amount of testing to try to demonstrate a change in index build speed and read throughput, but found that challenging for several reasons. Instead, what I will provide is an examination of the changes in cache behavior, which indicates that, even though there are neither significant demonstrable improvements nor regressions for those metrics, this change will in general still be desirable.

First, why was it hard to demonstrate a performance improvement or regression? These are the outstanding issues with my workload:

  • Reliability: over 5 runs of the same workload, the results can vary by up to 10-15%. It is hard to draw any conclusions about performance because the workload metrics are not particularly reliable.
  • Data: I used highly compressible test documents (7:1 compression on average), which means a cache miss incurs a lower penalty: less data needs to be paged in.
  • Filesystem cache: it turns out the operating system is really good at caching data, which means the penalty of a cache miss is extremely low; the test was previously never actually going to disk. I also ran workloads after clearing the filesystem cache before each run, which showed some promising improvements in read throughput, but because of the other two issues listed, nothing that should be used to draw conclusions.

Procedure

I modified my test procedure as follows (a rough sketch of the procedure follows the list):

  • Load data: each collection's size is 75% of the total WT cache size of 2GB
  • Shutdown mongod then clear caches: sync && echo 3 | tee /proc/sys/vm/drop_caches
  • Do a collection scan on the hot collection to load it into the WT cache
  • Print collection.stats().wiredTiger.cache for both collections
  • Perform an index build on the "cold" collection while performing randomized reads on the "hot" collection
  • Print collection.stats().wiredTiger.cache for both collections again
  • Time a collection scan on the "hot" collection, which should provide an indication of how much data is still in cache.

Results

Each block of two columns (hot, cold) shows data for read-once false, read-once true, and the difference (true minus false), respectively. Each row shows, for each collection, how much data was in cache before the index build, after it, and the difference between the two measurements.

"bytes currently in cache"

             read-once: FALSE                read-once: TRUE                  diff: read-once TRUE - FALSE
             hot             cold            hot             cold             hot             cold
before       1,780,855,108   104,107         1,780,855,108   104,107          0               0
after        765,809,035     943,253,031     1,692,619,364   11,658,137       926,810,329     -931,594,894
difference   -1,015,046,073  943,148,924     -88,235,744     11,554,030       926,810,329     -931,594,894

This can be interpreted as follows:

  • The "hot" collection had 927MB more data in cache with read-once: true than with read-once: false
  • The "cold" collection had 932MB less data in cache with read-once: true than with read-once: false 
  • This shows there is more "hot" data available in the WT cache.

"unmodified pages evicted"

             read-once: FALSE    read-once: TRUE     diff: read-once TRUE - FALSE
             hot       cold      hot       cold      hot        cold
before       0         0         0         0         0          0
after        34,635    27,460    3,377     61,215    -31,258    33,755
diff         34,635    27,460    3,377     61,215    -31,258    33,755

This can be interpreted as follows:

  • The "hot" collection had 31k fewer cache evictions with read-once: true than with read-once: false
  • The "cold" collection had 34k more cache evictions with read-once: true than with read-once: false 
  • This shows there is more "hot" data being kept in cache, though there were more cache evictions overall

Additionally, I mentioned that I ran a single collection scan on the "hot" collection after each test. Over 5 trials, with very low variance, those collection scans were in fact faster with read-once because more "hot" data was already in cache.

          collection scan time (ms)
          read-once: FALSE   read-once: TRUE   % change
avg       6325.6             5996.6            -5.20%
min       6229               5916              -5.02%
stddev    67.88              54.34
  • This 5.2% decrease in scan time is significant given the low variance across trials
  • This shows that even when WT cache misses for this collection scan go entirely to filesystem cache and never disk, there is still a measurable performance improvement.

Conclusion

Given this data, I think it is safe to conclude that read-once cursors do provide some measurable benefit by reducing cache thrashing, and do not appear to introduce significant disadvantages. The data implies they behave exactly as we would expect, so my belief is that we should plan on using read-once cursors by default wherever we believe they may be beneficial.

Comment by Daniel Gottlieb (Inactive) [ 03/Nov/18 ]

I also ran louis.williams' linked patch and noticed there are no read requests going to disk during the run. I believe the OS filesystem cache is effectively keeping everything in memory.

Comment by Bruce Lucas (Inactive) [ 03/Nov/18 ]

A couple of other things to consider:

If the hot collection fits entirely in the o/s filesystem cache, then the cost of a WT cache miss may not be that high because it doesn't require i/o. Do you see a large difference in read throughput comparing no index build with index build?

Maybe the high rate of activity on the hot collection keeps it in cache whereas the cold collection gets evicted preferentially because it's only read once. You might see a larger difference if you stop the reads, do the index build, then start the reads again - initially the rate should be lower while the hot collection fills the cache again.

Comment by Daniel Pasette (Inactive) [ 02/Nov/18 ]

Have you examined the ftdc data to confirm that the collection-specific cache stats on the cold table are behaving as you expect? Bruce or Kelsey can show you how to gather those stats and feed them into t2.

Comment by Louis Williams [ 02/Nov/18 ]

I'll summarize here some tests that show no significant effect from using read_once cursors, at least for foreground index builds, which are similar to how future hybrid index builds will behave.

I made a test that builds a foreground index on one "cold" collection and performs random point lookups on another "hot" collection. The WT cache size was fixed at 1GB, and each collection was filled with 4KB documents, enough to fill 90% of the cache. I measured the total index build time (ms) and total read ops/s across all client threads over that period.
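
A sketch of that kind of harness is below (pymongo); it assumes the load phase used sequential integer _id values and that "a" is the field being indexed, both of which are placeholders that may differ from the actual test:

# Sketch of this kind of harness: reader threads do random point lookups on
# the "hot" collection while the main thread builds an index on the "cold"
# collection. It assumes the load phase used sequential integer _id values
# and that "a" is the field being indexed; both are placeholders.
import random
import threading
import time

from pymongo import MongoClient

NUM_THREADS = 16

client = MongoClient("localhost", 27017)
db = client.test

stop = threading.Event()
op_counts = [0] * NUM_THREADS


def reader(idx, num_docs):
    while not stop.is_set():
        db.hot.find_one({"_id": random.randrange(num_docs)})
        op_counts[idx] += 1


num_docs = db.hot.estimated_document_count()
threads = [threading.Thread(target=reader, args=(i, num_docs))
           for i in range(NUM_THREADS)]
for t in threads:
    t.start()

start = time.monotonic()
db.cold.create_index("a")  # foreground index build on the cold collection
build_ms = (time.monotonic() - start) * 1000

stop.set()
for t in threads:
    t.join()

print("index build time: %.0f ms" % build_ms)
print("read ops/s: %.1f" % (sum(op_counts) / (build_ms / 1000.0)))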

I ran 5 trials each of read_once=true and read_once=false:

readOnce?   clientThreads   docs as % of cache size   index build time (ms)                 read ops/s
                                                      stddev      avg         % change      stddev    avg           % change
FALSE       16              90%                       2,065.14    35260.00                  561.46    8506.514286
TRUE        16              90%                       4,263.36    36954.00    4.80%         742.81    8401.594595   -1.23%

I analyzed the data in t2 and can confirm there is less cache pressure in general, at least in cases where the cache usage exceeds 80%, but the results do not show anything significant.

I am open to ideas for a testing configuration that may better demonstrate improvements, but at least in the scenario we would expect to see improvement, there is none.

Comment by Tess Avitabile (Inactive) [ 01/Nov/18 ]

Thank you for letting me know. I am not too concerned about bitrot for SERVER-36068. We're going to close out the Faster Initial Sync project soon, so we will remove SERVER-36068 from the project.

Comment by Eric Milkie [ 01/Nov/18 ]

I don't expect any further movement on this issue for another two sprints at least. We could expedite it if you feel that the work already staged for SERVER-36068 will bitrot.

Comment by Tess Avitabile (Inactive) [ 01/Nov/18 ]

milkie, when do you expect the Storage team to look into this issue? I'm interested in when we should plan to schedule SERVER-36068.
