[SERVER-59240] Review WiredTiger default settings for engine and collections. Created: 11/Aug/21  Updated: 23/Jan/24

Status: Blocked
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Luke Pearson Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: or
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to WT-11173 Reduce the default dirty cache limit ... Open
Assigned Teams:
Storage Execution
Sprint: Execution Team 2022-02-21, Execution Team 2022-03-07, Execution Team 2022-03-21
Participants:
Case:

 Description   

The execution layer opens WiredTiger with an eviction thread minimum and maximum of 4, a value decided ~6.5 years ago in SERVER-16602. Given how much the code base has changed since then, we should review the default settings used when opening the connection to WiredTiger, and potentially also review the collection create settings.
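
For reference, a minimal sketch (not the server's actual code path) of how this default reaches WiredTiger, assuming a standalone application using the public WiredTiger C API; the eviction thread pool is part of the wiredtiger_open() configuration string:

#include <wiredtiger.h>

/* Illustrative helper, not MongoDB code: open a connection with a fixed
 * pool of 4 eviction worker threads, the default this ticket questions.
 * The cache_size value is a placeholder. */
int open_with_current_defaults(const char *home, WT_CONNECTION **connp)
{
    return wiredtiger_open(home, NULL,
        "create,cache_size=1GB,"
        "eviction=(threads_min=4,threads_max=4)",
        connp);
}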

Additionally, from a hardware standpoint, machines today are generally faster and have more resources available. I am specifically interested in the eviction threads min / max configuration: stressful workloads on large machines could make use of more than four eviction threads, which would avoid pulling application threads into eviction as frequently.

Other values of interest (see the sketch after this list) are:

  • Eviction targets: dirty target and clean target
  • leaf_page_max
  • memory_page_max
  • prefix_compression
  • split_pct
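
A minimal sketch of where each of these knobs lives in the public WiredTiger C API, again assuming a standalone application rather than the server's storage engine code; the values shown are illustrative placeholders, not MongoDB's current defaults:

#include <wiredtiger.h>

/* Hypothetical helper for experimentation: the eviction targets are
 * connection-wide (percentages of the cache) and go into wiredtiger_open(),
 * while leaf_page_max, memory_page_max, prefix_compression and split_pct are
 * per-table settings passed to WT_SESSION::create() for each collection. */
static int open_and_create_example(const char *home, WT_CONNECTION **connp)
{
    WT_SESSION *session;
    int ret;

    if ((ret = wiredtiger_open(home, NULL,
            "create,"
            "eviction_target=80,eviction_trigger=95,"
            "eviction_dirty_target=5,eviction_dirty_trigger=20",
            connp)) != 0)
        return ret;

    if ((ret = (*connp)->open_session(*connp, NULL, NULL, &session)) != 0)
        return ret;

    /* One table per collection; the per-table knobs are supplied here. */
    return session->create(session, "table:example",
        "key_format=q,value_format=u,"
        "leaf_page_max=32KB,memory_page_max=10MB,"
        "prefix_compression=false,split_pct=90");
}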

I think that since SERVER-16602 a lot of these configurations have been exposed to users and can be configured manually, so they aren't as interesting for review.

As for the work here, I think we'd need to tune the values, run perf tests, collate the data and then decide which value is best. This is potentially a lot of work and could be split into a ticket per configuration.



 Comments   
Comment by Daniel Gomez Ferro [ 31/Mar/22 ]

Sorry for the (very) late answer, daniel.gottlieb. My understanding is that there's only one thread (the BackgroundSync) reading data from the primary and feeding it to the replication workers through the OplogBuffer, so the odds are even worse in that regard (1 vs 1024 threads). I ran a quick experiment with execution control enabled (PM-1723), which uses a FIFO queue to order WT operations and should produce fairer scheduling and avoid starvation; however, the problem was still reproducible. It's possible that the starvation happened at other layers, though.

In any case, we decided to put this ticket into the backlog to investigate it properly at a later time since the benefits weren't immediate.

Comment by Daniel Gottlieb (Inactive) [ 11/Mar/22 ]

"I couldn't figure out why the durable lag increases consistently."

You've already done much more research on MDB in this area than I have, so apologies if this isn't a useful idea to keep in mind. But in case the thought hasn't been propagated lately – it's typically assumed that primaries perform better than secondaries. I'm not sure how one would best isolate this, but I wonder if increasing the number of eviction threads has the consequence of starving replication worker threads. IIRC we use 8 or 16 replication worker threads (but notably, this is a fixed number much much smaller than the 1024 clients that are vying for the primary's attention).

Comment by Daniel Gomez Ferro [ 11/Mar/22 ]

Our performance tests run with 8 cores, so I focused on the build that sets eviction threads = number of cores.

I investigated one of the regressions, ParallelInsert-1024.Insert_W1_JTrue.34, which had a -32% throughput change.

In this test we used to have durable lag (and hence replication lag) spiking up to 3 or 4 seconds at the start of some phases (for W1_JTrue and W1). With the increased eviction threads, the durable lag increased to 5s and flow control kicked in, creating a large performance regression due to the high concurrency of the test (1024 threads). I couldn't figure out why the durable lag increases consistently.

Another large regression happened on YCSB 60GB: -35% ops_per_sec during the load phase.

In this test it looks like there's cache thrashing at the WT level: threads spend more time reading data from disk into the cache, possibly because we are evicting pages more aggressively.

Comment by Daniel Gomez Ferro [ 03/Mar/22 ]

Many workloads improved when setting the number of eviction threads to the number of cores, but there are some significant regressions too, especially for high latency percentiles: https://dag-metrics-webapp.server-tig.staging.corp.mongodb.com/perf-analyzer-viz/?evergreen_version=621f84999ccd4e75c3af2581&evergreen_base_version=sys_perf_ae0c9cf8327d54470175ac8a450df8f08e77578a

I'm running another test with eviction worker threads set to half the available cores, to see if it would help with those specific workloads.

Comment by Louis Williams [ 02/Mar/22 ]

daniel.gomezferro, let's raise the maximum number of eviction worker threads to the minimum of the number of available CPU cores and 20 (i.e. WiredTiger's maximum). Then we can run our performance workloads and see if there are any significant regressions. We should also open another ticket to consider changing the other parameters, since that investigation will likely require much more analysis and time.
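
A rough sketch of that cap, assuming a standalone application using the public WiredTiger C API and POSIX sysconf() for the core count; the helper name and fallback behaviour are illustrative, not the actual server change:

#include <stdio.h>
#include <unistd.h>
#include <wiredtiger.h>

/* Hypothetical helper: cap the eviction worker pool at
 * min(available CPU cores, 20), 20 being the maximum mentioned above.
 * Falls back to 20 if the core count cannot be determined. */
int apply_eviction_thread_cap(WT_CONNECTION *conn)
{
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    long threads = (cores > 0 && cores < 20) ? cores : 20;
    char config[64];

    snprintf(config, sizeof(config),
        "eviction=(threads_min=1,threads_max=%ld)", threads);
    /* Eviction settings can be applied to a live connection via reconfigure(). */
    return conn->reconfigure(conn, config);
}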

CC josef.ahmad

Comment by Luke Pearson [ 11/Aug/21 ]

I can try to dig up some help tickets involving a stressed cache if that adds value to this ticket. I do understand that this would be a fairly large chunk of work, so if there isn't a need for it this ticket can be closed or de-prioritized.

 
