[SERVER-70041] POC: Optimise available ticket count based on time spent in WiredTiger Created: 28/Sep/22  Updated: 14/Oct/22  Resolved: 14/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Sulabh Mahajan Assignee: Sulabh Mahajan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2022-10-11 at 10.40.55 am.png     PNG File Screen Shot 2022-10-11 at 10.41.05 am.png     PNG File Screen Shot 2022-10-11 at 11.26.31 am.png     PNG File Screen Shot 2022-10-11 at 11.26.45 am.png     PNG File Screen Shot 2022-10-13 at 10.03.19 am.png     PNG File Screen Shot 2022-10-13 at 10.08.54 am.png     PNG File Screen Shot 2022-10-13 at 10.29.06 am.png    
Issue Links:
Depends
Sprint: Execution Team 2022-10-03, Execution Team 2022-10-17
Participants:

 Description   

WiredTiger maintains several statistics that reflect how long operations take inside the storage engine and whether application threads are waiting on page reads or writes. We will experiment with adjusting the total available ticket count based on these statistics, to balance workload performance against concurrency into the storage engine.

The statistics that I think could be helpful are the following (a rough sketch of how they might be grouped follows the list):

  • Frequency distribution of the operation duration inside WiredTiger - available as a histogram
  • Frequency distribution of the disk latencies - available as a histogram
  • A count of the threads waiting at any moment to read pages from disk or write pages to disk.
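
As a rough illustration only, the snapshot below groups these statistics the way the tuning logic might consume them; the struct and field names are hypothetical and are not the actual WiredTiger statistics API.

// Hypothetical grouping of the statistics of interest; names are illustrative
// only, not the real WiredTiger statistics identifiers.
#include <array>
#include <cstdint>

struct WTConcurrencySnapshot {
    // Frequency distribution of operation duration inside WiredTiger,
    // bucketed by latency range (e.g. <100us, 100-249us, 250-499us, ...).
    std::array<std::uint64_t, 6> opDurationHistogram{};

    // Frequency distribution of disk latencies, same bucketing.
    std::array<std::uint64_t, 6> diskLatencyHistogram{};

    // Threads blocked at this moment waiting to read pages from disk or
    // write pages to disk.
    std::uint32_t threadsWaitingOnReads = 0;
    std::uint32_t threadsWaitingOnWrites = 0;
};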


 Comments   
Comment by Sulabh Mahajan [ 11/Oct/22 ]

Update:

  • Made code changes to trigger a periodic job whose purpose is to query the engine for the current "load" and for "recommendations" on changing the concurrency (a rough sketch of such a loop follows this list).
  • Added methods to the storage engine to support the above queries.
  • For WiredTiger, I am querying the perf histograms and the counts of threads waiting on reads/writes. I have also added a total operation count to the histogram.
  • I am still experimenting with the best way to derive a "load score" from these statistics.
  • An initial idea I am toying with is weighted buckets: buckets with higher latency cost more, and the bucket with the lowest latency costs 0. A weighted cost of the latency is then derived relative to the overall throughput.
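
For reference, a minimal sketch of what that periodic job could look like is below. The interface, method names, and the gating on a non-zero load score are assumptions made for illustration; the actual patch uses MongoDB's own periodic-job and storage-engine interfaces.

// Minimal sketch of the periodic tuning job; the interface and method names
// below are invented for illustration and do not match the actual patch.
#include <cstdint>

struct LoadScore {
    double readLoad = 0.0;
    double writeLoad = 0.0;
};

class StorageEngineForTuning {
public:
    virtual ~StorageEngineForTuning() = default;
    // Hypothetical queries added for this experiment.
    virtual LoadScore queryLoad() = 0;
    virtual std::int32_t recommendedReadTicketDelta() = 0;
    virtual std::int32_t recommendedWriteTicketDelta() = 0;
};

// Body of the periodic job, run on a timer (e.g. once per second).
void adjustConcurrency(StorageEngineForTuning& engine,
                       std::int32_t& readTickets,
                       std::int32_t& writeTickets) {
    const LoadScore load = engine.queryLoad();

    // Only follow the recommendation when the engine reports some load;
    // real code would also clamp the ticket counts and rate-limit changes.
    if (load.readLoad > 0.0)
        readTickets += engine.recommendedReadTicketDelta();
    if (load.writeLoad > 0.0)
        writeTickets += engine.recommendedWriteTicketDelta();
}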

I tried the above scheme with YCSB 60 - 100% reads, with 128 and 16 threads separately. Here is how the two runs compare from the perspective of the read load score:

The pattern is very clear when comparing the two runs. Next, I want to try adjusting the read tickets in real time when the load score rises above a certain value, for instance 50 or even 10.
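
Assuming a simple threshold-and-step policy, that adjustment could look like the following; the threshold, step size, and bounds are placeholders, not measured values.

// Sketch of the planned real-time adjustment: shrink the read ticket pool
// while the read load score is above a threshold, and let it grow back when
// the load subsides. All constants here are placeholders.
#include <algorithm>
#include <cstdint>

std::int32_t adjustReadTickets(double readLoadScore, std::int32_t currentReadTickets) {
    constexpr double kHighLoadThreshold = 50.0;  // candidate value from the experiment
    constexpr std::int32_t kStep = 8;            // placeholder step size
    constexpr std::int32_t kMinTickets = 16;
    constexpr std::int32_t kMaxTickets = 128;

    if (readLoadScore > kHighLoadThreshold) {
        // Latency distribution has a heavy tail: reduce concurrency into WT.
        return std::max(currentReadTickets - kStep, kMinTickets);
    }
    // Load is acceptable: allow concurrency to drift back up.
    return std::min(currentReadTickets + kStep, kMaxTickets);
}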

The load patterns for the writes are not as clear. Writes benefit from higher concurrency, but I also realised that, since the writes mostly go to the cache, the histogram buckets might not be sized correctly to capture a change in the latency distribution.

For instance, here is the load score for the writes with 128 and 16 tickets:


Note that the load score is calculated as follows, with bucket 0 (operations faster than 100us) getting a weight of 0. It is effectively a score that summarizes the distribution of latencies inside WiredTiger, and it is higher when the distribution shifts towards a larger tail in the collected histograms.

void WTLoadStats::calcLoads() {
    // Weighted sum of the read latency histogram buckets, normalised by the
    // read throughput (+1 to avoid dividing by zero). The sub-100us bucket
    // carries a weight of 0 and is therefore omitted.
    readLoad = ((opReadLatency100_249us * 2.5) + (opReadLatency250_499us * 5) +
                (opReadLatency500_999us * 10) + (opReadLatency1000_9999us * 100) +
                (opReadLatency10000us * 1000)) /
        (readThroughput + 1);

    // Same weighting applied to the write latency histogram.
    writeLoad = ((opWriteLatency100_249us * 2.5) + (opWriteLatency250_499us * 5) +
                 (opWriteLatency500_999us * 10) + (opWriteLatency1000_9999us * 100) +
                 (opWriteLatency10000us * 1000)) /
        (writeThroughput + 1);
}
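
To make the behaviour of the score concrete, here is a small standalone example with made-up bucket counts; the numbers are illustrative only, not measured data.

// Worked example of the score with hypothetical bucket counts for one
// sampling interval: 10,000 reads, most of them under 100us (weight 0),
// with a small tail of slower operations.
#include <iostream>

int main() {
    const double readThroughput = 10000;
    const double lat100_249us = 400, lat250_499us = 100, lat500_999us = 0,
                 lat1000_9999us = 10, lat10000us = 0;

    // Same weighting as calcLoads() above.
    const double readLoad = (lat100_249us * 2.5 + lat250_499us * 5 +
                             lat500_999us * 10 + lat1000_9999us * 100 +
                             lat10000us * 1000) /
        (readThroughput + 1);

    // (1000 + 500 + 1000) / 10001 ~= 0.25, i.e. a light read load.
    std::cout << "readLoad = " << readLoad << '\n';
    return 0;
}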
