WiredTiger / WT-8707

Mongod reader yielding with WT::commit_transaction instead of rollback pins WT pages such that pages cannot be split/evicted by writer threads


Details

    • Sprints: Storage - Ra 2022-02-21, Storage - Ra 2022-03-07, Storage - Ra 2022-03-21, Storage - Ra 2022-04-04, Storage - Ra 2022-04-18, Storage - Ra 2022-05-02

    Description

      Use Case

      In PM-2451 we're trying to change MongoDB read operations to keep valid cursor memory across resource yields. A yield normally drops MDB locks and releases WT resources (ending the WT transaction via WT::rollback_transaction); after the yield, MDB reacquires locks and WT resources (WT::begin_transaction). These operations perform no writes, only reads.

      The MDB layer has many raw pointers directly into the WT cursor's current document, and we wish to keep these pointers valid across a yield, rather than arduously having each pointer copy (potentially in duplicate) the pieces of data to which it points, across the many query plan stages. Therefore, PM-2451 changes the WT::rollback_transaction call to WT::commit_transaction, pinning the underlying WT page in memory so that MDB pointers can still access the last document retrieved from the WT cursor.
       

      Problem

      We turned the feature on and encountered a significant performance regression: BF-24012. A workload running concurrent read and update operations saw a ~50% increase in read throughput and an ~80% decrease in write throughput (1/5 the number of writes). The accompanying FTDC data suggests the problem is neither memory pressure nor CPU. Rather, the issue lies in the writer threads failing to split/evict the pages pinned by the reader threads. We theorize that there are so many concurrent readers that pages are constantly pinned, even though individual readers come and go, so writer threads rarely get the opportunity to modify pages that have grown too large: effectively, the readers are locking out the writers.

      I've attached FTDC data for the regression we encountered (774fe7d-regression-metrics*) as well as a base run of the workload before we turned the feature on (1a5150c-base-metrics*). Also attached is a screenshot of the metrics we thought relevant (non-standard, and different from the base comparison), on which we based our above hypothesis of the problem.

      Solution Space We've Brainstormed

      The essential requirement for MDB is that when data is fetched from WT by the MDB execution layer, MDB can either point directly at the WT data in the WT cache or make its own copy of the data to point at, so that the data remains valid if MDB decides to yield instead of fetching the next document: the data cannot move once MDB starts to use it. There are two problems with the copy solution: 1) making a copy of every document on each getNext was tried during MDB 3.0 development and caused a 2-3x increase in read latency; 2) attempting to copy a document only just before an MDB yield is difficult, because yields are imprecise to predict, with the worst case of falling back to copying every document read.

      Geert, Ian, Dianna and Keith B hopped on a Zoom call to talk about this the other night, with Keith newly introduced to the problem. Some thoughts we had:

      • Enhance the WT cursor API with a means of expressing that the page the cursor is pinning is in distress (WT wants to split/evict it) and needs to be unpinned.
      • WT changes such that writers can make copies of the data page and continue as usual, ignoring the pinned page, which is eventually deleted when no longer pinned (new readers would pin the newer version of the page data).
      • Data copies in the WT layer for readers, perhaps on demand with a flag, would be fine, but this again runs into the issue of predicting that a yield is coming, and of avoiding the worst case of copying every document read (if a yield is predicted but doesn't occur).
         

      Goal

      We would like the storage engine team's advice and expertise on whether any changes might be made to the WT layer to improve things for the use case described.

      Attachments

        1. 1a5150c-base-metrics.2022-01-07T00-40-40Z-00000
          1.14 MB
        2. 774fe7d-regression-metrics.2022-01-12T09-52-07Z-00000
          1.13 MB
        3. FTDC-Regression-Significant-Metrics-Screen-Shot.png
          FTDC-Regression-Significant-Metrics-Screen-Shot.png
          2.84 MB
        4. Screen Shot 2022-05-19 at 11.22.03 am.png
          Screen Shot 2022-05-19 at 11.22.03 am.png
          133 kB
        5. Screen Shot 2022-05-19 at 11.22.15 am.png
          Screen Shot 2022-05-19 at 11.22.15 am.png
          105 kB
        6. Screenshot from 2022-01-21 16-57-27.png
          Screenshot from 2022-01-21 16-57-27.png
          208 kB
        7. Screenshot from 2022-01-21 17-23-53.png
          Screenshot from 2022-01-21 17-23-53.png
          206 kB
        8. wt-8707-macro.diff
          16 kB


            People

              siddhartha.mahajan@mongodb.com Sid Mahajan
              dianna.hohensee@mongodb.com Dianna Hohensee
              Keith Bostic (Inactive), Sean Watt, Sulabh Mahajan
              Votes: 0
              Watchers: 26
