Solution Space We've Brainstormed
The essential point for MDB is that when the MDB execution layer fetches data from WT, it can either point directly at the WT data in the WT cache or make its own copy and point at that, so that the data remains valid if MDB decides to yield instead of fetching the next document. The requirement is that the data cannot move once MDB starts to use it. There are two problems with the copy solution: 1) copying documents on every getNext was tried during MDB 3.0 development and caused a 2-3x increase in read latency; 2) copying a document only just before an MDB yield requires predicting the yield, which is difficult and imprecise, with a worst case of falling back to copying every document read.
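To make the pointer-versus-copy distinction concrete, here is a minimal sketch in Python. It is not MDB or WT code: `Record`, `make_owned`, and the `page` buffer are hypothetical stand-ins, with a `bytearray` playing the WT cache page and a `memoryview` playing the direct pointer.

```python
class Record:
    """A fetched document: a borrowed view into the cache, or an owned copy."""

    def __init__(self, view):
        self._data = view      # memoryview into the (hypothetical) cache page
        self.owned = False

    def make_owned(self):
        """Copy the bytes out of the cache so they survive the page changing."""
        if not self.owned:
            self._data = bytes(self._data)
            self.owned = True

    def value(self):
        return bytes(self._data)


page = bytearray(b"doc-A")            # stand-in for a page in the WT cache

borrowed = Record(memoryview(page))   # points directly at cache memory
saved = Record(memoryview(page))
saved.make_owned()                    # the copy MDB would make before a yield

page[4] = ord("B")                    # the page changes while MDB is yielded

print(borrowed.value())               # sees the modified page contents
print(saved.value())                  # still sees the original document
```

The borrowed record is only safe while the page is pinned and unchanged. The owned record is always safe, but paying for `make_owned` on every document is the per-getNext copy cost described above.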
Geert, Ian, Dianna and Keith B hopped on a Zoom call to talk about this the other night, Keith being newly introduced to the problem. Some thoughts we had:
- Enhance the WT cursor API with a means of expressing that the page the cursor is pinning is in distress (WT wants to split/evict it) and needs to be unpinned.
- Change WT so that writers can make copies of the data page and continue as usual, ignoring the pinned page, which is deleted once it is no longer pinned (new readers would pin newer versions of the page data).
- Data copies in the WT layer for readers, perhaps on demand with a flag, would be fine, but again this runs into the issue of how to predict that a yield is coming, and how to avoid the worst-case scenario of making a copy of every document read (if a yield is predicted but doesn't occur).
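The second idea in the list above (writers copy the page, pinned readers keep the old version) could look roughly like the following toy model. This is only a sketch under our own assumptions: `PageVersion`, `Page`, `pin`, `unpin`, and `write` are hypothetical names, not WT APIs, and the model ignores concurrency entirely.

```python
class PageVersion:
    """One immutable snapshot of a page's contents."""

    def __init__(self, data):
        self.data = tuple(data)
        self.pins = 0          # number of readers currently pinning this version
        self.retired = False   # True once a writer has superseded it
        self.freed = False     # True once it is retired and unpinned


class Page:
    """Hypothetical page whose writers copy instead of modifying in place."""

    def __init__(self, data):
        self.current = PageVersion(data)

    def pin(self):
        version = self.current   # new readers always pin the newest version
        version.pins += 1
        return version

    def unpin(self, version):
        version.pins -= 1
        self._maybe_free(version)

    def write(self, index, value):
        old = self.current
        new_data = list(old.data)            # writer works on its own copy
        new_data[index] = value
        self.current = PageVersion(new_data)
        old.retired = True                   # old version lingers while pinned
        self._maybe_free(old)

    def _maybe_free(self, version):
        if version.retired and version.pins == 0:
            version.freed = True


page = Page(["a", "b"])
reader = page.pin()              # MDB cursor pins the page, then yields
page.write(0, "A")               # writer proceeds without waiting for the reader
new_reader = page.pin()          # later readers see the new version
still_valid = reader.data        # old reader's data has not moved or changed
freed_while_pinned = reader.freed
page.unpin(reader)               # old version can be dropped only now
```

The trade-off this sketch makes visible is that the copy cost moves from readers (every getNext) to writers (only when they hit a pinned page), at the price of old versions occupying cache for as long as a reader stays pinned.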