Identify freeable updates to help eviction

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Cache and Eviction
    • None
    • Storage Engines, Storage Engines - Foundations
    • 29.009
    • None
    • None

      We should consider a way to make freeable update chains more visible to eviction, ftdc, and/or other observability tools. The basic goal is to have a set of metrics, similar to bytes_updates, kept on a page and/or btree.  And they track the update bytes that are actually freeable - that is, they've aged out.  These stats could be collected in a probabilistic manner, and would provide a strong signal to locate those pages that would actually (not theoretically) benefit from eviction  The "aged out" part is the tricky one, a page may be full of updates, even ones that are many minutes old, but they are not freeable if there is a long running transaction that has prevented the oldest_timestamp from moving forward.

      (For some context this was considered as a part of, or immediate followup to, WT-17717 and uses some of the same concepts).

      Start by having 3 numbers - 32 bits is probably enough for each.  One is upper_time that is related to the top 32 bits of the 64 bit timestamps used with WT transaction.  For mongodb, this equates to wall clock seconds.  The other two are byte counts, call one old_updates the other new_updates.  This trio of fields exists on each page (in WT_PAGE_MODIFY). The btree itself only needs to have the old_updates field. The meaning of old_updates is byte counts for updates with timestamps older than upper_time, new_updates is for updates newer than upper_time.

      To make the collection pretty much free and highly performant, we use the countdown session counter from WT-17717 that fires somewhat randomly every 1000 times.  We call it when adding an update, and 999/1000 times it doesn't fire, we spent 3 instructions to find that out and we do nothing special.  1/1000 times, we take the number of bytes we were allocating, multiply by 1000 and atomically add it to new_updates.  Atomically is necessary but there is almost never contention - any other thread adding to that field must also be a 1/1000 winner. Also note, we don't have to multiply by 1000, we can always scale by that amount later.  And I say 1000, because this is 1/1000, we're guessing the total number of bytes in all the updates for the page.

      After the increment, we also look at the upper_time and compare it to upper 32 bits of the global oldest_timestamp. (Aside, for greater performance, lower contention, consider having an additional separate connection field  oldest_upper_timestamp for this use).  If it has changed, then we atomically (+ to be explained later): save the value of new, add the value of new to old, zero out new, advance upper_timestamp.  After we do this, we can also add the saved new (which is now old) to the btree old_updates count.

      So, what we have, for every btree and every modified page, the old_updates field, containing a byte count of freeable bytes.  All those bytes represent bytes older than oldest_timestamp (since our upper_timestamp is a rounded down version of oldest_timestamp.  Yeah, everything is collected 1/1000 and we multiplied by 1000, so it could well be off.  But over time, and with enough updates, things even out and all we're really looking for is ones that are quite large.

      In terms of making this more actionable, we could hold on the btree an array (say sized 20) of pairs (byte ranking, byte key[31]).  When we are updating page stats, we generate some exponential ranking:  i.e. if we have 1k bytes free rank 1, 1.5k rank 2, 2k, rank 3, 3k rank 4....  And we look at the lowest rank in the btree, if we can beat that, we put the key used for the update into the "top 20" list.  We could come up with schemes to make this lock-free, to be able to grow the array (also grow the key size) - using WT generations seems like a natural fit.

      + I handwaved for atomically doing a bunch of things.  I think 2 32 bit counters sitting side by side can be viewed (union) as a 64 bit int, and a CAS instruction can change the entire 64 bits at once.  The upper_timestamp can be advanced separately by the winner of the CAS.  The counts will then be conservative, but again, we don't need exactness.

      Other parts of the design - how the old_updates field gets reset by eviction, are pretty straightforward.

      I know this overlaps some functionality of bytes_update , perhaps both are useful?

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Donald Anderson
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated: