Use extensible address cookies for disaggregated storage and beyond

      An address cookie is currently used in WiredTiger to encapsulate a reference to a chunk of data on disk. It contains a triple: offset, size and checksum. These are each encoded in a "packed" format using unsigned integers. To get the actual offset in the file, 1 is added to the offset stored in the cookie and the result is multiplied by the allocation size (e.g. 4K). Because of that, a stored offset of zero is legal.
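
      The offset arithmetic described above can be sketched as follows. This is a minimal illustration, not WiredTiger's actual code: the function names are hypothetical, and the allocation size is passed in explicitly rather than read from a block manager handle.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert the offset value stored in a cookie into a
 * byte offset in the file. A stored value of 0 maps to one allocation unit
 * into the file, which is why 0 is a legal stored offset.
 */
static uint64_t
cookie_offset_to_file_offset(uint64_t stored, uint32_t allocsize)
{
    assert(allocsize != 0);
    return ((stored + 1) * (uint64_t)allocsize);
}

/*
 * Hypothetical inverse: convert a file offset (a non-zero multiple of the
 * allocation size) into the value that would be stored in the cookie.
 */
static uint64_t
file_offset_to_cookie_offset(uint64_t file_offset, uint32_t allocsize)
{
    assert(allocsize != 0);
    assert(file_offset >= allocsize && file_offset % allocsize == 0);
    return (file_offset / allocsize - 1);
}
```

      With a 4K allocation size, a stored offset of 0 refers to byte 4096 of the file, and a stored offset of 2 refers to byte 12288.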

      In a number of projects (including tiered storage, and the project to track size and count of items in a btree) we've proposed extending the address cookies in ad hoc ways. Tiered storage appended its extra "objectid" to the triplet of data, making it a quartet. I don't know the mechanism that KeithB used when prototyping the size tracking work. As we are contemplating modifying the address cookie for disaggregated storage, it would be nice to have an extensible, formalized approach. This would help tools that are used to decode and debug, and allow us to mix multiple features that each need a different cookie extension. There's currently no good way to do that.

      As devil's advocate, I'd point out that disaggregated storage, at least, doesn't need compatibility with the old cookie format. Tables are known to be served by DS (or not) and a completely different block manager with a different cookie encoder is employed in the current prototype. Still, I think there are strong benefits to having a unified approach, especially one that allows for future extensions.

      The idea is simple - the first item in the current triple, the offset, is currently a packed unsigned integer. We change it to a packed signed integer. This won't impact any current uses - a negative number interpreted as unsigned is a number of at least 2^63, which is then multiplied by the allocation size, so no real file offset comes anywhere near it. A negative number in the offset position thus marks an extended cookie.
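
      The reasoning about negative values can be checked with a small sketch. It only illustrates the two's-complement argument on 64-bit values; it deliberately ignores how the packed byte encoding represents the number, and both function names are hypothetical.

```c
#include <stdint.h>

/*
 * Reinterpret the signed value found in the offset position as unsigned.
 * Converting a negative int64_t to uint64_t is well defined in C (it wraps
 * modulo 2^64), so every negative input yields a result of at least 2^63.
 */
static uint64_t
offset_slot_as_unsigned(int64_t slot)
{
    return ((uint64_t)slot);
}

/* Under the proposal, a negative value in the offset position means "extended". */
static int
offset_slot_is_extended(int64_t slot)
{
    return (slot < 0);
}
```

      For example, -1 reinterpreted as unsigned is 2^64 - 1, far beyond any offset a real file could hold once it is multiplied by the allocation size.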

      The proposal is to have a set of flags, for example:

      #define WT_COOKIE_EXT_DS_REF     0x0001 /* disaggregated storage: (pagenum, checkpoint, lsn, checksum) replaces the triplet */
      #define WT_COOKIE_EXT_TRACK_SIZE 0x0002 /* data size, key count are appended */
      #define WT_COOKIE_TIERED_STORAGE 0x0004 /* append objectid */
      #define WT_COOKIE_NEW_PROJECT    0x0008 /* use great new checksum algorithm */

      The flags together form an integer. We simply negate that number and store it using the packed encoding. Following that, we emit the usual triplet (unless the flags tell us otherwise). Following the triplet, we emit any extra "fields" using some agreed ordering, for example, data for lower flag bits appears first.
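
      The emit order described above might look like the sketch below. It is an illustration under stated assumptions, not a proposed implementation: struct cookie and cookie_field_order are hypothetical names, the output is the list of values that would be handed to the packer rather than packed bytes, and the WT_COOKIE_EXT_DS_REF case (which replaces the triplet) is left out to keep it short.

```c
#include <stddef.h>
#include <stdint.h>

#define WT_COOKIE_EXT_TRACK_SIZE 0x0002 /* data size, key count are appended */
#define WT_COOKIE_TIERED_STORAGE 0x0004 /* append objectid */

/* Hypothetical in-memory form of a cookie before packing. */
struct cookie {
    uint32_t flags;                  /* 0 means the existing, non-extended format */
    uint64_t offset, size, checksum; /* the usual triplet */
    uint64_t data_size, key_count;   /* used with WT_COOKIE_EXT_TRACK_SIZE */
    uint64_t objectid;               /* used with WT_COOKIE_TIERED_STORAGE */
};

/*
 * List the values in the order they would be packed and return how many
 * were written. The caller must supply room for at least 7 values.
 */
static size_t
cookie_field_order(const struct cookie *c, int64_t *out)
{
    size_t n = 0;

    /* A negative leading value marks the extended format. */
    if (c->flags != 0)
        out[n++] = -(int64_t)c->flags;

    /* The usual triplet follows. */
    out[n++] = (int64_t)c->offset;
    out[n++] = (int64_t)c->size;
    out[n++] = (int64_t)c->checksum;

    /* Extra fields, lower flag bits first. */
    if (c->flags & WT_COOKIE_EXT_TRACK_SIZE) {
        out[n++] = (int64_t)c->data_size;
        out[n++] = (int64_t)c->key_count;
    }
    if (c->flags & WT_COOKIE_TIERED_STORAGE)
        out[n++] = (int64_t)c->objectid;

    return (n);
}
```

      With both WT_COOKIE_EXT_TRACK_SIZE and WT_COOKIE_TIERED_STORAGE set, the leading value is -6, followed by the triplet, then data size and key count, then the objectid. With no flags set, only the triplet is produced, matching today's format.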

      One property of this is that smaller flag numbers are more likely to be stored compactly.  I think down to -15 (four bits) can be stored in a single byte for example.  Up to 12 bits can be packed into two bytes.  So we should consider this when assigning bit values.

      Another property is that it is easy to determine if a packed array of bytes is positive or negative: just look at the "sign bit" of the first byte. The packing algorithm guarantees this encoding, and it allows for a cheap way to determine if a cookie is in the extended format.
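
      A check along those lines could be as small as the sketch below. It assumes the "sign bit" is the top bit (0x80) of the first packed byte and that the bit is set for non-negative values and clear for negative ones; that polarity is an assumption to confirm against the packing code, and cookie_is_extended is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: top bit of the first packed byte is set for values >= 0. */
#define COOKIE_NONNEG_BIT 0x80u

/*
 * Return true if the cookie starts with a negative packed value, that is,
 * if it is in the extended format. An empty cookie is never extended.
 */
static bool
cookie_is_extended(const uint8_t *cookie, size_t len)
{
    return (len > 0 && (cookie[0] & COOKIE_NONNEG_BIT) == 0);
}
```

      The test inspects a single byte and needs no unpacking, so tools that decode cookies can branch on the format before parsing anything else.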

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Donald Anderson