WiredTiger / WT-10572

Allow column stores to grow in test/format with predictable replay

    • Type: Improvement
    • Resolution: Won't Do
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Not Applicable

      Predictable replay was added to test/format in WT-9915.  Column store inserts are a little complicated with predictable replay, so they were left out.  This ticket explores adding that capability.  To be clear, there is some testing of column stores with predictable replay (both variable-length and fixed-length column stores); we just can't have inserts.

      When we insert a record into a column store, we expand the key space by one.  In a normal test/format run, all the other threads immediately see that, and if they are doing an operation on a column store, they use the new upper bound when randomly selecting a key to operate on.  Obviously, depending on the timing of when the column store insert completes in the first thread, there's variability in the upper bound of the column store as seen by the other threads.  Predictable replay needs to remove variability.

      The two strategies we might use to remove variability for predictable replay are: 1) make it impossible for another thread to act on the decision or its timing; 2) wait for any other threads that might be acting on something I'm relying on to reach a known and stable state.  Generally, predictable replay in format uses the first strategy.  We have the concept of lanes, which stripes the set of timestamps (and the keys operated on at those timestamps) into N lanes.  Since only one thread can "own" a lane at a time, this prevents a key from being operated upon simultaneously by multiple threads.
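The lane idea above can be sketched in a few lines.  This is a hypothetical illustration, not format's actual implementation; the lane count of 1024 is taken from the example later in this ticket, and the rule is that the low bits of the timestamp and of the key must agree:

```python
# Hypothetical sketch of the "lanes" scheme: with N lanes, a commit
# timestamp belongs to lane (timestamp % N), and the key operated on at
# that timestamp must have matching low bits. Only one thread owns a
# lane at a time, so no key is touched by two threads simultaneously.
N_LANES = 1024  # assumed lane count, matching the example in this ticket

def lane_for_timestamp(ts):
    return ts % N_LANES

def lane_for_key(key):
    return key % N_LANES

def operation_allowed(ts, key):
    # A thread may operate on `key` at `ts` only when the lanes line up.
    return lane_for_timestamp(ts) == lane_for_key(key)
```

For instance, key 10240 (lane 0) can be operated on at timestamp 1024000 (also lane 0), but not at timestamp 1024100 (lane 100).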

      ASIDE: The second strategy, since it relies on waiting, creates difficulties.  Here's why.  We already have multiple worker threads committing at timestamps, and their steady progression allows the stable timestamp to move forward.  If the stable timestamp gets bogged down, the read timestamps used by the worker threads need to be further in the past.  That makes collisions that cause rollbacks more likely.  Add another reason for threads to wait on each other, and we may get into a state where threads can only make very slow progress.

      So, here's the germ of a solution by way of an example.  Let's say we have 1024 lanes, we have a single column store table X configured for 10% inserts, and we have 10 tables.  Then we'll want to do an insert into X 1% of the time.  Assume we continue to abide by the current stipulations of "one can only work in a lane number defined by the lower bits of the timestamp" and "the key number's low bits match the lane number".  Now suppose that the last insert into X was key 10240 (a multiple of 1024, so done in lane 0).  For simplicity we'll also say the timestamp for that operation was 1024000.  We'll want to do the next insert after 100 operations (TS=1024100), but that's in the wrong lane.  Because this is a column store, the next insert must be at key 10241, which is lane 1.  So when we get to TS=1024100, we change the current TS to (1024001 + 1024).  That puts us in the right lane (lane 1) and we can insert at key 10241.  It's perfectly legal to advance the timestamp by as much as we like, as long as we do so in a predictable manner.
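The timestamp jump in that example can be computed deterministically: given the current timestamp and the key the next insert must land on, advance to the smallest later timestamp whose lane matches the key's lane.  A minimal sketch, assuming the 1024-lane setup above (the function name is illustrative, not from format):

```python
# Sketch of the timestamp jump: find the smallest timestamp strictly
# greater than current_ts whose lane (low bits) matches the lane of the
# key the next column-store insert must use.
N_LANES = 1024

def next_ts_for_key(current_ts, key):
    target_lane = key % N_LANES
    # Start from the candidate in the target lane within the current
    # "stripe" of timestamps, then step forward a full stripe at a time
    # until we pass current_ts.
    candidate = (current_ts // N_LANES) * N_LANES + target_lane
    while candidate <= current_ts:
        candidate += N_LANES
    return candidate
```

With the numbers from the example, `next_ts_for_key(1024100, 10241)` yields 1024001 + 1024 = 1025025, which is in lane 1 as required.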

      So this could "solve" the basic problem by jumping the current timestamp periodically, but there are some issues.  This works with a single table, but what if there are multiple tables with different percentages?  What if there is a single table, but the insert percentage works out to, say, 53%?  The simple algorithm (insert 1 in K) breaks down.  Can it be done without lots of complexity?  Another problem: if we have (in the example) a string of 100 operations, then the timestamp jumps by (1025 - 100), then 100 more operations, etc.  Then we have clustering around adjacent lanes, and we'll have some potentially weird clustering in the operations on the key space.  It becomes less of a random kind of workload.  Also, in the current algorithm, we have an insert exactly once every N operations.  Can we get that "randomized" and still predictable, so it varies but averages out to 1 in N?
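One possible answer to that last question is to make the insert decision a deterministic function of the timestamp: hash the timestamp with a fixed seed and insert when the hash falls below the desired rate.  This is a sketch of the idea only (format has its own RNG machinery; the seed and function here are invented for illustration), but it shows how the cadence can vary while remaining fully replayable and averaging out to the configured rate:

```python
import hashlib

# Deterministic "coin flip" per timestamp: same seed and timestamp always
# give the same answer, so a replay makes identical insert decisions,
# yet the cadence looks irregular and averages to numer/denom.
def should_insert(ts, numer, denom, seed=0x5EED):  # seed is hypothetical
    digest = hashlib.sha256(f"{seed}:{ts}".encode()).digest()
    value = int.from_bytes(digest[:8], "big")
    return (value % denom) < numer
```

Over a long run of timestamps the insert fraction converges to numer/denom (1% in the ticket's example), while any single replay with the same seed reproduces exactly the same decisions.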

      Note: there's (at least) one more way to attack this, using strategy #2.  Maybe we don't wait for the right lane when inserting into a column store.  We insert (we might have to wait in case there's another insert in front of us), because we want our insert to be at a particular key, at a particular timestamp.  And we don't "expose" the fact that the column store has grown to other threads that might be modifying/reading/updating keys: there's a delay of K timestamps before the new key can be used (i.e. before it is factored into the "current upper bound" key number for that column store).  We still may have to wait - another thread may be more than K ahead of the thread that is inserting, so it may need to wait for the inserter to finish if it happens to want to use that key.
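The delayed-exposure idea can be sketched as a pure function of the insert history and the current timestamp: a key inserted at timestamp T only counts toward the visible upper bound once the current timestamp reaches T + K.  Everything here is illustrative (the value of K, the function, and the data layout are all assumptions):

```python
K = 4096  # hypothetical exposure delay, in timestamp units

def visible_upper_bound(inserts, current_ts):
    """Compute the key-space upper bound other threads may use.

    inserts: list of (insert_ts, new_max_key) pairs, one per column-store
    insert.  An insert at insert_ts becomes visible only once
    insert_ts + K <= current_ts, so every thread computing the bound at
    the same current_ts sees the same value, with no timing variability.
    """
    bound = 0
    for insert_ts, new_max_key in inserts:
        if insert_ts + K <= current_ts:
            bound = max(bound, new_max_key)
    return bound
```

Because the bound depends only on (inserts, current_ts) and not on when the inserting thread happened to finish, a replay sees identical upper bounds at every timestamp.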

            Assignee:
            Backlog - Storage Engines Team
            Reporter:
            Donald Anderson (donald.anderson@mongodb.com)
            Votes:
            0
            Watchers:
            2
