Parallel checkpoint

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Checkpoints
    • Storage Engines, Storage Engines - Foundations
    • 13
    • 0

      Currently the checkpoint process is always performed by a single thread, which iterates over the list of dirty btrees and processes them one after another. This is a good candidate for multithreaded execution, since the processing of each btree is almost completely independent and so can be done concurrently.

      This activity was started as part of Skunkworks (two years in a row), so we already have a POC PR that takes the following approach:

      • The main checkpoint thread starts the checkpoint
      • It iterates over all dirty dhandles and puts them into a mutex-protected queue
      • It then signals the worker threads that there are btrees to reconcile
      • Each worker thread that manages to get a btree from the queue processes it and then reports the information needed to update the metadata back to the main thread
      • The main thread updates the metadata accordingly (note that the metadata update must be done as one txn, so all the updates are performed by the main thread)
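
      The steps above can be sketched roughly as follows. This is an illustrative Python model, not the code from the PR: `reconcile`, `worker`, and `parallel_checkpoint` are hypothetical names, and the dictionary returned at the end merely stands in for the single metadata transaction. Python's `queue.Queue` is internally protected by a mutex and wakes blocked consumers on `put`, which plays the role of the mutex-protected queue and the signal to the workers.

```python
import queue
import threading


def reconcile(btree):
    # Hypothetical stand-in for reconciling one dirty btree. It returns the
    # information the main thread would need for the metadata update.
    return {"uri": btree, "checkpoint_info": f"ckpt-of-{btree}"}


def worker(work_q, results_q):
    # Each worker pulls btrees until it sees the None sentinel.
    while True:
        btree = work_q.get()
        if btree is None:
            break
        results_q.put(reconcile(btree))


def parallel_checkpoint(dirty_btrees, num_workers=4):
    work_q = queue.Queue()     # mutex-protected queue of dirty btrees
    results_q = queue.Queue()  # results reported back to the main thread

    threads = [
        threading.Thread(target=worker, args=(work_q, results_q))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()

    # The main thread enqueues every dirty btree; each put wakes a waiting worker.
    for btree in dirty_btrees:
        work_q.put(btree)
    # One sentinel per worker tells it that no more work is coming.
    for _ in threads:
        work_q.put(None)
    for t in threads:
        t.join()

    # Only the main thread touches the metadata, so all updates can be
    # applied together (a dict here models the single metadata txn).
    metadata = {}
    for _ in dirty_btrees:
        result = results_q.get()
        metadata[result["uri"]] = result["checkpoint_info"]
    return metadata
```

      In this model the workers never write shared state directly; they only hand results back, which mirrors the constraint that the metadata update is done by the main thread alone.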

      Since the PR is in an early POC stage, it is far from production quality. I’d like to outline the next steps as I currently see them:

      • The PR is not yet fully green in CI:
        • The PR passes all the checkpoint tests, but fails in some places in CI (mostly in tiered-storage-related tests)
        • Some tests take significantly longer to pass than on develop:
      • Possible implementation/design improvements:
        • Since we always know how many dhandles we have to checkpoint, we can preallocate memory for the shared queue to make it lock-free. However, this approach can have its own performance drawbacks.
        • WT already creates threads for eviction, and with this patch WT creates additional threads for checkpointing. So it is worth considering how all these threads affect system utilization and whether they cause system under- or oversubscription in some cases. One option is to create a shared thread pool for both eviction and checkpoint.
        • If a checkpoint doesn't have enough work to do, running it in parallel can be significantly slower than its sequential version. So it would be useful to support both options, along with some mechanism to decide whether a given checkpoint should be done by a single thread or by multiple threads. Off the top of my head, this heuristic could include the number of dirty Btrees plus the average number of dirty pages per Btree.
        • Address all the other comments in the PR that start with // (to trigger CI warnings) and are marked with ??? for later discussion
      • Even without all these improvements, the PR shows a significant improvement on both x86 and ARM for "checkpoint-stress - Update", but it also causes degradations in some other benchmarks. I didn’t have enough time to determine whether all these degradations are consistent or not, so it’s worth reevaluating the performance.
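
      As a rough illustration of the single- versus multi-threaded heuristic mentioned in the list above, the decision could combine the two inputs as below. This is only a sketch: the function name and both thresholds (`min_btrees`, `min_avg_dirty_pages`) are hypothetical placeholders that would have to be tuned against the benchmarks, and nothing like this exists in the PR yet.

```python
def use_parallel_checkpoint(dirty_pages_per_btree, min_btrees=4, min_avg_dirty_pages=100):
    """Decide whether a checkpoint has enough work to justify worker threads.

    dirty_pages_per_btree: one dirty-page count per dirty btree.
    The default thresholds are placeholders, not measured values.
    """
    num_btrees = len(dirty_pages_per_btree)
    # Too few btrees: there is little to run in parallel.
    if num_btrees < min_btrees:
        return False
    # Many btrees but almost no dirty pages each: the threading overhead
    # would likely dominate the actual work.
    avg_dirty_pages = sum(dirty_pages_per_btree) / num_btrees
    return avg_dirty_pages >= min_avg_dirty_pages
```

      For example, `use_parallel_checkpoint([500] * 8)` selects the parallel path, while `use_parallel_checkpoint([500, 500])` and `use_parallel_checkpoint([1] * 8)` both fall back to the sequential one.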

        1. parallel-checkpoint-v1-arm.pdf
          15.60 MB
          Ivan Kochin
        2. parallel-checkpoint-v1-x86.pdf
          15.67 MB
          Ivan Kochin

            Assignee:
            Ivan Kochin
            Reporter:
            Ivan Kochin