Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-7106

Increase how often delta encoding is used for history store records

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • WT10.0.1, 4.4.7, 5.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels:
    • 13
    • Storage - Ra 2021-04-05, Storage - Ra 2021-04-19, Storage - Ra 2021-05-03, Storage - Ra 2021-05-17

      The problem descrption here is:

      Heavy update workloads with a large time window reduce locality in the history store

      We've seen a pair of tickets recently (WT-6924 and a recent HELP ticket) where traffic to the history store increases over time when there are a lot of updates and the durable timestamp is pinned.  I.e., when we keep adding things to the history store without aging older things out.

      This ticket is a placeholder to capture the fact that this might be an issue.  I have a suspicion of why this is happening (below).  But I think we'll need more data on use cases before deciding if this is actually a problem.  We can't optimize for everything.  

      Here's my theory about what's happening.  When there are a lot of updates, there is a good chance that when WiredTiger reconciles a dirty page it will have updates to multiple keys.  and in this scenario WT will add the older values to the history store.  Ideally the keys we need to update in the HS will also fall on just one or two pages—since the keys in the history store are typically sorted in the same order as in the active btree.  

      But over time, more and more updates accumulate in the history store (because we're not moving the durable timestamp).  So the amount of space required for each key in the history store goes up.  As a result, the number of unique keys stored on each page in the history store file goes down.  This means that as the history store grows, reconciliation will need more distinct pages from the history store to update the same set of keys.  

      To make this concrete, suppose every update requires 100 bytes in the history store, pages are 4KB, and we have a set of sequentially numbered keys from 1 to N. If every key has a single old value stored in the history store, then a page of the history store hold information about 40 keys.  So if reconciliation needs to update keys 5, 15, 25, and 35, there is a good chance they are all on the same history store page. If we keep updating all of the keys until there are 10 old values for each key in the history store, then we will have 1000 bytes per key in the history store, and each history store page will have information about only 4 keys.  At this point, the same reconciliation that updates keys 5, 15, 25, and 35 will likely need 4 different pages from the history store. 

        1. mod100_all.png
          mod100_all.png
          84 kB
        2. mod15_all.png
          mod15_all.png
          81 kB
        3. mod15_all-1.png
          mod15_all-1.png
          81 kB
        4. mod150_all.png
          mod150_all.png
          85 kB
        5. mod200_all.png
          mod200_all.png
          82 kB
        6. mod25_all.png
          mod25_all.png
          81 kB
        7. mod50_all.png
          mod50_all.png
          81 kB
        8. moddevelop_all.png
          moddevelop_all.png
          87 kB
        9. no_reverse_modifies.png
          no_reverse_modifies.png
          224 kB
        10. overall_stage_delta.png
          overall_stage_delta.png
          182 kB
        11. overall_stage_nodelta.png
          overall_stage_nodelta.png
          194 kB
        12. overall_stage_nodelta-1.png
          overall_stage_nodelta-1.png
          194 kB
        13. reverse_modifies.png
          reverse_modifies.png
          555 kB
        14. test_reverse.py
          6 kB
        15. tx2_read_nodeleta_2.png
          tx2_read_nodeleta_2.png
          112 kB
        16. txn_read_delta_3.png
          txn_read_delta_3.png
          52 kB
        17. txn_read_delta.png
          txn_read_delta.png
          122 kB
        18. txn_read_nodelta_3.png
          txn_read_nodelta_3.png
          54 kB
        19. write_stage_delta.png
          write_stage_delta.png
          206 kB
        20. write_stage_nodelta.png
          write_stage_nodelta.png
          191 kB

            Assignee:
            etienne.petrel@mongodb.com Etienne Petrel
            Reporter:
            keith.smith@mongodb.com Keith Smith
            Votes:
            0 Vote for this issue
            Watchers:
            15 Start watching this issue

              Created:
              Updated:
              Resolved: