Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-7106

Increase how often delta encoding is used for history store records

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • WT10.0.1, 4.4.7, 5.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels:
    • 13
    • Storage - Ra 2021-04-05, Storage - Ra 2021-04-19, Storage - Ra 2021-05-03, Storage - Ra 2021-05-17

      The problem descrption here is:

      Heavy update workloads with a large time window reduce locality in the history store

      We've seen a pair of tickets recently (WT-6924 and a recent HELP ticket) where traffic to the history store increases over time when there are a lot of updates and the durable timestamp is pinned.  I.e., when we keep adding things to the history store without aging older things out.

      This ticket is a placeholder to capture the fact that this might be an issue.  I have a suspicion of why this is happening (below).  But I think we'll need more data on use cases before deciding if this is actually a problem.  We can't optimize for everything.  

      Here's my theory about what's happening.  When there are a lot of updates, there is a good chance that when WiredTiger reconciles a dirty page it will have updates to multiple keys.  and in this scenario WT will add the older values to the history store.  Ideally the keys we need to update in the HS will also fall on just one or two pages—since the keys in the history store are typically sorted in the same order as in the active btree.  

      But over time, more and more updates accumulate in the history store (because we're not moving the durable timestamp).  So the amount of space required for each key in the history store goes up.  As a result, the number of unique keys stored on each page in the history store file goes down.  This means that as the history store grows, reconciliation will need more distinct pages from the history store to update the same set of keys.  

      To make this concrete, suppose every update requires 100 bytes in the history store, pages are 4KB, and we have a set of sequentially numbered keys from 1 to N. If every key has a single old value stored in the history store, then a page of the history store hold information about 40 keys.  So if reconciliation needs to update keys 5, 15, 25, and 35, there is a good chance they are all on the same history store page. If we keep updating all of the keys until there are 10 old values for each key in the history store, then we will have 1000 bytes per key in the history store, and each history store page will have information about only 4 keys.  At this point, the same reconciliation that updates keys 5, 15, 25, and 35 will likely need 4 different pages from the history store. 

        1. mod100_all.png
          84 kB
          Alison Felizzi
        2. mod15_all.png
          81 kB
          Alison Felizzi
        3. mod15_all-1.png
          81 kB
          Alison Felizzi
        4. mod150_all.png
          85 kB
          Alison Felizzi
        5. mod200_all.png
          82 kB
          Alison Felizzi
        6. mod25_all.png
          81 kB
          Alison Felizzi
        7. mod50_all.png
          81 kB
          Alison Felizzi
        8. moddevelop_all.png
          87 kB
          Alison Felizzi
        9. no_reverse_modifies.png
          224 kB
          Alison Felizzi
        10. overall_stage_delta.png
          182 kB
          Alison Felizzi
        11. overall_stage_nodelta.png
          194 kB
          Alison Felizzi
        12. overall_stage_nodelta-1.png
          194 kB
          Alison Felizzi
        13. reverse_modifies.png
          555 kB
          Alison Felizzi
        14. test_reverse.py
          6 kB
          Etienne Petrel
        15. tx2_read_nodeleta_2.png
          112 kB
          Alison Felizzi
        16. txn_read_delta_3.png
          52 kB
          Alison Felizzi
        17. txn_read_delta.png
          122 kB
          Alison Felizzi
        18. txn_read_nodelta_3.png
          54 kB
          Alison Felizzi
        19. write_stage_delta.png
          206 kB
          Alison Felizzi
        20. write_stage_nodelta.png
          191 kB
          Alison Felizzi

            Assignee:
            etienne.petrel@mongodb.com Etienne Petrel
            Reporter:
            keith.smith@mongodb.com Keith Smith
            Votes:
            0 Vote for this issue
            Watchers:
            15 Start watching this issue

              Created:
              Updated:
              Resolved: