WiredTiger / WT-8670

Improve snapshot creation performance with high rate of concurrent read/write transactions

    • Type: Improvement
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      For high volumes of short-lived reads and writes, SERVER-55030 showed there's some overhead due to the custom "spin lock" used during snapshot creation: https://github.com/wiredtiger/wiredtiger/blob/322951cb18905cdea2ae3004906c8e8e4e27462a/src/txn/txn.c#L262-L285

      It contends with the transaction ID allocation here: https://github.com/wiredtiger/wiredtiger/blob/ca27d1c1f1c616bf016d0e3854a59b91a5dec908/src/include/txn_inline.h#L1224-L1229

      The performance degradation is around a 4% throughput loss on the 50read50update YCSB workload with secondary reads, using 32 threads on a 16-CPU cluster, compared to serializing all snapshot creations with an explicit mutex.

      My understanding is that if the allocating thread gets scheduled out, it can take a long time to resume because all threads creating a snapshot keep spinning in that loop, consuming all available CPUs.

      My suggestion:

      Instead of relying on WT_PAUSE, add an explicit backoff strategy that schedules out the blocked threads so that the allocating threads can make progress. SERVER-55030 showed that a simple version of this strategy removes the regression for the affected workload.

      Another alternative could be to create the snapshot on a single thread and share it with all concurrent snapshot creations.

            Assignee:
            backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            daniel.gomezferro@mongodb.com Daniel Gomez Ferro
            Votes: 0
            Watchers: 4
