Core Server / SERVER-54570

Optimize costly atomic increment in Mutex uncontended stats count


Details

    • Type: Improvement
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Component: Service Arch

    Description

      The atomic integer increment is not cheap in a multi-core scenario. The reason is that an atomic increment is a fully consistent RMW (read-modify-write) operation, which requires one core to obtain exclusive ownership of the cache line in its L1 cache. Details here: https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/

      Benchmarks show that while an uncontended atomic increment on a single thread costs about 40 ns, doing it concurrently on 4 cores costs about 100 ns. Benchmarks: https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html

      One might think that switching to relaxed memory order would help (e.g. var.fetch_add(1, std::memory_order_relaxed);), but this is simply not true. Relaxed order only relaxes ordering relative to other memory locations; this particular atomic increment is still a full RMW and the performance penalty is the same, because the cache line must still be invalidated on all other cores.

      The simplest fix is to sacrifice consistency and do a two-step, non-atomic modification with relaxed order:

      auto cur = counter.load(std::memory_order_relaxed);
      counter.store(cur + 1, std::memory_order_relaxed);

      We sacrifice a bit of correctness for a probable 5x to 10x performance improvement, or over 100 ns saved per Mutex lock.
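The two-step increment above could be wrapped in a small helper; the sketch below is illustrative (the class name `ApproximateCounter` is not from the MongoDB codebase) and shows the trade-off explicitly:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the proposed two-step increment: a plain
// load + store with relaxed ordering instead of a fetch_add RMW.
// Concurrent increments may occasionally be lost, but no core has to
// take exclusive ownership of the cache line via a locked RMW.
class ApproximateCounter {
public:
    void increment() {
        // Not atomic as a whole: two racing threads can both read the
        // same value, and one increment is dropped. That is the
        // accepted trade-off for a stats counter.
        auto cur = _value.load(std::memory_order_relaxed);
        _value.store(cur + 1, std::memory_order_relaxed);
    }

    std::uint64_t get() const {
        return _value.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> _value{0};
};
```

Single-threaded, the counter is still exact; only concurrent increments can race.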

      The code I'm talking about is here:
      https://github.com/mongodb/mongo/blob/0fc5b56b12dbdc248556fdfa9da2f44479eac699/src/mongo/platform/mutex.cpp#L112

      We can probably keep the fully consistent increment when the lock is contended (_onContendedLock()), because we incur a substantial performance penalty there anyway.
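The split between the two paths could look like the following sketch (the struct and method names are illustrative, not the actual mutex.cpp API):

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the suggested split: the uncontended fast path uses the
// cheap racy increment, while the contended path keeps the fully
// consistent fetch_add, since a contended lock already pays a far
// larger penalty than ~100 ns.
struct MutexStats {
    std::atomic<std::uint64_t> uncontendedAcquires{0};
    std::atomic<std::uint64_t> contendedAcquires{0};

    void onQuickLock() {
        // Racy two-step increment: cheap, may occasionally lose a count.
        auto cur = uncontendedAcquires.load(std::memory_order_relaxed);
        uncontendedAcquires.store(cur + 1, std::memory_order_relaxed);
    }

    void onContendedLock() {
        // Full RMW: exact, but we are already about to block, so the
        // extra cost of the locked instruction is negligible here.
        contendedAcquires.fetch_add(1, std::memory_order_relaxed);
    }
};
```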

      Variant 2. The book Is Parallel Programming Hard, And, If So, What Can You Do About It? devotes the whole of Chapter 5 to global counters, and shows that the worst-case cost can reach milliseconds. The fastest solution it proposes, in chapter 5.2.4, is an eventually consistent thread-local array of counters.

      In my opinion, thread-local counters could be too cumbersome to implement because of all the annoying registration code and the potentially huge number of counters, but there is a much simpler alternative (not discussed online, but obvious): a sharded counter with random bucketing. The bucket pick should be made extremely cheap; in my opinion, use the TSC (time stamp counter). A portable implementation example is here. Then just calculate (TSC % nBuckets) to get a random bucket each time. A periodic process (or just a reader) should sweep all accumulated counters into a single integer and reset all counters back to zero. A reader can also check the first counter: if the value is very small (<10), do not sweep and accept a slightly stale result instead.
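A minimal sketch of the sharded counter with random bucketing described above (all names are illustrative). For portability this sketch uses steady_clock ticks as a stand-in for the TSC; a real implementation would read the time stamp counter directly, e.g. via __rdtsc() on x86, which is cheaper:

```cpp
#include <array>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>

template <std::size_t NBuckets = 16>
class ShardedCounter {
public:
    void increment() {
        // Cheap pseudo-random bucket pick: (tick % NBuckets). A real
        // implementation would use the TSC instead of steady_clock.
        auto tick = static_cast<std::uint64_t>(
            std::chrono::steady_clock::now().time_since_epoch().count());
        auto& bucket = _buckets[tick % NBuckets];
        // Racy two-step relaxed increment within the chosen bucket.
        auto cur = bucket.load(std::memory_order_relaxed);
        bucket.store(cur + 1, std::memory_order_relaxed);
    }

    // Reader-side sweep: fold all buckets into the first one, zero the
    // rest, and return the (eventually consistent) total.
    std::uint64_t sweep() {
        std::uint64_t total = _buckets[0].load(std::memory_order_relaxed);
        for (std::size_t i = 1; i < NBuckets; ++i) {
            total += _buckets[i].exchange(0, std::memory_order_relaxed);
        }
        _buckets[0].store(total, std::memory_order_relaxed);
        return total;
    }

private:
    std::array<std::atomic<std::uint64_t>, NBuckets> _buckets{};
};
```

Because increments spread across NBuckets cache lines, contention on any one line drops roughly by the bucket count; the sweep trades a little staleness for that.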

      People

        Assignee: backlog-server-servicearch (Backlog - Service Architecture)
        Reporter: andrew.shuvalov@mongodb.com (Andrew Shuvalov, Inactive)
        Votes: 0
        Watchers: 5