Core Server / SERVER-54570

Optimize costly atomic increment in Mutex uncontended stats count


Details

    • Type: Improvement
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Component: Service Arch

    Description

      The atomic integer increment is not cheap in a multi-core scenario. The reason is that an atomic increment is a fully consistent RMW (read-modify-write) operation, which requires one core to obtain exclusive ownership of the cache line in its L1 cache. Details here: https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/

      Benchmarks show that while an uncontended atomic increment on a single thread costs about 40 ns, doing it concurrently on 4 cores costs about 100 ns. Benchmarks: https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html

      One might think that switching to relaxed memory order would help (e.g. var.fetch_add(1, std::memory_order_relaxed);), but this is simply not true. Relaxed order only relaxes ordering relative to other memory locations; this particular atomic increment is still a full RMW and the performance penalty is the same, because the cache line must still be invalidated on all other cores.

      The simplest fix is to sacrifice consistency and do a two-step, non-atomic modification with relaxed order:

      auto cur = counter.load(std::memory_order_relaxed);
      counter.store(cur + 1, std::memory_order_relaxed);

      We sacrifice a bit of correctness for a probable 5x to 10x performance improvement, or over 100 ns saved per Mutex lock.
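The two-step increment above could be wrapped in a small helper; the sketch below is illustrative (the class name `ApproximateCounter` is not from the MongoDB codebase) and shows the trade-off explicitly:

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical sketch of the proposed two-step increment: a plain
// load + store with relaxed ordering instead of a fetch_add RMW.
// Concurrent increments may occasionally be lost, but no core has to
// take exclusive ownership of the cache line via a locked RMW.
class ApproximateCounter {
public:
    void increment() {
        // Not atomic as a whole: two racing threads can both read the
        // same value, and one increment is dropped. That is the
        // accepted trade-off for a stats counter.
        auto cur = _value.load(std::memory_order_relaxed);
        _value.store(cur + 1, std::memory_order_relaxed);
    }

    std::uint64_t get() const {
        return _value.load(std::memory_order_relaxed);
    }

private:
    std::atomic<std::uint64_t> _value{0};
};
```

Single-threaded, the counter is still exact; only concurrent increments can race.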

      The code I'm talking about is here:
      https://github.com/mongodb/mongo/blob/0fc5b56b12dbdc248556fdfa9da2f44479eac699/src/mongo/platform/mutex.cpp#L112

      We can probably keep the fully consistent increment when the lock is contended (_onContendedLock()), because we incur a substantial performance penalty there anyway.
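The split between the two paths could look like the following sketch (the struct and method names are illustrative, not the actual mutex.cpp API):

```cpp
#include <atomic>
#include <cstdint>

// Sketch of the suggested split: the uncontended fast path uses the
// cheap racy increment, while the contended path keeps the fully
// consistent fetch_add, since a contended lock already pays a far
// larger penalty than ~100 ns.
struct MutexStats {
    std::atomic<std::uint64_t> uncontendedAcquires{0};
    std::atomic<std::uint64_t> contendedAcquires{0};

    void onQuickLock() {
        // Racy two-step increment: cheap, may occasionally lose a count.
        auto cur = uncontendedAcquires.load(std::memory_order_relaxed);
        uncontendedAcquires.store(cur + 1, std::memory_order_relaxed);
    }

    void onContendedLock() {
        // Full RMW: exact, but we are already about to block, so the
        // extra cost of the locked instruction is negligible here.
        contendedAcquires.fetch_add(1, std::memory_order_relaxed);
    }
};
```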

      Variant 2. The book Is Parallel Programming Hard, And, If So, What Can You Do About It? devotes the whole of Chapter 5 to global counters, and shows that the worst-case cost can reach milliseconds. The fastest solution it proposes, in chapter 5.2.4, is an eventually consistent thread-local array of counters.

      In my opinion, thread-local counters could be too cumbersome to implement because of all the annoying registration code and the potentially huge number of counters, but there is a much simpler alternative (not discussed online, but obvious): a sharded counter with random bucketing. The bucket pick should be made extremely cheap; in my opinion, use the TSC (time stamp counter). A portable implementation example is here. Then just calculate (TSC % nBuckets) to get a random bucket each time. A periodic process (or just a reader) should sweep all accumulated counters into a single integer and reset all counters back to zero. A reader can also check the first counter: if the value is very small (<10), do not sweep and accept a slightly stale result instead.
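A minimal sketch of the sharded counter with random bucketing described above (all names are illustrative). For portability this sketch uses steady_clock ticks as a stand-in for the TSC; a real implementation would read the time stamp counter directly, e.g. via __rdtsc() on x86, which is cheaper:

```cpp
#include <array>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>

template <std::size_t NBuckets = 16>
class ShardedCounter {
public:
    void increment() {
        // Cheap pseudo-random bucket pick: (tick % NBuckets). A real
        // implementation would use the TSC instead of steady_clock.
        auto tick = static_cast<std::uint64_t>(
            std::chrono::steady_clock::now().time_since_epoch().count());
        auto& bucket = _buckets[tick % NBuckets];
        // Racy two-step relaxed increment within the chosen bucket.
        auto cur = bucket.load(std::memory_order_relaxed);
        bucket.store(cur + 1, std::memory_order_relaxed);
    }

    // Reader-side sweep: fold all buckets into the first one, zero the
    // rest, and return the (eventually consistent) total.
    std::uint64_t sweep() {
        std::uint64_t total = _buckets[0].load(std::memory_order_relaxed);
        for (std::size_t i = 1; i < NBuckets; ++i) {
            total += _buckets[i].exchange(0, std::memory_order_relaxed);
        }
        _buckets[0].store(total, std::memory_order_relaxed);
        return total;
    }

private:
    std::array<std::atomic<std::uint64_t>, NBuckets> _buckets{};
};
```

Because increments spread across NBuckets cache lines, contention on any one line drops roughly by the bucket count; the sweep trades a little staleness for that.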

      People

        Assignee: backlog-server-servicearch (Backlog - Service Architecture)
        Reporter: andrew.shuvalov@mongodb.com (Andrew Shuvalov, Inactive)
        Votes: 0
        Watchers: 5