[SERVER-54570] Optimize costly atomic increment in Mutex uncontended stats count Created: 15/Feb/21 Updated: 06/Dec/22 Resolved: 23/Feb/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Andrew Shuvalov (Inactive) | Assignee: | Backlog - Service Architecture |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Service Arch |
| Participants: |
| Description |
|
The atomic integer increment is not cheap in a multi-core scenario. The reason is that an atomic increment is a fully consistent RMW operation, which requires one core to obtain exclusive ownership of the cache line in its L1 cache. Details here: https://fgiesen.wordpress.com/2014/08/18/atomics-and-contention/ The benchmarks show that while an uncontended atomic increment on a single thread costs about 40 ns, doing it concurrently from 4 cores costs about 100 ns. Benchmarks: https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html

One might think that switching to relaxed memory order would help (i.e. `var.fetch_add(1, std::memory_order_relaxed);`), but this is simply not true. Relaxed order only relaxes ordering with respect to other memory locations; this particular atomic increment is still a full RMW and the performance penalty is the same, because the cache line still has to be invalidated on all other cores.

The simplest fix is to sacrifice consistency and do a 2-step modification (a separate load and store) with relaxed order: `auto cur = counter.load(std::memory_order_relaxed);` followed by `counter.store(cur + 1, std::memory_order_relaxed);` (see the sketch below). We would sacrifice a bit of correctness for probably a 5x to 10x performance improvement, or over 100 ns saved per Mutex lock. The code I'm talking about is here: We can probably keep the fully consistent increment when the lock is contended (`_onContendedLock()`), because we incur a substantial performance penalty there anyway.

Variant 2. The book "Is Parallel Programming Hard, And, If So, What Can You Do About It?" devotes the whole of chapter 5 to global counters, and shows that the worst-case cost can reach milliseconds. The fastest solution it proposes, in chapter 5.2.4, is an eventually consistent thread-local array of counters. In my opinion thread-local counters would be too cumbersome to implement because of all the annoying registration code and the potentially huge number of counters, but there is a much simpler alternative (not discussed online, but obvious): a sharded counter with random bucketing. Picking the bucket must be extremely cheap; in my opinion, use the TSC (time stamp counter). The portable implementation example here. Then just calculate `TSC % nBuckets` to get a random bucket each time. A periodic process (or just a reader) should sweep all accumulated counters into a single integer and reset the counters back to zero. A reader can always check the first counter: if its value is very small (< 10), skip the sweep and accept a slightly stale result instead (see the sharded counter sketch below). |
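For illustration, a minimal sketch of the 2-step relaxed increment described above, with the fully consistent RMW kept for the contended path. The names (`ApproxCounter`, `incrementUncontended`, `incrementContended`) are hypothetical and are not the actual server code.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for a Mutex stats counter; not the actual server code.
class ApproxCounter {
public:
    // Uncontended path: relaxed load followed by relaxed store. On x86 this
    // compiles to plain loads/stores (no locked RMW), so the core does not
    // stall waiting for exclusive ownership of the cache line. Concurrent
    // increments may occasionally be lost -- that is the correctness we
    // trade away.
    void incrementUncontended() {
        auto cur = _value.load(std::memory_order_relaxed);
        _value.store(cur + 1, std::memory_order_relaxed);
    }

    // Contended path (analogue of _onContendedLock): keep the atomic RMW so
    // the count stays exact, since the caller is already paying a substantial
    // contention penalty anyway.
    void incrementContended() {
        _value.fetch_add(1, std::memory_order_relaxed);
    }

    int64_t get() const {
        return _value.load(std::memory_order_relaxed);
    }

private:
    std::atomic<int64_t> _value{0};
};
```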
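And a minimal sketch of Variant 2, the sharded counter with random bucketing via the TSC. This is an assumption-laden illustration: the names (`ShardedCounter`, `kNumBuckets`) are made up, `__rdtsc()` is the x86 intrinsic (other platforms would need a different cheap time source), and the "skip the sweep if the first bucket is small" heuristic is one possible reading of the description.

```cpp
#include <array>
#include <atomic>
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// Hypothetical sharded counter; eventually consistent, not the actual server code.
class ShardedCounter {
public:
    static constexpr size_t kNumBuckets = 16;

    // Pick a bucket from the time stamp counter and bump it. Different cores
    // tend to land in different buckets, so cache-line ping-pong is spread out.
    void increment() {
        const size_t bucket = static_cast<size_t>(__rdtsc()) % kNumBuckets;
        _buckets[bucket].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Reader: sweep all buckets into the first one and reset the rest. If the
    // first bucket is still small, skip the sweep and accept a slightly stale
    // result. Increments racing with a sweep may be lost; the counter is only
    // eventually consistent by design.
    int64_t read() {
        int64_t first = _buckets[0].value.load(std::memory_order_relaxed);
        if (first < 10)
            return first;
        int64_t total = first;
        for (size_t i = 1; i < kNumBuckets; ++i)
            total += _buckets[i].value.exchange(0, std::memory_order_relaxed);
        _buckets[0].value.store(total, std::memory_order_relaxed);
        return total;
    }

private:
    // Pad each bucket to its own cache line to avoid false sharing.
    struct alignas(64) Bucket {
        std::atomic<int64_t> value{0};
    };
    std::array<Bucket, kNumBuckets> _buckets{};
};
```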
| Comments |
| Comment by Andrew Shuvalov (Inactive) [ 16/Feb/21 ] |
|
I've realized that different Mutexes do not share the same stat counter, so what I described above is not necessary at this stage. I would actually like to have a global lock contention monitor as part of a better feedback loop on our thread pool sizes, but that is outside the scope of a single ticket. I plan to give a presentation about thread pool optimization this Friday, 2/19 at 10 am, where I will touch on this topic. It looks like even though we track per-mutex lock contention, we don't do anything with it; I don't see any code invoking `getDiagnosticListenerState()`. It looks like an unfinished effort in the right direction. Please read the ticket description just for your information, feel free to close it as "Won't Fix", and please check out my presentation on 2/19. |