[SERVER-72481] Use relaxed atomic operations for LockStats Created: 03/Jan/23  Updated: 06/Feb/24

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Daniel Gomez Ferro Assignee: Daniel Gomez Ferro
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screenshot 2023-01-03 at 16.53.16.png    
Issue Links:
Depends
Assigned Teams:
Storage Execution
Sprint: Execution Team 2023-03-06, Execution Team 2023-03-20, Execution Team 2023-04-03, Execution Team 2023-04-17, Execution Team 2023-05-01, Execution Team 2023-05-29, Execution Team 2023-06-12, Execution EMEA Team 2023-06-26, Execution EMEA Team 2023-07-10, Execution EMEA Team 2023-07-24, Execution EMEA Team 2023-08-07, Execution EMEA Team 2023-08-21, Execution EMEA Team 2023-09-04, CAR Team 2023-11-13, CAR Team 2023-11-27, CAR Team 2023-12-11, CAR Team 2023-12-25, CAR Team 2024-01-08, CAR Team 2024-01-22, CAR Team 2024-02-05, CAR Team 2024-02-19
Participants:
Linked BF Score: 0

 Description   

SERVER-60049 switched from SingleThreadedLockStats to AtomicLockStats to avoid data races detected by TSAN. At the time a perf run didn't find any regressions, but later on at least one BF found it to be a contributing factor.

The stat counters are only updated by a single thread, so we don't need such strong consistency guarantees. Using relaxed loads, increments and stores we can get the same CPU instructions without creating data races at the C++ memory level: https://godbolt.org/z/cPGT5ex5x

n.fetch_add(1);
// vs
n.store(n.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);

I ran a quick & dirty perf patch and there's some improvements and some regressions (can't really explain the regressions, worse instruction cache usage? GCC seems to emit more instructions than Clang in the relaxed case).

 



 Comments   
Comment by Jordi Olivares Provencio [ 22/Feb/23 ]

A pattern I see throughout the code is of one thread that owns the actual variable and multiple threads that can read it and allow relaxed reads.

I do wonder if adding a template class that "fakes" being the same as the underlying type could solve this in a more general fashion. I've filed SERVER-74265 to see if this can garner more merit.

Generated at Thu Feb 08 06:21:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.