[SERVER-55030] Remove mutexes that serialize secondary and majority read operations Created: 08/Mar/21 Updated: 29/Oct/23 Resolved: 02/Feb/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.3.0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Louis Williams | Assignee: | Daniel Gomez Ferro |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | neweng, newgrad |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.4 |
| Sprint: | Execution Team 2021-11-29, Execution Team 2021-12-13, Execution Team 2021-12-27, Execution Team 2022-01-10, Execution Team 2022-01-24, Execution Team 2022-02-07 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 135 |
| Description |
|
MongoDB serializes the creation of new WiredTiger transactions for local secondary reads (i.e. lastApplied reads) and readConcern: majority reads.
See |
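For illustration, here is a minimal sketch (not the actual server code) of the pattern this ticket proposes to remove: a mutex held around WiredTiger transaction creation for lastApplied and majority reads, so those reads are admitted one at a time. The class and member names below are hypothetical.

```cpp
#include <mutex>

// Hypothetical sketch of the serialization described above; the real logic
// lives in WiredTigerRecoveryUnit, but all names here are illustrative only.
class ReadTransactionSerializer {
public:
    void beginTransactionAtLastAppliedTimestamp() {
        // Only one secondary/majority read at a time may open its WiredTiger
        // transaction; every other reader queues behind this mutex.
        std::lock_guard<std::mutex> lk(_readTxnMutex);
        beginWiredTigerTransaction();
    }

private:
    void beginWiredTigerTransaction() {
        // WT_SESSION::begin_transaction(...) would be called here.
    }

    std::mutex _readTxnMutex;  // the mutex this ticket proposes to remove
};
```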
| Comments |
| Comment by Githook User [ 02/Feb/22 ] |
|
Author: {'name': 'Daniel Gómez Ferro', 'email': 'daniel.gomezferro@mongodb.com', 'username': 'dgomezferro'}
Message:
| Comment by Daniel Gomez Ferro [ 04/Jan/22 ] |
|
The traditional way of dealing with spinlock contention is to add a back-off mechanism. We use WT_PAUSE(); here, but that doesn't make the threads ineligible for scheduling, so other read threads can go back to spinning on the same lock. I ran a quick experiment with a simple linear backoff strategy, making the thread sleep after a few unsuccessful loops on the spinlock: https://spruce.mongodb.com/version/61d2c806850e61439388c2e2/tasks

master with backoff:

master with backoff, no flow control:

There is a very small regression for the 99th latency percentile in some cases, but the results are good. I think this approach is more self-adapting than an explicit semaphore, and performance-wise it is on par.
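As a rough illustration of the backoff idea above: the real change would live in WiredTiger's C code around WT_PAUSE(), but a minimal C++ sketch is shown below. The spin threshold, the sleep granularity, and the isAllocating flag are assumptions for illustration, not the actual implementation.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Spin briefly, then back off linearly: the longer we have waited, the longer
// we sleep, taking the reader off the CPU so the writer can publish its id.
inline void waitForIdAllocation(const std::atomic<bool>& isAllocating) {
    constexpr int kSpinLimit = 100;  // unsuccessful spins before sleeping (assumed value)
    int attempts = 0;
    while (isAllocating.load(std::memory_order_acquire)) {
        ++attempts;
        if (attempts < kSpinLimit) {
            // Stand-in for WT_PAUSE(): cheap, but the thread stays runnable.
            std::this_thread::yield();
        } else {
            // Linear backoff: sleep time grows with the number of failed attempts.
            std::this_thread::sleep_for(
                std::chrono::microseconds(attempts - kSpinLimit + 1));
        }
    }
}
```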
| Comment by Daniel Gomez Ferro [ 20/Dec/21 ] |
|
I forgot to mention a concern louis.williams raised if we choose to keep the current mutex: we currently only take the mutex on two paths, local secondary reads (i.e. lastApplied reads) and readConcern: majority reads, but this bottleneck could be affecting many other paths. keith.smith asked about the parallelism of the YCSB workload and correctly guessed that it's larger than the number of physical cores. The workload uses 32 threads and the machines have 16 physical cores (32 with HyperThreading, which is disabled by the testing infrastructure). I ran a quick experiment with HyperThreading enabled to verify whether those extra cores could reduce the gap:
This implies read threads are starving update threads, which led me to try a "wider gate" that increases read concurrency without starving updates as much, substituting the mutex with a semaphore (hardcoded to 8, half the physical cores): https://spruce.mongodb.com/version/61c052e20ae6064f25ffc3c6/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

master with semaphore:

master with semaphore, no flow control:

For the normal case, using a semaphore reduces the regression to noise level and considerably improves the performance of the read-heavy workloads. Ideally, if it doesn't negatively affect other workloads, I think we could put the semaphore directly in WT's __txn_get_snapshot_int to reduce the critical section and avoid missing other paths that would also benefit from the throttling.
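A minimal sketch of the "wider gate" idea, assuming a C++20 counting semaphore and the hardcoded limit of 8 mentioned above; the function and variable names are illustrative, not the actual patch.

```cpp
#include <semaphore>

// At most 8 read-only operations may be creating their WiredTiger snapshot at
// the same time; writers are unaffected, so reads cannot starve updates as badly.
static std::counting_semaphore<8> readSnapshotGate{8};

void beginReadTransactionThrottled() {
    readSnapshotGate.acquire();
    // WT_SESSION::begin_transaction(...) / __txn_get_snapshot_int() runs here,
    // inside the gate, so the critical section stays as small as possible.
    readSnapshotGate.release();
}
```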
| Comment by Daniel Gomez Ferro [ 16/Dec/21 ] |
|
Summary so far: serializing calls to WT_SESSION::begin_transaction (which eventually calls __txn_get_snapshot_int) with a mutex for read-only operations is faster for 50read/50update workloads than letting all read-only operations call WT_SESSION::begin_transaction concurrently. Removing the mutex in WiredTigerRecoveryUnit::beginTransactionAtLastAppliedTimestamp creates a small regression on YCSB's 50read50update workload.

The main difference when profiling both versions is that __txn_get_snapshot_int gets hotter for read operations, specifically this loop: https://github.com/wiredtiger/wiredtiger/blob/develop/src/txn/txn.c#L304-L326

After a discussion with louis.williams, our understanding is that read transactions have to iterate over the list of concurrent write transactions to create their snapshot. While a write transaction is allocating its id, the read transaction spins waiting for the definitive id to be set. This spin lock races against the allocation of the id by the write transaction, which happens here: https://github.com/wiredtiger/wiredtiger/blob/develop/src/include/txn_inline.h#L1226-L1231 It looks like the cost of the spinlock and the increased concurrency is greater than the overhead of a mutex serializing read operations.

This is the ticket that introduced the spinlock: https://jira.mongodb.org/browse/WT-4571. It seems they focused mostly on update-only workloads at the time. There were also some discussions about alternative designs for the spinlock on the pull request https://github.com/wiredtiger/wiredtiger/pull/4485 (for instance, using a bit on the transaction id vs the separate is_allocating field).
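To make the contention concrete, here is a rough C++ rendering of what the linked WiredTiger loop does (the real code is C in txn.c and txn_inline.h); the slot layout and all names below are simplified assumptions, not the actual data structures.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// One entry per active transaction; a writer first marks its slot as
// allocating, then publishes its definitive id (simplified from WiredTiger).
struct TxnSlot {
    std::atomic<bool> isAllocating{false};
    std::atomic<uint64_t> id{0};
};

// Each reader walks every active transaction to build its snapshot. If a
// writer is still allocating its id, the reader spins until it is published;
// with many concurrent readers this spin is the hot loop seen in the profiles.
std::vector<uint64_t> buildSnapshot(const std::vector<TxnSlot>& activeTxns) {
    std::vector<uint64_t> snapshot;
    for (const auto& slot : activeTxns) {
        while (slot.isAllocating.load(std::memory_order_acquire)) {
            // busy-wait (WT_PAUSE() in WiredTiger), racing the writer's id store
        }
        const uint64_t id = slot.id.load(std::memory_order_acquire);
        if (id != 0) {
            snapshot.push_back(id);
        }
    }
    return snapshot;
}
```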
| Comment by Daniel Gomez Ferro [ 15/Dec/21 ] |
|
I used the attached funclatency.py
| Comment by Daniel Gomez Ferro [ 14/Dec/21 ] |
|
Attaching the annotated disassembly of __txn_get_snapshot_int; most of the time is spent in this loop: https://github.com/wiredtiger/wiredtiger/blob/develop/src/txn/txn.c#L304-L326
For comparison, here's the annotated version with the mutex:
| Comment by Daniel Gomez Ferro [ 13/Dec/21 ] |
|
I've attached the FlameGraph diff ( https://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html ) that shows that only that one method saw a significant increase in runtime:
| Comment by Daniel Gomez Ferro [ 13/Dec/21 ] |
|
Here are the metrics for concurrent sessions walked; not sure if relevant:
| Comment by Daniel Gomez Ferro [ 13/Dec/21 ] |
|
I've attached a flamegraph of the stack traces I collected during two runs; here are the relevant parts, I believe:

With mutex:

Without mutex:

Without the mutex we spend a lot more time in __txn_get_snapshot_int, where we walk the array of concurrent transactions and spin if necessary (if a transaction is still allocating its ID): https://github.com/wiredtiger/wiredtiger/blob/develop/src/txn/txn.c#L285-L328 I don't know why this mostly affects the 50read50update workload, though; do the OplogAppliers cause contention here? On that side I don't see big differences:

With mutex:

Without mutex:
| Comment by Daniel Gomez Ferro [ 03/Dec/21 ] |
|
My initial attempt at reproducing the unexpected performance drop for the 50/50 workload was on my virtual workstation. I was unable to reproduce it; the no-mutex version was a bit faster than the version with the mutex (~5400 vs ~5100 ops/s). Since the performance was much lower than on the sys-perf infrastructure (~5K vs ~24K ops/s), I tried reproducing on multiple EC2 instances. Eventually I realized the automated test was running with defaultWriteConcern: {w: 1, wtimeout: 0}. With that setting performance was more similar (~30K ops/s), but I couldn't reproduce the regression: no-mutex was faster (35K vs 27K ops/s) and had lower latency (800 vs 1300 us average). I'm now re-running using the DSI framework.