[SERVER-55030] Remove mutexes that serialize secondary and majority read operations Created: 08/Mar/21  Updated: 29/Oct/23  Resolved: 02/Feb/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.3.0

Type: Improvement Priority: Major - P3
Reporter: Louis Williams Assignee: Daniel Gomez Ferro
Resolution: Fixed Votes: 0
Labels: neweng, newgrad
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Execution time histogram.png     File diff.svg     File funclatency.py     PNG File image-2021-11-25-16-53-52-297.png     PNG File image-2021-11-25-16-55-58-447.png     PNG File image-2021-11-25-16-57-45-164.png     PNG File image-2021-11-30-22-49-26-854.png     PNG File image-2021-12-13-14-58-28-825.png     PNG File image-2021-12-13-14-59-44-403.png     PNG File image-2021-12-13-15-06-18-748.png     PNG File image-2021-12-13-15-06-32-867.png     PNG File image-2021-12-13-15-55-26-425.png     PNG File image-2021-12-13-18-34-39-915.png     PNG File image.png     PNG File master-backoff-noflowcontrol.png     PNG File master-backoff.png     PNG File master-no-flow-control.png     PNG File master-semaphore.png     File mutex-collapsed.svg     File nomutex-collapsed.svg     PNG File screenshot-1.png     PNG File screenshot-2.png    
Issue Links:
Backports
Depends
is depended on by SERVER-55024 Complete TODO listed in WT-6709 Closed
Problem/Incident
causes SERVER-63244 Majority reads must round their read ... Closed
Related
related to SERVER-51041 Throttle starting transactions for se... Closed
related to WT-6709 Remove timestamp queues that used to ... Closed
is related to WT-9074 Reconsider spinlocks in transaction I... Open
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Sprint: Execution Team 2021-11-29, Execution Team 2021-12-13, Execution Team 2021-12-27, Execution Team 2022-01-10, Execution Team 2022-01-24, Execution Team 2022-02-07
Participants:
Case:
Linked BF Score: 135

 Description   

MongoDB serializes the creation of new WiredTiger transactions for local secondary reads (i.e. lastApplied reads) and readConcern: majority reads.

WT-6709 recently addressed performance problems in the WiredTiger timestamp queues that should make these mutexes unnecessary. Investigate removing both of these mutexes to potentially support higher read throughput.

See SERVER-51041 for more details.



 Comments   
Comment by Githook User [ 02/Feb/22 ]

Author:

{'name': 'Daniel Gómez Ferro', 'email': 'daniel.gomezferro@mongodb.com', 'username': 'dgomezferro'}

Message: SERVER-55030 Remove mutexes serializing secondary and majority read ops
Branch: master
https://github.com/mongodb/mongo/commit/27387c8cb00e9de161ccced669f09ca9c743546c

Comment by Daniel Gomez Ferro [ 04/Jan/22 ]

The traditional way of dealing with spinlock contention is to have a back-off mechanism. We use WT_PAUSE() here, but that doesn't make the threads ineligible for scheduling, so other read threads can go right back to spinning on the same lock. I ran a quick experiment with a simple linear backoff strategy, making the thread sleep after a few unsuccessful loops on the spinlock: https://spruce.mongodb.com/version/61d2c806850e61439388c2e2/tasks

master with backoff:

master with backoff, no flow control:

There is a very small regression in the 99th latency percentile in some cases, but the results are good. I think this approach is more self-adapting than an explicit semaphore, and performance-wise it is on par.
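The linear backoff idea can be sketched in a few lines. Everything below (class and function names, the spin limit, the sleep granularity) is illustrative and not the actual WiredTiger change; the point is that after a few failed spins the thread sleeps, making it ineligible for scheduling instead of burning a core with WT_PAUSE()-style busy waiting:

```cpp
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

// Hypothetical sketch of a spinlock with linear backoff, not WT code.
class BackoffSpinLock {
public:
    void lock() {
        int spins = 0;
        while (flag_.test_and_set(std::memory_order_acquire)) {
            if (++spins < kSpinLimit) {
                std::this_thread::yield();  // cheap pause while spinning
            } else {
                // Linear backoff: sleep longer the more acquisitions fail,
                // so contended threads actually leave the run queue.
                std::this_thread::sleep_for(
                    std::chrono::microseconds(spins - kSpinLimit + 1));
            }
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    static constexpr int kSpinLimit = 16;  // illustrative threshold
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Small demo: contended increments stay consistent under the lock.
inline long run_backoff_demo(int threads, int iters) {
    BackoffSpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& w : workers) w.join();
    return counter;
}
```

In the real experiment the backoff lives inside WT's existing spin loop rather than in a standalone lock class.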

Comment by Daniel Gomez Ferro [ 20/Dec/21 ]

I forgot to mention a concern louis.williams raised if we choose to keep the current mutex: we currently use the mutex on only two paths, local secondary reads (i.e. lastApplied reads) and readConcern: majority reads, but this bottleneck could be affecting many other paths.

keith.smith asked about the parallelism of the YCSB workload and correctly guessed that it's larger than the number of physical cores. The workload uses 32 threads and the machines have 16 physical cores (32 with HyperThreading, which is disabled by the testing infrastructure). I ran a quick experiment with HyperThreading enabled to verify whether those extra cores could reduce the gap:

Throughput (ops/s) for 50% reads / 50% updates:

                    Mutex   No Mutex   Difference
HyperThreading      30703   30775      +0.24%
No HyperThreading   25088   24180      -3.62%

This implies read threads are starving update threads, which led me to try a "wider gate" that increases read concurrency without starving updates as much, substituting the mutex with a semaphore (hardcoded to 8, half the physical cores): https://spruce.mongodb.com/version/61c052e20ae6064f25ffc3c6/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

master with semaphore:

master with semaphore, no flow control:

For the normal case, using a semaphore reduces the regression to noise level and improves the performance of the read-heavy workloads considerably.

Ideally, if it doesn't negatively affect other workloads, I think we could put the semaphore directly in WT's __txn_get_snapshot_int, which would shrink the critical section and avoid missing other paths that would also benefit from the throttling.
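The "wider gate" can be sketched as a counting semaphore wrapped around snapshot acquisition. The sketch below is hypothetical (the gate class, function names, and the wrapper are illustrative; only the slot count of 8 comes from the experiment), and it builds the semaphore from a mutex and condition variable so it needs only C++11:

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Minimal counting semaphore: up to `slots` threads may be inside at once.
class Gate {
public:
    explicit Gate(int slots) : slots_(slots) {}
    void acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return slots_ > 0; });
        --slots_;
    }
    void release() {
        { std::lock_guard<std::mutex> lk(m_); ++slots_; }
        cv_.notify_one();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int slots_;
};

// 8 was hardcoded in the experiment: half the physical cores.
constexpr int kMaxConcurrentSnapshots = 8;

// Stand-in for gating WT_SESSION::begin_transaction for read-only ops.
template <typename F>
void with_snapshot_gate(Gate& gate, F&& beginTransaction) {
    gate.acquire();      // blocks once 8 readers are inside
    beginTransaction();
    gate.release();
}

// Demo: 32 reader threads, returns {peak concurrency inside the gate,
// total gated calls} so the throttling is observable.
inline std::pair<int, int> run_gate_demo(int threads) {
    Gate gate(kMaxConcurrentSnapshots);
    std::atomic<int> inside{0}, peak{0}, total{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < threads; ++t)
        workers.emplace_back([&] {
            with_snapshot_gate(gate, [&] {
                int now = ++inside;
                int prev = peak.load();
                while (now > prev && !peak.compare_exchange_weak(prev, now)) {}
                ++total;
                --inside;
            });
        });
    for (auto& w : workers) w.join();
    return {peak.load(), total.load()};
}
```

Unlike the mutex, which admits one reader at a time, the gate keeps read concurrency bounded without fully serializing it.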

Comment by Daniel Gomez Ferro [ 16/Dec/21 ]

Summary so far: serializing calls to WT_SESSION::begin_transaction (which eventually calls __txn_get_snapshot_int) with a mutex for read-only operations is faster for 50read/50update workloads than letting all read-only operations call WT_SESSION::begin_transaction concurrently.

Removing a mutex in WiredTigerRecoveryUnit::beginTransactionAtLastAppliedTimestamp creates a small regression on YCSB's 50read50update workload. The main difference when profiling both versions is that __txn_get_snapshot_int gets hotter for READ operations, specifically this loop: https://github.com/wiredtiger/wiredtiger/blob/develop/src/txn/txn.c#L304-L326

After a discussion with louis.williams, our understanding is that read transactions have to iterate over the list of concurrent write transactions to create their snapshot. While a write transaction is still allocating its ID, the read transaction spins waiting for the definitive ID to be set.

This spinlock races against the write transaction's ID allocation, which happens here: https://github.com/wiredtiger/wiredtiger/blob/develop/src/include/txn_inline.h#L1226-L1231

It looks like the cost of the spinlock and increased concurrency is greater than the overhead of a mutex serializing read operations. The spinlock was introduced in https://jira.mongodb.org/browse/WT-4571; it seems they focused mostly on update-only workloads at the time. There were also some discussions about alternative designs for the spinlock on the pull request https://github.com/wiredtiger/wiredtiger/pull/4485 (for instance, using a bit on the transaction ID vs. the separate is_allocating field).
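The race can be illustrated with a stripped-down model. All names below are hypothetical stand-ins for the WT structures (cf. the txn_inline.h allocation path and the __txn_get_snapshot_int loop linked above): a writer publishes its ID in two steps behind an is_allocating flag, and a reader building a snapshot spins until the flag clears:

```cpp
#include <atomic>
#include <cstdint>

// Sketch of one shared per-transaction slot; illustrative, not WT code.
struct TxnSharedSketch {
    std::atomic<uint64_t> id{0};          // 0 stands in for WT_TXN_NONE
    std::atomic<bool> is_allocating{false};
};

// Writer side: mark the slot as mid-allocation, publish the ID, then
// clear the flag once the definitive ID is visible to readers.
inline void writer_allocate(TxnSharedSketch& s, uint64_t new_id) {
    s.is_allocating.store(true, std::memory_order_relaxed);
    s.id.store(new_id, std::memory_order_release);
    s.is_allocating.store(false, std::memory_order_release);
}

// Reader side: spin until the slot is no longer mid-allocation, then
// read the ID into the snapshot. This busy loop, multiplied across all
// concurrent readers and all slots, is where the profile gets hot.
inline uint64_t reader_wait_for_id(const TxnSharedSketch& s) {
    while (s.is_allocating.load(std::memory_order_acquire)) {
        // WT_PAUSE() equivalent would go here.
    }
    return s.id.load(std::memory_order_acquire);
}
```

With the mutex in place only one reader at a time hits this spin; without it, every concurrent reader can pile up on the same allocating slot.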

Comment by Daniel Gomez Ferro [ 15/Dec/21 ]

I used the attached funclatency.py script (modified from /usr/share/bcc/tools/funclatency) to get an execution time histogram of __txn_get_snapshot_int just for read operations:

With Mutex

[ec2-user@ip-10-2-0-201 ~]$ sudo ./funclatency -p 41475 -d 60 -u '/media/ebs/workdir/mongodb/bin/mongod:__txn_get_snapshot_int'
Creating BPF object
Attaching probes
Attached first
Attached second
Tracing 3 functions for "/media/ebs/workdir/mongodb/bin/mongod:__txn_get_snapshot_int"... Hit Ctrl-C to end.
 
     usecs               : count     distribution
         0 -> 1          : 37       |                                        |
         2 -> 3          : 306171   |***************************             |
         4 -> 7          : 451922   |****************************************|
         8 -> 15         : 7688     |                                        |
        16 -> 31         : 3242     |                                        |
        32 -> 63         : 364      |                                        |
        64 -> 127        : 714      |                                        |
       128 -> 255        : 238      |                                        |
       256 -> 511        : 72       |                                        |
       512 -> 1023       : 56       |                                        |
      1024 -> 2047       : 35       |                                        |
      2048 -> 4095       : 25       |                                        |
      4096 -> 8191       : 20       |                                        |
      8192 -> 16383      : 17       |                                        |
     16384 -> 32767      : 13       |                                        |
     32768 -> 65535      : 1        |                                        |
 
avg = 6 usecs, total: 4853645 usecs, count: 770630

Without Mutex

[ec2-user@ip-10-2-0-201 ~]$ sudo ./funclatency -p 26945 -d 60 -u '/media/ebs/workdir/mongodb/bin/mongod:__txn_get_snapshot_int'
Creating BPF object
Attaching probes
Attached first
Attached second
Tracing 3 functions for "/media/ebs/workdir/mongodb/bin/mongod:__txn_get_snapshot_int"... Hit Ctrl-C to end.
 
     nsecs               : count     distribution
         0 -> 1          : 1257     |                                        |
         2 -> 3          : 286416   |**************************              |
         4 -> 7          : 430875   |****************************************|
         8 -> 15         : 6136     |                                        |
        16 -> 31         : 3883     |                                        |
        32 -> 63         : 721      |                                        |
        64 -> 127        : 118      |                                        |
       128 -> 255        : 194      |                                        |
       256 -> 511        : 533      |                                        |
       512 -> 1023       : 193      |                                        |
      1024 -> 2047       : 98       |                                        |
      2048 -> 4095       : 86       |                                        |
      4096 -> 8191       : 75       |                                        |
      8192 -> 16383      : 108      |                                        |
     16384 -> 32767      : 197      |                                        |
     32768 -> 65535      : 389      |                                        |
     65536 -> 131071     : 180      |                                        |
    131072 -> 262143     : 1        |                                        |
 
avg = 63117 nsecs, total: 46172942306 nsecs, count: 731540
 
Detaching...

Comment by Daniel Gomez Ferro [ 14/Dec/21 ]

Attaching the annotated disassembly of __txn_get_snapshot_int, most time is spent in this loop: https://github.com/wiredtiger/wiredtiger/blob/develop/src/txn/txn.c#L304-L326

        while (s != txn_shared && (id = s->id) != WT_TXN_NONE && WT_TXNID_LE(prev_oldest_id, id) &&
          WT_TXNID_LT(id, current_id)) {
            /*
             * If the transaction is still allocating its ID, then we spin here until it gets its
             * valid ID.
             */
            WT_READ_BARRIER();
            if (!s->is_allocating) {
                // no time spent in here...
            }
            WT_PAUSE();
        }

For comparison here's the annotated version with the mutex:

Comment by Daniel Gomez Ferro [ 13/Dec/21 ]

I've attached the FlameGraph diff ( https://www.brendangregg.com/blog/2014-11-09/differential-flame-graphs.html ) that shows that only that one method saw a significant increase in runtime:

diff.svg

Comment by Daniel Gomez Ferro [ 13/Dec/21 ]

Here are the metrics for concurrent sessions walked; I'm not sure if they're relevant:

Comment by Daniel Gomez Ferro [ 13/Dec/21 ]

I've attached a flamegraph of the stack traces I collected during two runs; here are the relevant parts, I believe:

With mutex:

Without mutex:

Without the mutex we spend a lot more time in __txn_get_snapshot_int, where we walk the array of concurrent transactions, spinning if necessary (i.e. if a transaction is still allocating its ID): https://github.com/wiredtiger/wiredtiger/blob/develop/src/txn/txn.c#L285-L328

I don't know why this mostly affects the 50read50update workload, though; do the OplogAppliers cause contention here? On that side I don't see big differences:

With mutex:

Without mutex:

Comment by Daniel Gomez Ferro [ 03/Dec/21 ]

My initial attempt at reproducing the unexpected performance drop for the 50/50 workload was on my virtual workstation. I was unable to reproduce it: the no-mutex version was a bit faster than the version with the mutex (~5400 vs. ~5100 ops/s).

Since performance was much lower than on the sys-perf infrastructure (~5K vs. ~24K ops/s), I tried reproducing on multiple EC2 instances. Eventually I realized the automated test was running with defaultWriteConcern: {w: 1, wtimeout: 0}. With that setting performance was more similar (~30K ops/s), but I still couldn't reproduce the regression: no-mutex was faster (35K vs. 27K ops/s) and had lower latency (800 vs. 1300 us average).

I'm now re-running using the DSI framework.

Generated at Thu Feb 08 05:35:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.