[SERVER-40250] High contention for ReplicationCoordinatorImpl::_mutex in w:majority workloads Created: 20/Mar/19  Updated: 05/Dec/23  Resolved: 09/Oct/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Andy Schwerin Assignee: Backlog - Replication Team
Resolution: Duplicate Votes: 0
Labels: dmd-perf
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-43135 Introduce a future-based API for wait... Closed
Related
is related to SERVER-31694 17% throughput regression in insert w... Closed
Assigned Teams:
Replication
Operating System: ALL

 Description   

bartle reports high contention for the replication coordinator mutex in heavy insert workloads with w:majority writes, which leads to low CPU utilization and a bottleneck on a synthetic resource (the mutex). This is problematic on deployments with many cores, but can be a problem even on 16-core machines, as he mentions in a comment on SERVER-31694 (quoted below).

Shortening the critical section held under the mutex in setMyLastAppliedOpTimeForward, and particularly in _wakeReadyWaiters_inlock, is one possible approach to mitigating the problem. Finer-grained locking around the waiters might be another.
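
To make the first suggestion concrete, here is a minimal sketch (hypothetical WaiterList/Waiter types, not the actual ReplicationCoordinatorImpl code, assuming C++17) of shortening the critical section: mark the satisfied waiters while the mutex is held, but notify them only after it has been released, so the wake-up itself never serializes on the mutex.

#include <condition_variable>
#include <list>
#include <mutex>
#include <vector>

class WaiterList {
public:
    // Client thread: block until the last applied opTime reaches `target`.
    void awaitOpTime(long long target) {
        std::unique_lock<std::mutex> lk(_mutex);
        Waiter& w = _waiters.emplace_back();
        w.target = target;
        w.cv.wait(lk, [&] { return w.satisfied; });
        // Removal of finished waiters is omitted for brevity.
    }

    // Replication thread: called whenever lastApplied/lastDurable advances.
    void onOpTimeAdvanced(long long newOpTime) {
        std::vector<std::condition_variable*> ready;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            for (Waiter& w : _waiters) {
                if (!w.satisfied && w.target <= newOpTime) {
                    w.satisfied = true;
                    ready.push_back(&w.cv);
                }
            }
        }  // critical section ends here
        for (auto* cv : ready) {
            cv->notify_one();  // no mutex held while notifying
        }
    }

private:
    struct Waiter {
        long long target = 0;  // stand-in for repl::OpTime
        bool satisfied = false;
        std::condition_variable cv;
    };

    std::mutex _mutex;
    std::list<Waiter> _waiters;  // std::list keeps addresses stable for &w.cv
};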



 Comments   
Comment by Lingzhi Deng [ 09/Oct/19 ]

SERVER-43135 introduced a future-based API for writeConcern waiting to reduce contention on the ReplicationCoordinatorImpl _mutex. Closing as a duplicate.

Comment by David Bartley [ 09/Oct/19 ]

Thanks for the update!

Comment by Lingzhi Deng [ 09/Oct/19 ]

Hi bartle, thanks for the report regarding {w: majority} performance. We have done some performance testing using an insert workload similar to the one you suggested and we were able to see contention on the ReplicationCoordinatorImpl _mutex.

The current implementation of {w: majority} involves journaling and awaitReplication. So, a server handling concurrent {w: majority} workloads can be loosely modeled as a closed system with three queues - CPU, journaling, and replication - contending for resources. In a closed system (assuming that is how you ran the tests), ~7% CPU utilization on a 16-core machine doesn't necessarily indicate contention on a single core; we believe the CPU is simply not saturated at the overall throughput of the closed system. It does, however, suggest long service time (likely due to contention) in either journaling or replication, so we also did some profiling and found that the journaling queue dominates the overall time needed to service {w: majority} writes.

While the ReplicationCoordinatorImpl _mutex is a hot mutex in the replication subsystem, the asymptotic bounds for a closed system are determined by the slowest service (see Section 7.2, Asymptotic Bounds for Closed Systems, in this book). We think the low CPU utilization and poor performance were mostly due to the overall throughput being dominated by journaling.
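
For reference, the asymptotic bounds being invoked are the standard ones for a closed system with N client threads, think time Z, total service demand D across the CPU, journaling, and replication queues, and maximum per-queue demand D_max (notation assumed here, following the usual operational-analysis convention):

X(N) \le \min\left( \frac{N}{D + Z},\ \frac{1}{D_{\max}} \right), \qquad
R(N) \ge \max\left( D,\ N \cdot D_{\max} - Z \right)

Once N is large enough, throughput X(N) is pinned near 1/D_max, so if journaling carries the largest per-write demand it caps throughput no matter how much the replication mutex improves.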

That said, work has been done (mostly in SERVER-43135) to reduce contention on the ReplicationCoordinatorImpl _mutex by introducing a future-based API that relieves the thundering-herd effect of {w: majority} waiters all waking up at the same time. As part of that ticket, we also sort waiters by OpTime to avoid unnecessary computation in _wakeReadyWaiters_inlock, as you suggested. We have also done SERVER-43252, SERVER-43307, and SERVER-43769 to shorten the critical path under the mutex.
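
As a rough illustration only (hypothetical types and names, not the code that landed in SERVER-43135), the two ideas combine as follows: each waiter registers a promise keyed by its target OpTime, and because the container is ordered, the wake-up pass stops at the first waiter that is not yet satisfied instead of re-evaluating every waiter.

#include <future>
#include <map>
#include <memory>
#include <mutex>
#include <vector>

class SortedWaiterList {
public:
    // Client thread: register a waiter, then block on the returned future
    // outside the mutex.
    std::future<void> waitFor(long long opTime) {
        auto promise = std::make_shared<std::promise<void>>();
        auto fut = promise->get_future();
        std::lock_guard<std::mutex> lk(_mutex);
        _waiters.emplace(opTime, std::move(promise));
        return fut;
    }

    // Replication thread: fulfill every waiter whose target opTime is now
    // satisfied. The map is ordered, so the scan stops at the first
    // unsatisfied waiter.
    void onOpTimeAdvanced(long long newOpTime) {
        std::vector<std::shared_ptr<std::promise<void>>> ready;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            auto it = _waiters.begin();
            while (it != _waiters.end() && it->first <= newOpTime) {
                ready.push_back(std::move(it->second));
                it = _waiters.erase(it);
            }
        }
        for (auto& p : ready) {
            p->set_value();  // wakes the waiting client thread, no lock held
        }
    }

private:
    std::mutex _mutex;
    std::multimap<long long, std::shared_ptr<std::promise<void>>> _waiters;
};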

In a closed system, improving a service other than the slowest one has only marginal impact on throughput and mean response time. Thus, after the work listed above, we didn't see much improvement in overall throughput in our tests for {w: majority} (default j: true) workloads. However, we did see a 20% - 40% improvement in {w: majority, j: false} workloads when running with 256 or more client threads. This is because {w: majority, j: false} workloads skip journaling and so suffer the most from replication mutex contention. Section 7.3 on page 118 of the same book gives a similar example of what happens when a non-bottleneck part of a system improves.

We already have proposals to optimize the way we journal client writes. For more details, see SERVER-43417. We have done a proof of concept for the optimization and we were able to see 20% - 70% gain for {w: majority} workloads with >= 128 client threads. We will consider it after the “Replicate Before Journaling” project (SERVER-41392).

Comment by David Bartley [ 21/Mar/19 ]

Another thing that'd be nice would be to expose lock metrics (lock held time, wait time, etc.) for these low-level mutexes, similar to what's exposed for MongoDB-style multi-granular locks.
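
As a sketch of what that could look like (a hypothetical wrapper, not an existing MongoDB facility), a mutex can be wrapped so that acquisition wait time and hold time are accumulated and could later be surfaced in a serverStatus-style report:

#include <atomic>
#include <chrono>
#include <cstdint>
#include <mutex>

// Hypothetical instrumented mutex: records total wait time (time spent
// blocked acquiring the lock) and total hold time (time the lock was held).
class InstrumentedMutex {
public:
    void lock() {
        auto start = std::chrono::steady_clock::now();
        _mutex.lock();
        auto acquired = std::chrono::steady_clock::now();
        _waitMicros += toMicros(acquired - start);
        _lockedAt = acquired;
    }

    void unlock() {
        _holdMicros += toMicros(std::chrono::steady_clock::now() - _lockedAt);
        _mutex.unlock();
    }

    uint64_t totalWaitMicros() const { return _waitMicros.load(); }
    uint64_t totalHoldMicros() const { return _holdMicros.load(); }

private:
    static uint64_t toMicros(std::chrono::steady_clock::duration d) {
        return static_cast<uint64_t>(
            std::chrono::duration_cast<std::chrono::microseconds>(d).count());
    }

    std::mutex _mutex;
    std::chrono::steady_clock::time_point _lockedAt;  // valid while held
    std::atomic<uint64_t> _waitMicros{0};
    std::atomic<uint64_t> _holdMicros{0};
};

// Satisfies BasicLockable, so it can be used anywhere std::mutex is, e.g.:
//   InstrumentedMutex m;
//   std::lock_guard<InstrumentedMutex> lk(m);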

Comment by David Bartley [ 20/Mar/19 ]

Yeah, I agree this particular code hasn't really changed from 3.4 to master.

Comment by David Bartley [ 20/Mar/19 ]

This was an insert load with 256 threads on 3.4.  The inserts themselves were pretty small (documents of roughly 1 KB with a few fields), and were just single-document inserts.

Comment by Andy Schwerin [ 20/Mar/19 ]

bartle@stripe.com, in addition to the information you've already supplied, I'm curious to know approximately how many simultaneously executing client threads your workload uses. This will make it easier for us to compare it to our existing performance workloads, in case they need to be extended to cover this case. If you can share code for a representative workload, that would of course be valuable, but I don't think it's strictly required in this case. In any event, please watch this ticket to track the issue, rather than SERVER-31694.

Oh, on which version of MongoDB did you perform this analysis? The core implementation of waking waiters hasn't changed much in the last 4 or 5 years, so I imagine the basic problem exists on all versions, but it may help to know.

The ReplicationCoordinator is a bit of a kitchen sink of functionality today, and breaking it up into logical pieces is going to be an important part of building a maintainable system for tracking write concern satisfaction that scales efficiently to higher core counts. I'm hesitant to endorse a solution with reader-writer locks, as the frequency of writes under the existing mutex is quite high, but longer term I imagine a finer-grained locking solution will be important. In the short term, restructuring the wake-up logic as you suggest might be workable.

Comment by Andy Schwerin [ 20/Mar/19 ]

bartle@stripe.com's comment on SERVER-31694:

Are there plans to improve performance of setMyLastAppliedOpTimeForward?  On a write-majority, insert-heavy workflow we basically see single-core contention on setMyLastAppliedOpTimeForward (based on a CPU profile).  That particular function takes an exclusive mutex, so it's unsurprising that if you push enough write-majority writes through you'd contend on a single core (in practice we're hitting a bottleneck of 12k wps, on a 16-core machine, with ~7% CPU usage).

Ultimately, all of the CPU ends up in _wakeReadyWaiters_inlock.  That particular implementation seems rather naive; it ends up recomputing a bunch of things (again, under a global, exclusive lock) for every replication waiter.  Instead, it seems like you should structure this code such that it determines the largest optime that satisfies the various write concern modes (basically "majority" and w="N") once, and then passes that information down into _doneWaitingForReplication_inlock.

Beyond this, reading through the code, it's fairly concerning how coarse-grained _mutex on ReplicationCoordinatorImpl is.  Is there a reason more work hasn't been invested in finer-grained locks, or even reader-writer locks?  As-is, it's really difficult to make any perf improvements.
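
To make the quoted suggestion concrete, here is a minimal sketch (hypothetical names and a stubbed-out helper; the real code deals with OpTime and WriteConcernOptions) of computing the largest satisfied OpTime once per write concern mode and then treating each waiter as a cheap comparison:

#include <map>
#include <string>
#include <vector>

struct Waiter {
    std::string wcMode;  // e.g. "majority" or a numeric w, simplified to a string
    long long opTime;    // stand-in for repl::OpTime
};

// Stub for the expensive computation (scanning member progress, tag
// matching, ...) that previously ran once per waiter.
long long largestSatisfiedOpTimeForMode(const std::string& wcMode) {
    (void)wcMode;
    return 0;  // placeholder
}

void wakeReadyWaiters(const std::vector<Waiter>& waiters) {
    // Pass 1: one expensive computation per distinct write concern mode.
    std::map<std::string, long long> satisfiedUpTo;
    for (const Waiter& w : waiters) {
        if (satisfiedUpTo.find(w.wcMode) == satisfiedUpTo.end()) {
            satisfiedUpTo[w.wcMode] = largestSatisfiedOpTimeForMode(w.wcMode);
        }
    }
    // Pass 2: each waiter is now a single comparison against the cached value.
    for (const Waiter& w : waiters) {
        if (w.opTime <= satisfiedUpTo[w.wcMode]) {
            // signal this waiter (omitted)
        }
    }
}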
