[SERVER-40250] High contention for ReplicationCoordinatorImpl::_mutex in w:majority workloads Created: 20/Mar/19 Updated: 05/Dec/23 Resolved: 09/Oct/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Andy Schwerin | Assignee: | Backlog - Replication Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | dmd-perf |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
bartle reports high contention on the replication coordinator mutex in heavy insert workloads with w:majority writes, which leads to low CPU utilization because the workload bottlenecks on a synthetic resource (the mutex) rather than on hardware. This is most problematic on deployments with many cores, but can be a problem even on 16-core machines, as he mentions in a comment on another ticket. Shortening the critical section held under the mutex in setMyLastAppliedOpTimeForward, and particularly in _wakeReadyWaiters_inlock, is one possible approach to mitigating the problem (see the sketch below). Finer-grained locking around the waiters might be another. |
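As a rough illustration of the first approach (a hypothetical sketch, not the server's actual code; OpTime, Waiter, and ReplCoordSketch are illustrative stand-ins): the satisfied waiters are selected while holding _mutex, but the comparatively expensive wake-up work happens only after the mutex is released, which keeps the critical section short.

    // Hypothetical sketch: shorten the critical section by moving waiter
    // notification outside the mutex. Names are illustrative only.
    #include <algorithm>
    #include <functional>
    #include <memory>
    #include <mutex>
    #include <vector>

    struct OpTime {
        long long ts = 0;
        bool operator<=(const OpTime& rhs) const { return ts <= rhs.ts; }
    };

    struct Waiter {
        OpTime target;                 // opTime the waiter is waiting for
        std::function<void()> notify;  // e.g. signals a condition variable
    };

    class ReplCoordSketch {
    public:
        void setMyLastAppliedOpTimeForward(OpTime newOpTime) {
            std::vector<std::shared_ptr<Waiter>> ready;
            {
                std::lock_guard<std::mutex> lk(_mutex);
                if (_lastApplied <= newOpTime) {
                    _lastApplied = newOpTime;
                    // Only *select* the satisfied waiters under the mutex...
                    auto it = std::partition(
                        _waiters.begin(), _waiters.end(),
                        [&](const std::shared_ptr<Waiter>& w) {
                            return !(w->target <= _lastApplied);
                        });
                    ready.assign(it, _waiters.end());
                    _waiters.erase(it, _waiters.end());
                }
            }  // _mutex released here
            // ...and do the comparatively expensive wake-ups outside it.
            for (auto& w : ready) {
                w->notify();
            }
        }

        void addWaiter(std::shared_ptr<Waiter> w) {
            std::lock_guard<std::mutex> lk(_mutex);
            _waiters.push_back(std::move(w));
        }

    private:
        std::mutex _mutex;
        OpTime _lastApplied;
        std::vector<std::shared_ptr<Waiter>> _waiters;
    };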
| Comments |
| Comment by Lingzhi Deng [ 09/Oct/19 ] |
|
|
| Comment by David Bartley [ 09/Oct/19 ] |
|
Thanks for the update! |
| Comment by Lingzhi Deng [ 09/Oct/19 ] |
|
Hi bartle, thanks for the report regarding {w: majority} performance. We have done some performance testing using an insert workload similar to the one you suggested, and we were able to see contention on the ReplicationCoordinatorImpl _mutex.

The current implementation of {w: majority} involves journaling and awaitReplication, so a server handling concurrent {w: majority} workloads can be loosely modeled as a closed system with three queues (CPU, journaling, and replication), each contending for resources. In a closed system (assuming that is how you ran the tests), ~7% CPU utilization on a 16-core machine doesn't necessarily indicate contention on a single core; we believe the CPU is simply not saturated at the overall throughput of the closed system. It does, however, suggest long service time (likely due to contention) in either journaling or replication, so we also did some profiling and found that the journaling queue dominates the overall time needed to service {w: majority} writes. While the ReplicationCoordinatorImpl _mutex is a hot mutex in the replication subsystem, the asymptotic throughput bound of a closed system is determined by its slowest service (see Section 7.2, Asymptotic Bounds for Closed Systems, in this book). We think the low CPU utilization and poor performance were mostly due to overall throughput being dominated by journaling.

That said, work has been done (mostly in the tickets linked from this one) to reduce contention on this mutex. In a closed system, improving a non-slowest service has only a marginal impact on throughput and mean response time. Thus, after the work listed above, we didn't see much improvement in overall throughput in our tests for {w: majority} (default j: true) workloads. However, we did see 20%-40% improvement in {w: majority, j: false} workloads when running with 256 or more client threads, because those workloads skip journaling and suffer the most from replication mutex contention. Section 7.3 on page 118 of the same book gives a similar example of what happens when a non-bottleneck part of a system improves.

We already have proposals to optimize the way we journal client writes; for more details, see the linked tickets. |
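For illustration, here is a toy calculation of the standard asymptotic throughput bound for closed systems, X(N) <= min(N / (D + Z), 1 / D_max), where D is the total service demand per request, Z the think time, and D_max the largest per-station demand (the bottleneck). All service demands below are made-up numbers, chosen only to show why halving the demand of a non-bottleneck resource (the replication mutex) barely moves the bound, while halving the bottleneck (journaling) nearly doubles it.

    // Toy illustration of the closed-system asymptotic bound.
    // All service demands are hypothetical, not measured values.
    #include <algorithm>
    #include <cstdio>

    static double throughputBound(double dCpu, double dJournal, double dReplMutex,
                                  int clients, double thinkTime) {
        double dTotal = dCpu + dJournal + dReplMutex;
        double dMax = std::max({dCpu, dJournal, dReplMutex});
        return std::min(clients / (dTotal + thinkTime), 1.0 / dMax);
    }

    int main() {
        const int clients = 256;   // closed system: 256 client threads
        const double think = 0.0;  // no think time between requests
        // Made-up per-write service demands (seconds); journaling dominates.
        const double cpu = 0.0002, journal = 0.002, replMutex = 0.0005;

        std::printf("baseline bound:           %.0f writes/s\n",
                    throughputBound(cpu, journal, replMutex, clients, think));
        // Halving the non-bottleneck (repl mutex) demand barely moves the bound...
        std::printf("repl mutex demand halved: %.0f writes/s\n",
                    throughputBound(cpu, journal, replMutex / 2, clients, think));
        // ...while halving the bottleneck (journaling) demand nearly doubles it.
        std::printf("journaling demand halved: %.0f writes/s\n",
                    throughputBound(cpu, journal / 2, replMutex, clients, think));
        return 0;
    }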
| Comment by David Bartley [ 21/Mar/19 ] |
|
Another thing that'd be nice would be to expose lock metrics (lock held time, wait time, etc.) for these low-level mutexes, similar to what's exposed for MongoDB-style multi-granular locks. |
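As a sketch of what such metrics might look like (hypothetical; this is not an existing MongoDB facility), a mutex wrapper could time both the wait to acquire and the time held, exposing counters that something like serverStatus could then report:

    // Hypothetical instrumentation sketch, not an existing server feature:
    // records total wait time, total held time, and acquisition count.
    #include <atomic>
    #include <chrono>
    #include <cstdint>
    #include <mutex>

    class InstrumentedMutex {
    public:
        void lock() {
            auto start = Clock::now();
            _mutex.lock();
            _acquired = Clock::now();
            _waitMicros += toMicros(_acquired - start);
            ++_acquisitions;
        }

        void unlock() {
            _heldMicros += toMicros(Clock::now() - _acquired);
            _mutex.unlock();
        }

        // Counters suitable for surfacing through a diagnostics command.
        std::uint64_t totalWaitMicros() const { return _waitMicros.load(); }
        std::uint64_t totalHeldMicros() const { return _heldMicros.load(); }
        std::uint64_t acquisitions() const { return _acquisitions.load(); }

    private:
        using Clock = std::chrono::steady_clock;
        static std::uint64_t toMicros(Clock::duration d) {
            return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
        }

        std::mutex _mutex;
        Clock::time_point _acquired;  // valid only while the lock is held
        std::atomic<std::uint64_t> _waitMicros{0};
        std::atomic<std::uint64_t> _heldMicros{0};
        std::atomic<std::uint64_t> _acquisitions{0};
    };

    // Usage: std::lock_guard<InstrumentedMutex> lk(someInstrumentedMutex);

The usual objection is the extra pair of clock reads per acquisition, so instrumentation like this would probably need to be sampled or enabled only in diagnostic builds.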
| Comment by David Bartley [ 20/Mar/19 ] |
|
Yeah, I agree this particular code hasn't really changed from 3.4 to master. |
| Comment by David Bartley [ 20/Mar/19 ] |
|
This was an insert load with 256 threads on 3.4. The inserts themselves were pretty small (documents of roughly 1 KB with a few fields), and were just single-document inserts. |
| Comment by Andy Schwerin [ 20/Mar/19 ] |
|
bartle@stripe.com, in addition to the information you've already supplied, I'm curious to know approximately how many simultaneously executing client threads your workload uses. This will make it easier for us to compare it to our existing performance workloads, in case they need to be extended to cover this case. If you can share code for a representative workload, that would of course be valuable, but I don't think it's strictly required here. In any event, please watch this ticket to track the issue, rather than the other one.

Oh, on which version of MongoDB did you perform this analysis? The core implementation of waking waiters hasn't changed much in the last 4 or 5 years, so I imagine the basic problem exists on all versions, but it may help to know.

The ReplicationCoordinator is a bit of a kitchen sink of functionality today, and breaking it up into logical pieces is going to be an important part of building a maintainable way to track write concern satisfaction that scales efficiently to higher core counts. I'm hesitant to endorse a solution based on reader-writer locks, as the frequency of writes under the existing mutex is quite high, but longer term I imagine a finer-grained locking scheme will be important (a rough sketch follows). In the short term, restructuring the wake-up logic as you suggest might be workable. |
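A rough sketch of that finer-grained direction (hypothetical types and names, not server code): give the waiter list its own mutex so that registering and waking waiters no longer contends with the coordinator's main state. The register-then-recheck ordering below is what avoids missed wake-ups when an advance races with a newly arriving waiter.

    // Hypothetical sketch of finer-grained locking for write-concern waiters.
    #include <atomic>
    #include <condition_variable>
    #include <memory>
    #include <mutex>
    #include <vector>

    class WriteConcernWaiters {
    public:
        // Called by a client thread after its write, with a target timestamp.
        void awaitTimestamp(long long target) {
            auto w = std::make_shared<Entry>();
            w->target = target;
            {
                std::lock_guard<std::mutex> lk(_waiterMutex);
                _waiters.push_back(w);
            }
            // Re-check after registering: an advance that raced with the
            // registration either sees this entry or was already visible here.
            std::unique_lock<std::mutex> lk(w->m);
            if (_current.load() >= target) return;  // entry cleaned up by a later advance
            w->cv.wait(lk, [&] { return w->done; });
        }

        // Called on the applier path when the last applied timestamp advances.
        void advanceTo(long long ts) {
            _current.store(ts);
            std::vector<std::shared_ptr<Entry>> ready;
            {
                std::lock_guard<std::mutex> lk(_waiterMutex);
                auto it = _waiters.begin();
                while (it != _waiters.end()) {
                    if ((*it)->target <= ts) {
                        ready.push_back(*it);
                        it = _waiters.erase(it);
                    } else {
                        ++it;
                    }
                }
            }
            // Notification happens outside _waiterMutex, per-waiter only.
            for (auto& w : ready) {
                std::lock_guard<std::mutex> lk(w->m);
                w->done = true;
                w->cv.notify_one();
            }
        }

    private:
        struct Entry {
            long long target = 0;
            std::mutex m;
            std::condition_variable cv;
            bool done = false;
        };

        std::atomic<long long> _current{0};
        std::mutex _waiterMutex;
        std::vector<std::shared_ptr<Entry>> _waiters;
    };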
| Comment by Andy Schwerin [ 20/Mar/19 ] |
|
bartle@stripe.com's comment on
|