<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 04:54:27 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-40250] High contention for ReplicationCoordinatorImpl::_mutex in w:majority workloads</title>
                <link>https://jira.mongodb.org/browse/SERVER-40250</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=bartle&quot; class=&quot;user-hover&quot; rel=&quot;bartle&quot;&gt;bartle&lt;/a&gt; reports high contention for the replication coordinator mutex in heavy insert workloads with &lt;tt&gt;w:majority&lt;/tt&gt; writes, which leads to low CPU utilization and bottlenecking on a synthetic resource (the mutex). This is problematic on deployments with many cores, but can even be a problem on 16-core machines, as he mentions in a comment on another ticket.&lt;/p&gt;

&lt;p&gt;Shortening the critical section under the mutex in &lt;tt&gt;setMyLastAppliedOpTimeForward&lt;/tt&gt; and particularly in &lt;tt&gt;_wakeReadyWaiters_inlock&lt;/tt&gt; is one possible approach to mitigating the problem. Finer grained locking around waiters might be another.&lt;/p&gt;</description>
                <environment></environment>
        <key id="720222">SERVER-40250</key>
            <summary>High contention for ReplicationCoordinatorImpl::_mutex in w:majority workloads</summary>
                <type id="1" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14703&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="backlog-server-repl">Backlog - Replication Team</assignee>
                                    <reporter username="schwerin@mongodb.com">Andy Schwerin</reporter>
                        <labels>
                            <label>dmd-perf</label>
                    </labels>
                <created>Wed, 20 Mar 2019 23:20:58 +0000</created>
                <updated>Tue, 5 Dec 2023 22:54:32 +0000</updated>
                            <resolved>Wed, 9 Oct 2019 22:07:28 +0000</resolved>
                                                                    <component>Replication</component>
                                        <votes>0</votes>
                                    <watches>23</watches>
                                                                                                                <comments>
                            <comment id="2474787" author="lingzhi.deng" created="Wed, 9 Oct 2019 22:07:28 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-43135&quot; title=&quot;Introduce a future-based API for waiting for write concern&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-43135&quot;&gt;&lt;del&gt;SERVER-43135&lt;/del&gt;&lt;/a&gt; introduced a future-based API for writeConcern waiting to reduce contention on the &lt;tt&gt;ReplicationCoordinatorImpl&lt;/tt&gt; &lt;tt&gt;_mutex&lt;/tt&gt;. Closing as a duplicate.&lt;/p&gt;</comment>
                            <comment id="2474559" author="bartle" created="Wed, 9 Oct 2019 20:01:05 +0000"  >&lt;p&gt;Thanks for the update!&lt;/p&gt;</comment>
                            <comment id="2474509" author="lingzhi.deng" created="Wed, 9 Oct 2019 19:26:19 +0000"  >&lt;p&gt;Hi &lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=bartle&quot; class=&quot;user-hover&quot; rel=&quot;bartle&quot;&gt;bartle&lt;/a&gt;, thanks for the report regarding {w: majority} performance. We have done some performance testing using an insert workload similar to the one you suggested and we were able to see contention on the &lt;tt&gt;ReplicationCoordinatorImpl&lt;/tt&gt; &lt;tt&gt;_mutex&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;The current implementation of {w: majority} involves &lt;a href=&quot;https://github.com/mongodb/mongo/blob/0d0748ae6896c7ab235dffb2a0c8a49e16fad7f8/src/mongo/db/write_concern.cpp#L203&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;journaling&lt;/a&gt; and &lt;a href=&quot;https://github.com/mongodb/mongo/blob/0d0748ae6896c7ab235dffb2a0c8a49e16fad7f8/src/mongo/db/write_concern.cpp#L228&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;&lt;tt&gt;awaitReplication&lt;/tt&gt;&lt;/a&gt;. So, the server handling concurrent {w: majority} workloads can be loosely modeled as a system with 3 queues - CPU, journaling, and replication, with each of the three queues contending for resources. In a closed system (assuming that was the way you ran the tests), ~7% CPU utilization on a 16-core machine doesn&apos;t necessarily suggest contention on a single core. We believe that the CPU is simply not saturated at the overall throughput of the closed system. It does, however, suggest a long service time (likely due to contention) in either journaling or replication, so we have also done some profiling work and found that the journaling queue dominates the overall time needed to service {w: majority} writes. While the &lt;tt&gt;ReplicationCoordinatorImpl&lt;/tt&gt; &lt;tt&gt;_mutex&lt;/tt&gt; is a hot mutex in the replication subsystem, the asymptotic bounds for a closed system are determined by the slowest service (based on &lt;a href=&quot;https://www.cambridge.org/core/books/performance-modeling-and-design-of-computer-systems/743BEBB137B781EDBAFD807D8F7965DF&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;this book&lt;/a&gt;, Section 7.2, Asymptotic Bounds for Closed Systems).
We think the low CPU utilization and bad performance were mostly due to the overall throughput being dominated by journaling.&lt;/p&gt;

&lt;p&gt;That said, work has been done (mostly in &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-43135&quot; title=&quot;Introduce a future-based API for waiting for write concern&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-43135&quot;&gt;&lt;del&gt;SERVER-43135&lt;/del&gt;&lt;/a&gt;) to reduce contention on the &lt;tt&gt;ReplicationCoordinatorImpl&lt;/tt&gt; &lt;tt&gt;_mutex&lt;/tt&gt; by introducing a future-based API to relieve the &lt;a href=&quot;https://en.wikipedia.org/wiki/Thundering_herd_problem&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;thundering herd effect&lt;/a&gt; due to {w: majority} waiters waking up at the same time. As part of that ticket, we also sort waiters based on OpTime to avoid unnecessary computation in &lt;tt&gt;_wakeReadyWaiters_inlock&lt;/tt&gt; as you suggested. We have also done &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-43252&quot; title=&quot;Only compute WriteConcernResult.writtenTo for CmdGetLastError.&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-43252&quot;&gt;&lt;del&gt;SERVER-43252&lt;/del&gt;&lt;/a&gt;, &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-43307&quot; title=&quot;Avoid checking _checkIfWriteConcernCanBeSatisfied_inlock in ReplicationCoordinatorImpl::_doneWaitingForReplication_inlock&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-43307&quot;&gt;&lt;del&gt;SERVER-43307&lt;/del&gt;&lt;/a&gt; and &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-43769&quot; title=&quot;Only get the default write concern from ReplSetConfig if no write concern is specified&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-43769&quot;&gt;&lt;del&gt;SERVER-43769&lt;/del&gt;&lt;/a&gt; to shorten the critical path under the mutex.&lt;/p&gt;

&lt;p&gt;In a closed system, if a service other than the slowest one improves, it has only a marginal impact on throughput or mean response time. Thus, after the work listed above, we didn&apos;t see much improvement in the overall throughput in our tests for {w: majority} (default j: true) workloads. However, we did see a 20% - 40% improvement in {w: majority, j: false} workloads when running with 256 client threads or more. This is because {w: majority, j: false} workloads do not involve journaling, so replication mutex contention affects them the most. Section 7.3 on Page 118 in &lt;a href=&quot;https://www.cambridge.org/core/books/performance-modeling-and-design-of-computer-systems/743BEBB137B781EDBAFD807D8F7965DF&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;this book&lt;/a&gt; gives a similar example of what happens when a non-bottleneck part of a system improves.&lt;/p&gt;

&lt;p&gt;We already have proposals to optimize the way we journal client writes. For more details, see &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-43417&quot; title=&quot;Signal the flusher thread to flush instead of calling waitUntilDurable when waiting for {j:true}&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-43417&quot;&gt;&lt;del&gt;SERVER-43417&lt;/del&gt;&lt;/a&gt;. We have done a proof of concept for the optimization and we were able to see 20% - 70% gain for {w: majority} workloads with &amp;gt;= 128 client threads. We will consider it after the &#8220;Replicate Before Journaling&#8221; project (&lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-41392&quot; title=&quot;Modify the _oplogJournalThreadLoop() to no longer call waitUntilDurable() and instead update the oplogTruncateAfterPoint&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-41392&quot;&gt;&lt;del&gt;SERVER-41392&lt;/del&gt;&lt;/a&gt;).&lt;/p&gt;</comment>
                            <comment id="2187036" author="bartle" created="Thu, 21 Mar 2019 02:40:09 +0000"  >&lt;p&gt;Another thing that&apos;d be nice would be to expose lock metrics (lock held time, wait time, etc...) for these low-level mutexes, similar to what&apos;s exposed for MongoDB-style multi-granular locks.&lt;/p&gt;</comment>
                            <comment id="2186847" author="bartle" created="Wed, 20 Mar 2019 23:46:59 +0000"  >&lt;p&gt;Yeah, I agree this particular code hasn&apos;t really changed from 3.4 to master.&lt;/p&gt;</comment>
                            <comment id="2186846" author="bartle" created="Wed, 20 Mar 2019 23:46:08 +0000"  >&lt;p&gt;This was an insert load with 256 threads on 3.4.&#160; The inserts themselves were pretty small (~1k documents with a few fields), and were just single-document inserts.&lt;/p&gt;</comment>
                            <comment id="2186831" author="schwerin" created="Wed, 20 Mar 2019 23:25:44 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=bartle%40stripe.com&quot; class=&quot;user-hover&quot; rel=&quot;bartle@stripe.com&quot;&gt;bartle@stripe.com&lt;/a&gt;, in addition to the information you&apos;ve already supplied, I&apos;m curious to know approximately how many simultaneously executing client threads your workload uses. This will make it easier for us to compare it to our existing performance workloads, in case they need to be extended to cover this case. If you can share code for a representative workload, that would of course be valuable, but I don&apos;t think it&apos;s strictly required in this case. In any event, please watch this ticket to track the issue, rather than &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-31694&quot; title=&quot;17% throughput regression in insert workload&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-31694&quot;&gt;&lt;del&gt;SERVER-31694&lt;/del&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Oh, on which version of MongoDB did you perform this analysis? The core implementation of waking waiters hasn&apos;t changed much in the last 4 or 5 years, so I imagine the basic problem exists on all versions, but it may help to know.&lt;/p&gt;

&lt;p&gt;The ReplicationCoordinator is a bit of a kitchen sink of functionality today, and breaking it up into logical pieces is going to be an important part of making a maintainable system of tracking write concern satisfaction that scales to higher core counts efficiently. I&apos;m hesitant to endorse a solution with reader-writer locks, as the frequency of writes under the existing mutex is quite high, but longer term I imagine a finer-grained locking solution will be important. In the short term, restructuring the wake-up logic as you suggest might be workable.&lt;/p&gt;</comment>
                            <comment id="2186828" author="schwerin" created="Wed, 20 Mar 2019 23:24:11 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=bartle%40stripe.com&quot; class=&quot;user-hover&quot; rel=&quot;bartle@stripe.com&quot;&gt;bartle@stripe.com&lt;/a&gt;&apos;s &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-31694?focusedCommentId=2185867&amp;amp;page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-2185867&quot; class=&quot;external-link&quot; rel=&quot;nofollow&quot;&gt;comment&lt;/a&gt; on &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-31694&quot; title=&quot;17% throughput regression in insert workload&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-31694&quot;&gt;&lt;del&gt;SERVER-31694&lt;/del&gt;&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are there plans to improve performance of&#160;&lt;tt&gt;setMyLastAppliedOpTimeForward&lt;/tt&gt;?&#160; On a write-majority, insert-heavy workflow we basically see single-core contention on &lt;tt&gt;setMyLastAppliedOpTimeForward&lt;/tt&gt; (based on a CPU profile).&#160; That particular function takes an exclusive mutex, so it&apos;s unsurprising that if you push enough write-majority writes through you&apos;d contend on a single core (in practice we&apos;re hitting a bottleneck of 12k wps, on a 16-core machine, with ~7% CPU usage).&lt;/p&gt;

&lt;p&gt;Ultimately, all of the CPU ends up in &lt;tt&gt;_wakeReadyWaiters_inlock&lt;/tt&gt;.&#160; That particular implementation seems rather naive; it ends up recomputing a bunch of things (again, under a global, exclusive lock) for every replication waiter.&#160; Instead, it seems like you should structure this code such that it determines the largest optime that satisfies the various write concern modes (basically &quot;majority&quot; and w=&quot;N&quot;) once, and then passes that information down into &lt;tt&gt;_doneWaitingForReplication_inlock&lt;/tt&gt;.&lt;/p&gt;

&lt;p&gt;Beyond this, reading through the code, it&apos;s fairly concerning how coarse-grained &lt;tt&gt;_mutex&lt;/tt&gt; on &lt;tt&gt;ReplicationCoordinatorImpl&lt;/tt&gt; is.&#160; Is there a reason more work hasn&apos;t been invested in finer-grained locks, or even reader-writer locks?&#160; As-is, it&apos;s really difficult to make any perf improvements.&lt;/p&gt;&lt;/blockquote&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="915296">SERVER-43135</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10012">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="449967">SERVER-31694</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>8.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                <customfield id="customfield_12751" key="com.atlassian.jira.plugin.system.customfieldtypes:multiselect">
                        <customfieldname>Assigned Teams</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="25128"><![CDATA[Replication]]></customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_13552" key="com.go2group.jira.plugin.crm:crm_generic_field">
                        <customfieldname>Case</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[[5006R00001mff0lQAA]]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 20 Mar 2019 23:46:08 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        4 years, 18 weeks ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10857" key="com.pyxis.greenhopper.jira:gh-epic-link">
                        <customfieldname>Epic Link</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>PM-1456</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>louis.williams@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            4 years, 18 weeks ago
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10032" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Operating System</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10026"><![CDATA[ALL]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>schwerin@mongodb.com</customfieldvalue>
            <customfieldvalue>backlog-server-repl</customfieldvalue>
            <customfieldvalue>bartle</customfieldvalue>
            <customfieldvalue>lingzhi.deng@mongodb.com</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hure1z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hugzev:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hur0bb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>