<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 05:43:32 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-58081]  _flushReshardingStateChange from coordinator races with donor shard acquiring critical section, stalling the resharding operation</title>
                <link>https://jira.mongodb.org/browse/SERVER-58081</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;The _flushReshardingStateChanges command &lt;a href=&quot;https://github.com/mongodb/mongo/blob/29f6ac652246f05eec30e8d7838acbb3dd6c909f/src/mongo/db/s/resharding/resharding_coordinator_service.cpp#L1080&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;will stall the coordinator&lt;/a&gt; if the critical section is acquired by another thread after its &lt;a href=&quot;https://github.com/mongodb/mongo/blob/29f6ac652246f05eec30e8d7838acbb3dd6c909f/src/mongo/db/s/resharding/resharding_donor_recipient_common.cpp#L407&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;initial check&lt;/a&gt; to see if the critical section is held, since onShardVersionMismatch() blocks until the critical section is released.&lt;/p&gt;

&lt;p&gt;Shards during a resharding operation also&#160; &lt;a href=&quot;https://github.com/mongodb/mongo/blob/3befdc7d70fa56085bbdc9606da0db84b5b48ccd/src/mongo/db/s/resharding/resharding_donor_recipient_common.cpp#L340&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;rely on refreshShardVersion() to be triggered after a new primary has stepped up&lt;/a&gt; for the DonorStateMachine and RecipientStateMachines to learn of a change to the coordinator&apos;s state. &lt;/p&gt;

&lt;p&gt;The following events can cause a resharding operation to stall indefinitely waiting for _flushReshardingStateChanges to complete:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;A donor is killed before the coordinator transitions to kBlockingWrites&lt;/li&gt;
	&lt;li&gt;The coordinator transitions to kBlockingWrites before a new primary steps up on the donor&lt;/li&gt;
	&lt;li&gt;The coordinator &lt;a href=&quot;https://github.com/mongodb/mongo/blob/29f6ac652246f05eec30e8d7838acbb3dd6c909f/src/mongo/db/s/resharding/resharding_coordinator_service.cpp#L1080&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;tries to inform each DonorStateMachine&lt;/a&gt; that it is safe to acquire the critical section via the&#160; _flushReshardingStateChanges cmd&lt;/li&gt;
	&lt;li&gt;The new primary on the donor steps up, and both recovery and the _flushReshardingStateChanges cmd try to refresh the DonorStateMachine&lt;/li&gt;
	&lt;li&gt;The _flushReshardingStateChanges thread checks to &lt;a href=&quot;https://github.com/mongodb/mongo/blob/29f6ac652246f05eec30e8d7838acbb3dd6c909f/src/mongo/db/s/resharding/resharding_donor_recipient_common.cpp#L407&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;see if the critical section has been acquired&lt;/a&gt;, it hasn&apos;t yet, and calls onShardVersionMismatch()&lt;/li&gt;
	&lt;li&gt;The recovery thread also&#160; triggers onShardVersionMismatch(), beats the _flushReshardingStateChanges thread, and refreshes the DonorStateMachine which &lt;a href=&quot;https://github.com/mongodb/mongo/blob/29f6ac652246f05eec30e8d7838acbb3dd6c909f/src/mongo/db/s/resharding/resharding_donor_service.cpp#L603&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;then acquires the critical section&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;_flushReshardingStateChanges thread reaches &lt;a href=&quot;https://github.com/mongodb/mongo/blob/29f6ac652246f05eec30e8d7838acbb3dd6c909f/src/mongo/db/s/shard_filtering_metadata_refresh.cpp#L120&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;a second check to see if the critical section is engaged&lt;/a&gt;, it is (thanks to the recovery thread), and the _flushReshardingStateChanges thread is blocked until the DonorStateMachine releases the critical section&lt;/li&gt;
	&lt;li&gt;The DonorStateMachine can&apos;t release the critical section until the coordinator transitions to kCommitting/kAborting and the coordinator cannot make it past _tellAllDonorsToRefresh until the _flushReshardingStateChanges command completes.&lt;/li&gt;
&lt;/ul&gt;
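
&lt;p&gt;The circular wait in the last step above can be sketched as a small wait-for graph. This is only an illustrative model: the node labels are paraphrased from this description and find_cycle is a hypothetical helper, not code from the server.&lt;/p&gt;

```python
# Hypothetical model: each key waits on its value. The labels are taken
# from the ticket text, not identifiers from the MongoDB codebase.
WAITS_FOR = {
    "flush command completes": "donor releases critical section",
    "donor releases critical section": "coordinator reaches kCommitting or kAborting",
    "coordinator reaches kCommitting or kAborting": "flush command completes",
}


def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    seen = []
    node = start
    while node in waits_for:
        if node in seen:
            # We came back to a node already visited: this is a circular wait.
            return seen[seen.index(node):]
        seen.append(node)
        node = waits_for[node]
    # The chain ended at something that waits on nothing, so no cycle.
    return None


cycle = find_cycle(WAITS_FOR, "flush command completes")
print(cycle)
```

&lt;p&gt;Removing any one edge from the graph breaks the cycle, so the command no longer waits forever in this model.&lt;/p&gt;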
</description>
                <environment></environment>
        <key id="1797256">SERVER-58081</key>
            <summary> _flushReshardingStateChange from coordinator races with donor shard acquiring critical section, stalling the resharding operation</summary>
                <type id="1" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14703&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="13201">Fixed</resolution>
                                        <assignee username="haley.connelly@mongodb.com">Haley Connelly</assignee>
                                    <reporter username="haley.connelly@mongodb.com">Haley Connelly</reporter>
                        <labels>
                            <label>PM-234-M3</label>
                            <label>PM-234-T-lifecycle</label>
                    </labels>
                <created>Thu, 24 Jun 2021 22:36:51 +0000</created>
                <updated>Sun, 29 Oct 2023 21:51:38 +0000</updated>
                            <resolved>Wed, 28 Jul 2021 17:07:45 +0000</resolved>
                                                    <fixVersion>5.0.3</fixVersion>
                    <fixVersion>5.1.0-rc0</fixVersion>
                                    <component>Sharding</component>
                                        <votes>0</votes>
                                    <watches>1</watches>
                                                                                                                <comments>
                            <comment id="4107570" author="JIRAUSER1259052" created="Wed, 6 Oct 2021 18:30:12 +0000"  >&lt;p&gt;Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it&#8217;s been triggered. For more active release information, please keep an eye on #server-release. Thank you!&lt;/p&gt;</comment>
                            <comment id="4010816" author="xgen-internal-githook" created="Thu, 19 Aug 2021 19:18:26 +0000"  >&lt;p&gt;Author:&lt;/p&gt;
{&apos;name&apos;: &apos;Haley Connelly&apos;, &apos;email&apos;: &apos;haley.connelly@mongodb.com&apos;, &apos;username&apos;: &apos;haleyConnelly&apos;}
&lt;p&gt;Message: &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-58081&quot; title=&quot; _flushReshardingStateChange from coordinator races with donor shard acquiring critical section, stalling the resharding operation&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-58081&quot;&gt;&lt;del&gt;SERVER-58081&lt;/del&gt;&lt;/a&gt; Make _flushReshardingStateChange return instead of blocking if the critical section is held&lt;/p&gt;

&lt;p&gt;(cherry picked from commit 2ca1f733d619809d1e712860fc0070f0cc8d81f5)&lt;br/&gt;
Branch: v5.0&lt;br/&gt;
&lt;a href=&quot;https://github.com/mongodb/mongo/commit/98937ba21a64a127d6238d641fb676bdef797cf4&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/mongodb/mongo/commit/98937ba21a64a127d6238d641fb676bdef797cf4&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="3968609" author="xgen-internal-githook" created="Wed, 28 Jul 2021 17:06:07 +0000"  >&lt;p&gt;Author:&lt;/p&gt;
{&apos;name&apos;: &apos;Haley Connelly&apos;, &apos;email&apos;: &apos;haley.connelly@mongodb.com&apos;, &apos;username&apos;: &apos;haleyConnelly&apos;}
&lt;p&gt;Message: &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-58081&quot; title=&quot; _flushReshardingStateChange from coordinator races with donor shard acquiring critical section, stalling the resharding operation&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-58081&quot;&gt;&lt;del&gt;SERVER-58081&lt;/del&gt;&lt;/a&gt; Make _flushReshardingStateChange return instead of blocking if the critical section is held&lt;br/&gt;
Branch: master&lt;br/&gt;
&lt;a href=&quot;https://github.com/mongodb/mongo/commit/2ca1f733d619809d1e712860fc0070f0cc8d81f5&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/mongodb/mongo/commit/2ca1f733d619809d1e712860fc0070f0cc8d81f5&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="3919482" author="max.hirschhorn@10gen.com" created="Wed, 7 Jul 2021 00:06:11 +0000"  >&lt;p&gt;Checking whether the critical section is currently held is race prone. As outlined in the sequence of events in the ticket&apos;s description, it is possible for the thread executing DonorStateMachine::run() to be about to acquire the critical section. I feel like there are two possible approaches but only one of them is considered viable:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;(a) Make it possible for the RecoverRefreshThread to refresh the shard version while the critical section is held. This would make it safe for a shard to receive the &amp;#95;flushReshardingStateChange command twice even after processing its effects once before. However, having the RecoverRefreshThread wait on the critical section being released is intentional to avoid mongos exhausting its StaleConfig exception retries before a chunk migration commits. We would need to change commands to wait on the critical section being released themselves instead of indirectly waiting through the shard version refresh not having completed yet.&lt;/li&gt;
	&lt;li&gt;(b) Change the &amp;#95;flushReshardingStateChange command so it asynchronously schedules a shard version refresh and doesn&apos;t require the resharding coordinator to wait for the shard version refresh to complete.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;My proposal would be to implement option (b) by changing the &amp;#95;flushReshardingStateChange command to the following:&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;Call onShardVersionMismatch() in a task scheduled on an arbitrary executor pool. The usage of the arbitrary executor pool is intentional to avoid exhausting the threads available in the fixed executor. Note that this thread in the arbitrary executor pool will still block until the critical section is released.&lt;/li&gt;
	&lt;li&gt;Wait for the donor and/or recipient state documents to have been inserted locally. This would be done by exposing a new SharedSemiFuture&amp;lt;void&amp;gt; on DonorStateMachine and RecipientStateMachine. These futures would be immediately fulfilled when DonorStateMachine and RecipientStateMachine recover on step&amp;#45;up.
	&lt;ul&gt;
		&lt;li&gt;&lt;a href=&quot;https://github.com/mongodb/mongo/blob/e1f9a3922d20886691e4c22c4fa63ba81a89b3d7/src/mongo/db/s/resharding/resharding_donor_recipient_common.cpp#L68-L71&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;The donor and recipient state documents being inserted is the only side effect of the shard version refresh that the resharding coordinator must wait for&lt;/a&gt;.&lt;/li&gt;
	&lt;/ul&gt;
	&lt;/li&gt;
	&lt;li&gt;Insert a no-op oplog entry. In combination with waiting for majority write concern, this ensures that the resharding coordinator cannot have run the &amp;#95;flushReshardingStateChange command on a stale primary.&lt;/li&gt;
	&lt;li&gt;Wait for majority write concern. Note that this is automatic from the w:majority write concern that the ReshardingCoordinator attaches to the &amp;#95;flushReshardingStateChange command already.&lt;/li&gt;
&lt;/ol&gt;
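
&lt;p&gt;A minimal sketch of steps 1 and 2 above, written in Python rather than the server&apos;s C++ and using stand-ins throughout: refresh_shard_version models onShardVersionMismatch(), state_doc_inserted models the proposed SharedSemiFuture, and flush_resharding_state_change is a hypothetical handler name. Steps 3 and 4 are not modeled.&lt;/p&gt;

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

# Stand-in for the donor's critical section; set() means it was released.
critical_section_released = threading.Event()

# Stand-in for the proposed SharedSemiFuture on the state machine.
state_doc_inserted = Future()
state_doc_inserted.set_result(None)  # already fulfilled, as after step-up recovery


def refresh_shard_version():
    # Stand-in for onShardVersionMismatch(): blocks while the section is held.
    critical_section_released.wait()
    return "refreshed"


def flush_resharding_state_change(pool):
    # Step 1: schedule the refresh on a separate pool and do not wait for it.
    refresh = pool.submit(refresh_shard_version)
    # Step 2: wait only for the state document to exist locally.
    state_doc_inserted.result(timeout=5)
    return refresh


with ThreadPoolExecutor(max_workers=1) as pool:
    pending = flush_resharding_state_change(pool)
    # The handler returned even though the critical section is still held.
    returned_while_held = not pending.done()
    # Later, the donor releases the critical section and the refresh finishes.
    critical_section_released.set()
    refresh_result = pending.result(timeout=5)

print(returned_while_held, refresh_result)
```

&lt;p&gt;The point of the sketch is that the handler waits on the state document future only, so the blocked refresh no longer holds up the caller.&lt;/p&gt;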


&lt;p&gt;Note that the CatalogCacheLoader::waitForCollectionFlush() line which was copied from &amp;#95;flushRoutingTableCacheUpdatesWithWriteConcern isn&apos;t necessary for the &amp;#95;flushReshardingStateChange command. The only dependency on the config.cache.chunks collection having been written locally is for the temporary resharding collection on the donor shards and &lt;a href=&quot;https://github.com/mongodb/mongo/blob/e1f9a3922d20886691e4c22c4fa63ba81a89b3d7/src/mongo/db/s/resharding/resharding_donor_service.cpp#L547&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;is handled by DonorStateMachine itself already&lt;/a&gt;.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10420">
                    <name>Backports</name>
                                            <outwardlinks description="backported by">
                                                        </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10011">
                    <name>Depends</name>
                                                                <inwardlinks description="is depended on by">
                                        <issuelink>
            <issuekey id="1811271">SERVER-58343</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18555" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname># of Sprints</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>3.0</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_12450" key="com.atlassian.jira.plugin.system.customfieldtypes:multicheckboxes">
                        <customfieldname>Backport Requested</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="21777"><![CDATA[v5.0]]></customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10011" key="com.atlassian.jira.plugin.system.customfieldtypes:radiobuttons">
                        <customfieldname>Backwards Compatibility</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10038"><![CDATA[Fully Compatible]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 7 Jul 2021 00:06:11 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        2 years, 18 weeks ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_17050" key="com.atlassian.jira.plugin.system.customfieldtypes:radiobuttons">
                        <customfieldname>Downstream Team Attention</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="16941"><![CDATA[Not Needed]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10857" key="com.pyxis.greenhopper.jira:gh-epic-link">
                        <customfieldname>Epic Link</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>PM-234</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>luke.bonanomi@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            2 years, 18 weeks ago
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>xgen-internal-githook</customfieldvalue>
            <customfieldvalue>haley.connelly@mongodb.com</customfieldvalue>
            <customfieldvalue>max.hirschhorn@mongodb.com</customfieldvalue>
            <customfieldvalue>vivian.ge@mongodb.com</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzol53:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hz91on:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_23361" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Requested By</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10557" key="com.pyxis.greenhopper.jira:gh-sprint">
                        <customfieldname>Sprint</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue id="4521">Sharding 2021-07-12</customfieldvalue>
    <customfieldvalue id="4522">Sharding 2021-07-26</customfieldvalue>
    <customfieldvalue id="5218">Sharding 2021-08-09</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10555" key="com.atlassian.jira.plugin.system.customfieldtypes:float">
                        <customfieldname>Story Points</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hzo7e7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>