<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 05:50:01 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-60521] Deadlock on stepup due to moveChunk command running uninterrupted on secondary</title>
                <link>https://jira.mongodb.org/browse/SERVER-60521</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;Consider a shard that was running a moveChunk and had already persisted the migration recovery document. It then steps down, so the new primary will need to recover the migration.&lt;br/&gt;
In parallel, on that same node, another moveChunk arrived while it was still primary, but had not yet executed past &lt;a href=&quot;https://github.com/mongodb/mongo/blob/b348fc023c809d9594b37d12a0640f3bdb6efe20/src/mongo/db/s/move_chunk_command.cpp#L136&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;this&lt;/a&gt;. Now the stepdown completes and this second moveChunk continues and is able to &lt;a href=&quot;https://github.com/mongodb/mongo/blob/b348fc023c809d9594b37d12a0640f3bdb6efe20/src/mongo/db/s/move_chunk_command.cpp#L137-L138&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;register the migration&lt;/a&gt; (since the first migration already unregistered from the ActiveMigrationRegistry). A new ThreadClient will be created and it will be &lt;a href=&quot;https://github.com/mongodb/mongo/blob/b348fc023c809d9594b37d12a0640f3bdb6efe20/src/mongo/db/s/move_chunk_command.cpp#L152-L156&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;marked as killable on stepdown&lt;/a&gt;. However, since the node has already transitioned to secondary, it won&apos;t actually get killed.&lt;/p&gt;

&lt;p&gt;Consider the following interleaving:&lt;br/&gt;
1. A shard was running a moveChunk and had already persisted the migration recovery document. It then steps down, so the new primary will need to recover the migration.&lt;br/&gt;
2. In parallel, on that same node, another moveChunk arrived while it was still primary, but had not yet executed past &lt;a href=&quot;https://github.com/mongodb/mongo/blob/b348fc023c809d9594b37d12a0640f3bdb6efe20/src/mongo/db/s/move_chunk_command.cpp#L136&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;this&lt;/a&gt;.&lt;br/&gt;
3. The stepdown completes.&lt;br/&gt;
4. The second moveChunk continues and is able to &lt;a href=&quot;https://github.com/mongodb/mongo/blob/b348fc023c809d9594b37d12a0640f3bdb6efe20/src/mongo/db/s/move_chunk_command.cpp#L137-L138&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;register the migration&lt;/a&gt; (since the first migration already unregistered from the ActiveMigrationRegistry). A new ThreadClient will be created and it will be &lt;a href=&quot;https://github.com/mongodb/mongo/blob/b348fc023c809d9594b37d12a0640f3bdb6efe20/src/mongo/db/s/move_chunk_command.cpp#L152-L156&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;marked as killable on stepdown&lt;/a&gt;. However, since the node has already transitioned to secondary, it won&apos;t actually get killed.&lt;br/&gt;
5. The old primary that just stepped down wins the election and becomes primary again.&lt;br/&gt;
6. During stepup, the primary will see that there was a migration ongoing (the one started in (1)), so it will attempt to recover it. To do so, it needs to &lt;a href=&quot;https://github.com/mongodb/mongo/blob/b348fc023c809d9594b37d12a0640f3bdb6efe20/src/mongo/db/s/migration_util.cpp#L928-L932&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;acquire the MigrationBlockingGuard&lt;/a&gt; while still in drain mode. However, since the migration started in (2) managed to register on the ActiveMigrationRegistry, the MigrationBlockingGuard cannot be acquired and the stepup waits.&lt;br/&gt;
7. On the other side, the migration started in (2) is not able to make progress because the stepup holds the global lock, so it will never be able to unregister from the ActiveMigrationRegistry.&lt;/p&gt;

&lt;p&gt;To fix this we should make sure that moveChunk cannot run uninterrupted on a secondary.&lt;/p&gt;</description>
                <environment></environment>
        <key id="1892634">SERVER-60521</key>
            <summary>Deadlock on stepup due to moveChunk command running uninterrupted on secondary</summary>
                <type id="1" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14703&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="13203">Gone away</resolution>
                                        <assignee username="sergi.mateo-bellido@mongodb.com">Sergi Mateo Bellido</assignee>
                                    <reporter username="jordi.serra-torrens@mongodb.com">Jordi Serra Torrens</reporter>
                        <labels>
                            <label>sharding-wfbf-sprint</label>
                            <label>shardingemea-qw</label>
                    </labels>
                <created>Thu, 7 Oct 2021 14:44:31 +0000</created>
                <updated>Fri, 27 Oct 2023 20:45:54 +0000</updated>
                            <resolved>Wed, 16 Feb 2022 11:53:24 +0000</resolved>
                                    <version>4.4.0</version>
                    <version>5.0.0</version>
                    <version>5.1.0-rc0</version>
                                                    <component>Sharding</component>
                                        <votes>0</votes>
                                    <watches>7</watches>
                                                                                                                <comments>
                            <comment id="4358300" author="JIRAUSER1256927" created="Wed, 16 Feb 2022 11:53:24 +0000"  >&lt;p&gt;We recently made several fixes to moveChunk that remove this deadlock.&lt;/p&gt;

&lt;p&gt;With &lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=jordi.serra-torrens&quot; class=&quot;user-hover&quot; rel=&quot;jordi.serra-torrens&quot;&gt;jordi.serra-torrens&lt;/a&gt;&#160;we analyzed what would happen in that scenario and everything seemed ok.&lt;/p&gt;</comment>
                            <comment id="4356320" author="JIRAUSER1256927" created="Tue, 15 Feb 2022 16:14:45 +0000"  >&lt;p&gt;The deadlock described in this ticket cannot happen anymore since we don&apos;t acquire the MigrationBlockingGuard as part of &lt;tt&gt;resumeMigrationCoordinationsOnStepUp&lt;/tt&gt; (&lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-62245&quot; title=&quot;MigrationRecovery must not assume that only one migration needs to be recovered&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-62245&quot;&gt;&lt;del&gt;SERVER-62245&lt;/del&gt;&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Before analyzing what would happen on master, I would like to mention two relevant tasks that we implemented recently:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;&lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-62296&quot; title=&quot;MoveChunk should recover any unfinished migration before starting a new one&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-62296&quot;&gt;&lt;del&gt;SERVER-62296&lt;/del&gt;&lt;/a&gt;: MoveChunk should recover any unfinished migration before starting a new one.&lt;/li&gt;
	&lt;li&gt;SERVER-62704: Marking the moveChunk operation killable on step-down/step-up.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;About the original problem (no changes from 1 to 5):&lt;br/&gt;
 1. A shard was running a moveChunk and had already persisted the migration recovery document. It then steps down, so the new primary will need to recover the migration.&lt;br/&gt;
 2. In parallel, on that same node, another moveChunk arrived while it was still primary, but had not yet executed past &lt;a href=&quot;https://github.com/mongodb/mongo/blob/b348fc023c809d9594b37d12a0640f3bdb6efe20/src/mongo/db/s/move_chunk_command.cpp#L136&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;this&lt;/a&gt;.&lt;br/&gt;
 3. The stepdown completes.&lt;br/&gt;
 4. The second moveChunk continues and is able to &lt;a href=&quot;https://github.com/mongodb/mongo/blob/b348fc023c809d9594b37d12a0640f3bdb6efe20/src/mongo/db/s/move_chunk_command.cpp#L137-L138&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;register the migration&lt;/a&gt; (since the first migration already unregistered from the ActiveMigrationRegistry). A new ThreadClient will be created and it will be &lt;a href=&quot;https://github.com/mongodb/mongo/blob/b348fc023c809d9594b37d12a0640f3bdb6efe20/src/mongo/db/s/move_chunk_command.cpp#L152-L156&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;marked as killable on stepdown&lt;/a&gt;. However, since the node has already transitioned to secondary, it won&apos;t actually get killed.&lt;br/&gt;
 5. The old primary that just stepped down wins the election and becomes primary again.&lt;br/&gt;
 ---- NEW STUFF ----&lt;br/&gt;
 6. If the second moveChunk had already &lt;a href=&quot;https://github.com/10gen/mongo/blob/eeff4a62ad0702abfe3d599e16696baefc6c8cec/src/mongo/db/s/move_chunk_command.cpp#L135-L142&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;acquired the global lock in IX mode&lt;/a&gt;, the whole operation would be killed as part of the step-up. Otherwise, the second moveChunk would block until the step-up is completed and the global lock is released.&lt;br/&gt;
 7. During stepup, the primary will see that there was a migration ongoing (the one started in (1)), so it will attempt to recover it. It is not a problem that the second moveChunk might be alive holding the &lt;tt&gt;ActiveMigrationRegistry&lt;/tt&gt; since the &lt;tt&gt;resumeMigrationCoordinationsOnStepUp&lt;/tt&gt; doesn&apos;t acquire the &lt;tt&gt;MigrationBlockingGuard&lt;/tt&gt; anymore.&lt;br/&gt;
 8. Once the stepup is completed, if the second moveChunk wasn&apos;t killed, it will acquire the global lock in IX mode and will be executed as if it had just arrived at the shard.&lt;/p&gt;</comment>
                            <comment id="4197296" author="kaloian.manassiev" created="Thu, 18 Nov 2021 12:40:59 +0000"  >&lt;p&gt;This is still a problem, but very unlikely and is not causing noise in our testing. Fix requires an iteration so putting it under the WFBF sprint category.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10012">
                    <name>Related</name>
                                            <outwardlinks description="related to">
                                        <issuelink>
            <issuekey id="1955958">SERVER-62245</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="1881402">SERVER-60161</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="1956917">SERVER-62296</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="2148681">SERVER-70127</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                            <attachment id="339336" name="0001-SERVER-60521-repro.patch" size="5347" author="jordi.serra-torrens@mongodb.com" created="Thu, 7 Oct 2021 14:46:27 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>3.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18555" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname># of Sprints</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2.0</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Thu, 7 Oct 2021 15:00:17 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        1 year, 51 weeks ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>luke.bonanomi@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            1 year, 51 weeks ago
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_16465" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Linked BF Score</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>0.0</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10032" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Operating System</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10026"><![CDATA[ALL]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>jordi.serra-torrens@mongodb.com</customfieldvalue>
            <customfieldvalue>kaloian.manassiev@mongodb.com</customfieldvalue>
            <customfieldvalue>sergi.mateo-bellido@mongodb.com</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i04qsf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hzoh2f:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_23361" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Requested By</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10557" key="com.pyxis.greenhopper.jira:gh-sprint">
                        <customfieldname>Sprint</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue id="5425">Sharding EMEA 2021-10-18</customfieldvalue>
    <customfieldvalue id="5749">Sharding EMEA 2022-02-21</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10750" key="com.atlassian.jira.plugin.system.customfieldtypes:textarea">
                        <customfieldname>Steps To Reproduce</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>&lt;p&gt; &lt;span class=&quot;nobr&quot;&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/attachment/339336/339336_0001-SERVER-60521-repro.patch&quot; title=&quot;0001-SERVER-60521-repro.patch attached to SERVER-60521&quot;&gt;0001-SERVER-60521-repro.patch&lt;sup&gt;&lt;img class=&quot;rendericon&quot; src=&quot;https://jira.mongodb.org/images/icons/link_attachment_7.gif&quot; height=&quot;7&quot; width=&quot;7&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/sup&gt;&lt;/a&gt;&lt;/span&gt; &lt;/p&gt;</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|i04cxr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>