<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 03:33:54 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-14117] moveChunk should attempt to retry write errors during chunk cleanup</title>
                <link>https://jira.mongodb.org/browse/SERVER-14117</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;Currently, mongos returns complete as soon as a chunk move hits an error in phase 6. It should retry based on a config.settings.moveRetries=3 setting.&lt;/p&gt;

&lt;p&gt;The default would be 0 to preserve the previous behavior; however, retrying is very helpful for avoiding orphans to begin with. I am aware we have a new function to clean them up, but you can still have logical DB corruption in the meantime.&lt;/p&gt;

</description>
                <environment></environment>
        <key id="139437">SERVER-14117</key>
            <summary>moveChunk should attempt to retry write errors during chunk cleanup</summary>
                <type id="4" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14710&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="greg_10gen">Greg Studer</assignee>
                                    <reporter username="david.b.murphy.tx@gmail.com">David Murphy</reporter>
                        <labels>
                    </labels>
                <created>Sat, 31 May 2014 02:25:35 +0000</created>
                <updated>Wed, 10 Dec 2014 23:06:07 +0000</updated>
                            <resolved>Tue, 22 Jul 2014 21:11:58 +0000</resolved>
                                                                    <component>Sharding</component>
                                        <votes>0</votes>
                                    <watches>6</watches>
                                                                                                                <comments>
                            <comment id="668885" author="greg_10gen" created="Wed, 23 Jul 2014 15:50:27 +0000"  >&lt;p&gt;&amp;gt; If we wait for #2 we are purposefully leaving the system with data that ChunkManager unsafe commands like count will see and thus return the wrong data.&lt;/p&gt;

&lt;p&gt;This is actually a separate but related issue (&lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-3645&quot; title=&quot;Sharded collection counts (on primary) can report too many results&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-3645&quot;&gt;&lt;del&gt;SERVER-3645&lt;/del&gt;&lt;/a&gt;, though you probably are watching that as well) - commands on the primary like count should not see this data.  It&apos;s currently hard to fix, however, only because there are performance implications for non-shard-key counts.  By design, during migrations there will always be data outside the chunk ranges - commands must be equipped to deal with this (most are right now).&lt;/p&gt;

&lt;p&gt;&amp;gt; as who is the primary on a given shard is not actually important&lt;/p&gt;

&lt;p&gt;I think there&apos;s a misunderstanding here - the chunk cleanup (and all stages of migration) are driven by the primary host of the FROM shard.  Mongos just passes along moveChunk to the shard, and receives &quot;ok&quot; when the logical migration is finished.  The cleanup may not have happened yet, since that&apos;s a heuristic enforced by mongod, and there&apos;s nothing it knows to retry.  Failures during the migration itself mongos often does retry (sometimes indefinitely) if the migration is driven by the balancer, because balancing is deterministic per-collection.&lt;/p&gt;

&lt;p&gt;N retries would require new state to track &quot;attempted cleanups&quot; on mongod hosts and synchronization with replication and lazy metadata load - at that point you&apos;re designing a &quot;background cleanup process&quot; with a prioritized queue (and we basically have this with RangeDeleter, though it needs some love, if you&apos;d like to look).&lt;/p&gt;</comment>
                            <comment id="668057" author="dmurphy" created="Tue, 22 Jul 2014 21:34:39 +0000"  >&lt;p&gt;True enough; however, it would still be best to retry more than a single time. For example, an operation being killed, a network glitch, or an election that re-elects the same primary would all be situations where it could retry and avoid an orphan.&lt;/p&gt;

&lt;p&gt;The point here is to make a best effort to confirm that it really is unable to do the delete before giving up. In fact, retrying the delete (as who is the primary on a given shard is not actually important) would be the best case, as it would ensure that the cleanup phase was smart enough to persist.&lt;/p&gt;


&lt;p&gt;I think there are 2 sides to this issue:&lt;/p&gt;

&lt;p&gt;1) Make a reasonable effort to prevent the need for more cleanup and/or orphan removal (1 or 2 retries)&lt;br/&gt;
2) Have tooling to regularly check back and remove failed cleanups/orphans&lt;/p&gt;

&lt;p&gt;With the orphan cleanup command being the last-ditch effort. If we wait for #2 we are purposefully leaving the system with data that ChunkManager unsafe commands like count will see and thus return the wrong data.&lt;/p&gt;

&lt;p&gt;A quick retry loop with a config.settings option seems like very little effort to combat a very real issue that plagues all versions today, with only 2.6 having the start of a solution. Also, a retry does not change anything fundamentally the way a new system would. This means it would be easier to implement on all versions moving forward until such time as 6210 can be implemented. This would make our customers feel mongo is more stable, rather than question its stability when basic constructs like count seem unstable.&lt;/p&gt;

&lt;p&gt;I don&apos;t disagree that a sweeper system like the one 6210 references would be good, just that it&apos;s not a complete solution but a repair mechanism for an avoidable issue.&lt;/p&gt;

&lt;p&gt;Thanks&lt;br/&gt;
David &lt;/p&gt;</comment>
                            <comment id="668027" author="greg_10gen" created="Tue, 22 Jul 2014 21:11:20 +0000"  >&lt;p&gt;Got it - but it seems like you&apos;re actually describing the more general problem &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-6210&quot; title=&quot;Clean up data left behind on shards by failed migrations and failed migration cleanups&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-6210&quot;&gt;&lt;del&gt;SERVER-6210&lt;/del&gt;&lt;/a&gt;, which tracks cleaning up after migration failures generally - the migration cleanup code is single-node and doesn&apos;t use the network.  I clarified the title to &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-6210&quot; title=&quot;Clean up data left behind on shards by failed migrations and failed migration cleanups&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-6210&quot;&gt;&lt;del&gt;SERVER-6210&lt;/del&gt;&lt;/a&gt; a bit.&lt;/p&gt;

&lt;p&gt;A retry setting wouldn&apos;t necessarily help - in particular, on stepdown it is incorrect and impossible to retry on the now-secondary node.  Additional chunk state is needed, or a continually running background process monitoring the unowned ranges on the primary.&lt;/p&gt;

&lt;p&gt;EDIT: Also just wanted to clarify that migrations are operations from mongod -&amp;gt; mongod, and are not orchestrated by mongos (though mongos may initiate them).  Cleanup always happens after mongod reports success in v2.6 (and in earlier versions if there are any active cursors).&lt;/p&gt;
</comment>
                            <comment id="661950" author="dmurphy" created="Thu, 17 Jul 2014 15:29:03 +0000"  >&lt;p&gt;Greg &lt;/p&gt;

&lt;p&gt;There are many cases, from a network glitch, to a multi-phase delete timeout, to a stepDown/election occurring.&lt;/p&gt;

&lt;p&gt;All of these cases will present an error on the delete, after which the moveChunk function just returns true and makes no attempt to try a second time.&lt;/p&gt;

&lt;p&gt;The best case would be something where config.settings.cleanupAttempts defaulted to, say, 2 or 3. We could even leave it at a default of 0 for 2.4/2.6 but make it a setting that someone could choose to change to make orphans less likely to be created.&lt;/p&gt;

&lt;p&gt;This is the other side of the orphan question: the cleanup script can remove them, but we should make a best effort to avoid their creation, as they will confuse things until the cleanup command is run.&lt;/p&gt;

&lt;p&gt;David &lt;/p&gt;</comment>
                            <comment id="629511" author="greg_10gen" created="Fri, 20 Jun 2014 20:09:37 +0000"  >&lt;p&gt;I&apos;m not 100% sure I understand what a &quot;sinkhole delete error&quot; is - is this issue a request to continue migration cleanup even after replica set changes?&lt;/p&gt;</comment>
                            <comment id="603496" author="dmurphy" created="Sat, 31 May 2014 12:58:18 +0000"  >&lt;p&gt;It should retry the delete to avoid creating an orphan. On a sinkhole delete error it fails, which means a stepdown will cause orphans. It could retry a couple of times and then give up, putting in a decent effort to reduce this chance.&lt;/p&gt;


</comment>
                            <comment id="603376" author="asya" created="Sat, 31 May 2014 07:32:04 +0000"  >&lt;p&gt;Phase 6 is the cleanup - this is after the chunk has actually been moved and committed. Can you clarify, in terms of cluster state rather than step numbers, when this would kick in?&lt;/p&gt;

&lt;p&gt;What do you envision for moveRetry? The move has already been completed at this point, so all that would be left is to clean up orphaned documents.&lt;/p&gt;

</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="42490">SERVER-6210</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>7.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Sat, 31 May 2014 07:32:04 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        9 years, 30 weeks ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>ramon.fernandez@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            9 years, 30 weeks ago
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>asya.kamsky@mongodb.com</customfieldvalue>
            <customfieldvalue>david.b.murphy.tx@gmail.com</customfieldvalue>
            <customfieldvalue>greg_10gen</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hrlu13:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hrzatz:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>120228</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_23361" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Requested By</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hsgvpr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>