<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 03:33:32 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-13995] on stepdown/election, SECONDARY should consume all available oplog</title>
                <link>https://jira.mongodb.org/browse/SERVER-13995</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;If the user performs a stepDown on an otherwise healthy PRIMARY and the SECONDARY has N seconds of lag, then that N seconds of activity is lost.  If the PRIMARY is otherwise OK, then the SECONDARY should try to consume all possible oplog while the election takes place (no activity to cluster during this time).&lt;/p&gt;

&lt;p&gt;This allows users to call stepDown on busy sets that may always have lag and ensure they aren&apos;t dropping data on the floor when they don&apos;t need to.&lt;/p&gt;</description>
                <environment></environment>
        <key id="137236">SERVER-13995</key>
            <summary>on stepdown/election, SECONDARY should consume all available oplog</summary>
                <type id="1" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14703&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="milkie@mongodb.com">Eric Milkie</assignee>
                                    <reporter username="kennygorman">Kenny Gorman</reporter>
                        <labels>
                    </labels>
                <created>Mon, 19 May 2014 22:41:55 +0000</created>
                <updated>Tue, 17 Feb 2015 20:22:44 +0000</updated>
                            <resolved>Tue, 17 Feb 2015 20:22:44 +0000</resolved>
                                                                    <component>Replication</component>
                                        <votes>8</votes>
                                    <watches>21</watches>
                                                                                                                <comments>
                            <comment id="830912" author="milkie" created="Tue, 17 Feb 2015 19:49:21 +0000"  >&lt;p&gt;The replSetStepDown command now takes a timeout period to allow an admin to avoid rollbacks.&lt;/p&gt;</comment>
                            <comment id="828152" author="schwerin" created="Thu, 12 Feb 2015 19:56:30 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=kennygorman&quot; class=&quot;user-hover&quot; rel=&quot;kennygorman&quot;&gt;kennygorman&lt;/a&gt;, in 3.0 we&apos;ve changed the stepDown behavior as part of &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-15861&quot; title=&quot;Add argument to replSetStepDown to allow users to specify how long to wait for secondaries to catch up&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-15861&quot;&gt;&lt;del&gt;SERVER-15861&lt;/del&gt;&lt;/a&gt;.  I just noticed that the description of that ticket is underwhelming, but the &lt;a href=&quot;http://docs.mongodb.org/master/reference/command/replSetStepDown/&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;documentation&lt;/a&gt; describes it.  Briefly, the primary blocks new writes and waits a user-controllable period of time (default 10 seconds) for any electable secondary to catch up before stepping down.  If you want the primary to step down even if no electable node is caught up after the catch-up period, you can use the force: true argument.  If you want to force the primary to step down without waiting for any electable secondary to catch up, you must now specify both force: true and secondaryCatchUpPeriodSecs: 0.&lt;/p&gt;

&lt;p&gt;I&apos;m tempted to resolve this issue as fixed by &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-15861&quot; title=&quot;Add argument to replSetStepDown to allow users to specify how long to wait for secondaries to catch up&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-15861&quot;&gt;&lt;del&gt;SERVER-15861&lt;/del&gt;&lt;/a&gt;.  Do you agree?  You can test it out in the 3.0 release candidates.&lt;/p&gt;</comment>
                            <comment id="670793" author="kennygorman" created="Thu, 24 Jul 2014 23:03:45 +0000"  >&lt;p&gt;Any more thoughts about implementing a fix for this condition?&lt;/p&gt;</comment>
                            <comment id="592055" author="charity@parse.com" created="Wed, 21 May 2014 19:09:59 +0000"  >&lt;p&gt;Something very similar bit us recently.  We had two secondaries doing foreground indexes, one ~3 hours behind but priority 0 and the other ~6 hours behind.  Heartbeat flapping forced an election and it rolled back to the secondary that was ~6 hours behind.  The old primary entered ROLLBACK state (but couldn&apos;t roll back because it had more than 300 MB of ops) and the other secondary entered FATAL state because it was ahead of the new primary.  This was pretty terrible.  Not sure what the correct solution is here, maybe back off forcing an election when heartbeats are flapping, or when the secondary is very far behind?&lt;/p&gt;

&lt;p&gt;Replaying the ops on the secondary when it is not &lt;b&gt;too&lt;/b&gt; far behind seems like possibly a good idea, not sure what other terrible failure scenarios this could cause.&lt;/p&gt;</comment>
                            <comment id="591749" author="dmurphy" created="Wed, 21 May 2014 15:47:58 +0000"  >&lt;p&gt;Hi Andy,&lt;/p&gt;

&lt;p&gt;I agree; however, in one case we had:&lt;/p&gt;


&lt;p&gt;Primary was healthy&lt;br/&gt;
Secondary1 was catching up (3 hours behind)&lt;br/&gt;
Secondary2 was having a secondary index build (as it&apos;s not on 2.6 and can&apos;t use the new background index system)&lt;/p&gt;

&lt;p&gt;A heartbeat issue occurred, causing an election. The old primary was not selected because of this error, and the secondary that was behind was. This resulted in 3 hours of w:1 type data being removed, which should not have been the case. If majority was used, you are correct and this could not happen; however, I think w:1 is still a very common level to be on and we should protect it also.&lt;/p&gt;

&lt;p&gt;Additionally, there is a case where, if you have 1 secondary catching up and something happens to another around the time you do a stepDown, only 1 viable candidate is found other than the old primary. In this case I think the position is that we should re-elect the old primary, as the secondary is lagged and would do a rollback, even though the other machine that went down was more up to date.&lt;/p&gt;

&lt;p&gt;My thought is that in today&apos;s model, with no change, we should have a situation where an election holds for 5 minutes (like in the veto case of a primary sometimes), waiting for the 3rd secondary, which is more up to date, to come back online and catch up, or for the timeout to hit so the old primary can be elected; the &quot;lagged&quot; secondary should not be considered valid.&lt;/p&gt;

&lt;p&gt;I would ask which is more dangerous: blocking 5 minutes of w:0, or deleting 3 hours of data for both w:0 and w:1?&lt;/p&gt;
</comment>
                            <comment id="591730" author="schwerin" created="Wed, 21 May 2014 15:31:58 +0000"  >&lt;p&gt;To be clear, &lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=dmurphy&quot; class=&quot;user-hover&quot; rel=&quot;dmurphy&quot;&gt;dmurphy&lt;/a&gt;, when a client gets a response for a &quot;w:majority&quot; write, only forced reconfigurations should cause that write to roll back.  Kenny&apos;s case and the proposal I derived from it really only apply to writes confirmed with less than &quot;w:majority&quot;, and &lt;em&gt;maybe&lt;/em&gt; to giving &quot;w:majority&quot; writes an opportunity to complete and respond before voluntary stepdown.  In those cases, while the application should be prepared for those writes to roll back due to failover, it&apos;s a favor to the operator not to roll them back during planned step-down.&lt;/p&gt;</comment>
                            <comment id="591694" author="dmurphy" created="Wed, 21 May 2014 15:03:47 +0000"  >&lt;p&gt;To Kenny&apos;s point on a parameter, I think it would be great if nodes could veto themselves if they are too old and catching up.&lt;/p&gt;

&lt;p&gt;I do wonder, if we did that, would we also need a setting somewhere (conf or config.settings) to set whether mongo should allow or disallow rollbacks?&lt;/p&gt;

&lt;p&gt;We as a community don&apos;t like to have configuration options; however, this is a major tunable for determining how HA should work, in that it&apos;s a prolonged lack of a primary vs. potential data loss/logical corruption if a rollback is performed.&lt;/p&gt;

&lt;p&gt;Andy, to your point, I think such a setting would help remediate the issue with fire-and-forget writes, as you could plan your logic around this.&lt;/p&gt;

&lt;p&gt;Also, I would assert that since it&apos;s fire and forget and we made no guarantee the DB saved this data, a rollback is not &quot;critical&quot; for them in the way it is for the higher-level write concerns, which expect no such data removal since the DB reported a success on saving it to 1+N nodes and/or journals.&lt;/p&gt;</comment>
                            <comment id="591674" author="schwerin" created="Wed, 21 May 2014 14:47:02 +0000"  >&lt;p&gt;Kenny,&lt;/p&gt;

&lt;p&gt;I believe you&apos;re proposing approximately the following behavior, during operator-driven step downs. In addition to specifying the duration of the demurral period, the operator specifies the duration of the &quot;catch-up period&quot;, during which time the set will not accept writes, but the original node remains primary.  When that parameter is specified, the primary waits for secondaries to catch up to the last accepted write.  Once a majority of members have oplogged that write, or when the period expires, the primary actually steps down and an election takes place.  If the demurral period is shorter than the catch-up period or a &quot;do not step down if secondaries not caught up&quot; flag is set, and sufficient secondaries have not caught up by the end of the catch-up period, then the primary would simply not step down, or would stand for reelection.&lt;/p&gt;

&lt;p&gt;During the catch-up period, writes would be rejected.  There might be some sticky issues around that for fire-and-forget writes, but I haven&apos;t yet found a flaw with the core of the algorithm.&lt;/p&gt;</comment>
                            <comment id="591606" author="kennygorman" created="Wed, 21 May 2014 13:56:17 +0000"  >&lt;p&gt;Another possibility would be to actually provide a parameter to stepDown() asking for the election not to finish until N seconds have elapsed in order for slaves to catch up.  If they aren&apos;t caught up, then the primary itself is re-elected after that period.  Like a &quot;no for serious, don&apos;t rollback&quot; mode.&lt;/p&gt;</comment>
                            <comment id="591586" author="kennygorman" created="Wed, 21 May 2014 13:45:38 +0000"  >&lt;p&gt;Eric,&lt;/p&gt;

&lt;p&gt;Thanks for the reply.&lt;/p&gt;

&lt;p&gt;Is there a SERVER ticket for the design?  It sounds like the design still won&apos;t guarantee we won&apos;t roll back.&lt;/p&gt;

&lt;p&gt;The use case here is that sometimes we call stepDown() on healthy primaries in order to move around the workloads.  When we do this, we would like to instruct MongoDB to not tolerate data loss (it shouldn&apos;t need to lose data).  We need a way to communicate that to MongoDB.&lt;/p&gt;

&lt;p&gt;The problem is that:&lt;/p&gt;
&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Data loss is almost guaranteed  (any reasonable workload ensures lag &amp;gt;0)&lt;/li&gt;
	&lt;li&gt;The amount of loss is essentially random (we can&apos;t control the lag directly)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;It would be OK to have the stepDown take longer while the soon-to-be PRIMARY consumes 100% of possible oplog to ensure it&apos;s as up to date as possible.  Something roughly like:&lt;/p&gt;

&lt;p&gt;if stepDown() &amp;amp;&amp;amp; applyAllOplogMode:&lt;br/&gt;
  if primary_is_still_up():&lt;br/&gt;
     keep_applying_oplog(until_empty_cursor)&lt;br/&gt;
     perform_election()&lt;/p&gt;

&lt;p&gt;Where applyAllOplogMode is set via command line at startup.&lt;/p&gt;




</comment>
                            <comment id="591483" author="milkie" created="Wed, 21 May 2014 11:41:05 +0000"  >&lt;p&gt;We are planning on changing the order of when the oplog is written versus when the writes are applied, on secondary nodes.  This will have a big impact on reducing the amount of potential data rolled back after a primary demotion.&lt;/p&gt;</comment>
                            <comment id="590666" author="kennygorman" created="Tue, 20 May 2014 19:45:55 +0000"  >&lt;p&gt;Any thoughts?&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="166222">SERVER-15861</issuekey>
        </issuelink>
                            </outwardlinks>
                                                        </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>12.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Tue, 20 May 2014 13:27:41 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        9 years, 1 day ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>ramon.fernandez@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            9 years, 1 day ago
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10032" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Operating System</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10026"><![CDATA[ALL]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>schwerin@mongodb.com</customfieldvalue>
            <customfieldvalue>charity@parse.com</customfieldvalue>
            <customfieldvalue>david.b.murphy.tx@gmail.com</customfieldvalue>
            <customfieldvalue>milkie@mongodb.com</customfieldvalue>
            <customfieldvalue>kennygorman</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hrlupr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hryy6v:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>118157</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_23361" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Requested By</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hricdb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>