<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 03:47:44 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-18453] Avoiding Rollbacks in new Raft based election protocol</title>
                <link>https://jira.mongodb.org/browse/SERVER-18453</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;&lt;b&gt;Background&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;The current election protocol gives preference to the node with the most recent opTime in its oplog. Essentially, out of a pool of &quot;electable&quot; nodes, the one with the most advanced opTime wins the election. The rationale is to minimize the amount of data rolled back due to a failover.&lt;/p&gt;

&lt;p&gt;The Raft protocol does not have such a component in the election protocol. We do not want to re-introduce it because then we could no longer rely on the robustness/correctness of the Raft protocol in our implementation. In fact, it seems this part of the current election protocol has been a source of bugs (e.g. no master elected). I agree with this design.&lt;/p&gt;

&lt;p&gt;The result is that in MongoDB 3.2, users will be much more likely to see data rolled back that was written with write concern &amp;lt; majority. From a strict point of view this is OK, as we have never guaranteed that such writes would be safe during failover. From a practical point of view, however, users can today run with w:2 or even w:1 and expect to lose a minimal amount of data on failovers. In MongoDB 3.2 they could suddenly lose hundreds of milliseconds of transactions, which is arguably a regression in our capability.&lt;/p&gt;


&lt;p&gt;&lt;b&gt;Example test case&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;(Note: this is based on my understanding of our current election protocol; I haven&apos;t actually verified it so far. If we want to address this problem at all, the first step would be to run this test. If this test doesn&apos;t trigger the behavior I&apos;m describing for 3.2, it will nevertheless be possible to construct a more contrived test case that will.)&lt;/p&gt;

&lt;p&gt;write concern = 2&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;Primary in North America&lt;/li&gt;
	&lt;li&gt;2 secondaries in North America&lt;/li&gt;
	&lt;li&gt;2 secondaries in Europe&lt;/li&gt;
	&lt;li&gt;All nodes have equal/default priority&lt;/li&gt;
	&lt;li&gt;Network RTT Europe-NA is roughly 100 ms&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;In MongoDB 2.4-3.0, a primary failure is likely to cause one of the other NA secondaries to become primary, because they will of course be 100 ms ahead of their EU peers in replication.&lt;/p&gt;

&lt;p&gt;In the proposed MongoDB 3.2 design, all secondaries have equal probability of becoming primary. Therefore there is a 50% chance that an EU node becomes primary, in which case the NA secondaries would roll back 100 ms of their oplog.&lt;/p&gt;


&lt;p&gt;&lt;b&gt;Proposal&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;I believe it is possible to give more attention to not rolling back oplog as follows:&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Execute the Raft-based election as currently planned&lt;/li&gt;
	&lt;li&gt;When a new primary is elected, it first checks all other reachable nodes for their oplog state.&lt;/li&gt;
	&lt;li&gt;If another node has a more recent opTime, the new primary connects to that node, copies the missing part of the oplog, and applies those operations locally.&lt;/li&gt;
	&lt;li&gt;Now start operating as the new primary.&lt;/li&gt;
&lt;/ul&gt;
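
&lt;p&gt;A minimal sketch of the catch-up step in the list above, written in Python purely for illustration. The Node class, the catch_up function, and the integer opTimes are hypothetical and are not MongoDB APIs. The sketch assumes the oplog of the new primary is a prefix of the oplog it copies from (the two have not diverged), and it does not model any timeout.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a replica set member (not a MongoDB API)."""
    name: str
    oplog: list = field(default_factory=list)  # opTimes as increasing integers
    reachable: bool = True

    def last_optime(self):
        # An empty oplog counts as opTime 0 in this sketch.
        return self.oplog[-1] if self.oplog else 0


def catch_up(new_primary, peers):
    """Copy missing oplog entries from the freshest reachable peer.

    Assumes the oplog of new_primary is a prefix of the peer oplog it copies
    from. Returns the number of entries copied.
    """
    candidates = [p for p in peers if p.reachable]
    if not candidates:
        return 0
    freshest = max(candidates, key=Node.last_optime)
    # Slicing past the end yields an empty list, so a peer that is not ahead
    # contributes nothing.
    missing = freshest.oplog[len(new_primary.oplog):]
    new_primary.oplog.extend(missing)
    return len(missing)
```

&lt;p&gt;For example, with a newly elected EU primary at opTimes [1, 2, 3] and a reachable NA secondary at [1, 2, 3, 4, 5], catch_up copies two entries before the primary starts accepting writes. Unreachable nodes are ignored, in line with the restriction to reachable nodes in the proposal.&lt;/p&gt;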



&lt;p&gt;Benefits of my proposal:&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Doesn&apos;t mess with the election protocol. The primary is elected as per Raft, then this fix is applied as an additional step.&lt;/li&gt;
	&lt;li&gt;Ensures that operations that exist on any one available node will not be rolled back; instead they will be applied on the primary.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Drawbacks of my proposal:&lt;/p&gt;

&lt;ul class=&quot;alternate&quot; type=&quot;square&quot;&gt;
	&lt;li&gt;Will make the failover time longer. Potentially this increase in failover time is unbounded too. But it would be possible to create some upper bound for this, for example by continuing the current rule that nodes that are more than 10 seconds behind are considered un-electable. (Such nodes must then be considered failed from a Raft point of view: they cannot participate in elections and therefore not in majority acknowledgements either.)&lt;/li&gt;
	&lt;li&gt;The longer failover time also applies in cases where an application has been using w:majority and wouldn&apos;t care about losing transactions that exist on one node but weren&apos;t majority acknowledged. Hence users who want to minimize failover time must be able to turn this functionality off. (Possibly this could be turned on/off automatically by the primary detecting which write concerns are used by clients?)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;&lt;b&gt;Proposed priority&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;3.1 Desired or less. The justification for this is that per documentation we don&apos;t promise that rollback wouldn&apos;t happen to non-majority committed data.&lt;/p&gt;</description>
                <environment></environment>
        <key id="203830">SERVER-18453</key>
            <summary>Avoiding Rollbacks in new Raft based election protocol</summary>
                <type id="4" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14710&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="3">Duplicate</resolution>
                                        <assignee username="backlog-server-repl">Backlog - Replication Team</assignee>
                                    <reporter username="henrik.ingo@mongodb.com">Henrik Ingo</reporter>
                        <labels>
                    </labels>
                <created>Wed, 13 May 2015 11:15:38 +0000</created>
                <updated>Tue, 6 Dec 2022 04:51:41 +0000</updated>
                            <resolved>Tue, 6 Sep 2016 20:24:16 +0000</resolved>
                                                                    <component>Replication</component>
                                        <votes>1</votes>
                                    <watches>35</watches>
                                                                                                                <comments>
                            <comment id="1377703" author="milkie" created="Tue, 6 Sep 2016 20:24:16 +0000"  >&lt;p&gt;This idea was implemented as part of the work for &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-23663&quot; title=&quot;New primary syncs from chosen node to catch up with timeout&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-23663&quot;&gt;&lt;del&gt;SERVER-23663&lt;/del&gt;&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="1236898" author="henrik.ingo@10gen.com" created="Fri, 15 Apr 2016 10:17:40 +0000"  >&lt;blockquote&gt;
&lt;p&gt; A newly elected node should never roll anything back&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;I like this thinking. Makes sense.&lt;/p&gt;

&lt;p&gt;And yes, this is of course very unlikely to happen, so not suggesting we optimize for it, just that it is handled correctly one way or another.&lt;/p&gt;
</comment>
                            <comment id="1235652" author="henrik.ingo@10gen.com" created="Thu, 14 Apr 2016 11:38:09 +0000"  >&lt;blockquote&gt;
&lt;p&gt;I just re-read the problem description in this ticket, and I think the example is a little misleading. For one thing, if the two North American nodes are truly 100ms ahead of the two European nodes, they won&apos;t vote for either of the European nodes, denying them the majority required to win the election. The real difference between the old and new protocols is what happens when exactly one of the surviving nodes has a write that none of the other nodes have, and this condition survives until some node&apos;s election timeout expires.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;&lt;del&gt;Not explicit, but I was thinking of a failure where 2 American nodes fail (e.g. are partitioned) and the surviving majority partition consists of 1 American node, which has all the recent transactions, and 2 European nodes which are 100ms behind with transactions, but could vote for each other to elect a new primary and forcing the American node to roll back transactions.&lt;/del&gt;&lt;/p&gt;

&lt;p&gt;Scratch that. Now that I&apos;m re-reading and refreshing my memory, what you say is correct.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The proposal is still interesting, as it could provide a useful knob for a user to use to balance the likelihood of losing w:2 writes in 5-node sets against the amount of time that must pass before the replica set becomes available for writes.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;The proposal surely addresses a corner case (at best... I must emphasize I didn&apos;t actually test any of this). For better or worse, I live with a mind that easily spots corner cases.&lt;/p&gt;</comment>
                            <comment id="1233838" author="schwerin" created="Tue, 12 Apr 2016 20:52:53 +0000"  >&lt;p&gt;I just re-read the problem description in this ticket, and I think the example is a little misleading. For one thing, if the two North American nodes are truly 100ms ahead of the two European nodes, they won&apos;t vote for either of the European nodes, denying them the majority required to win the election. The real difference between the old and new protocols is what happens when exactly one of the surviving nodes has a write that none of the other nodes have, and this condition survives until some node&apos;s election timeout expires.&lt;/p&gt;

&lt;p&gt;In the original election protocol (selected via protocolVersion: 0 in the replica set configuration document in MongoDB 3.2 and later), a node participating in an election may veto a candidate if it believes that it or some third node is up and has an operation in its oplog that the candidate does not have. In the new election protocol (protocolVersion: 1), a node can only refrain from voting for a candidate, and then only if it believes that it has a newer operation in its own oplog than the candidate. However, during the time between the old primary crashing and the candidate standing for election, nodes continue to fetch operations from each other&apos;s oplogs, improving the likelihood that the newest operations end up in the majority of nodes&apos; oplogs. I suspect that in practice this lowers the odds of losing the write that is initially present in only one secondary when the original primary first crashes.&lt;/p&gt;

&lt;p&gt;The proposal is still interesting, as it could provide a useful knob for a user to use to balance the likelihood of losing w:2 writes in 5-node sets against the amount of time that must pass before the replica set becomes available for writes.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10010">
                    <name>Duplicate</name>
                                            <outwardlinks description="duplicates">
                                        <issuelink>
            <issuekey id="279177">SERVER-23663</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is duplicated by">
                                        <issuelink>
            <issuekey id="263682">SERVER-22502</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                            <issuelinktype id="10012">
                    <name>Related</name>
                                            <outwardlinks description="related to">
                                        <issuelink>
            <issuekey id="146793">SERVER-14539</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="93307">SERVER-11086</issuekey>
        </issuelink>
                            </outwardlinks>
                                                                <inwardlinks description="is related to">
                                                        </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                <customfield id="customfield_12751" key="com.atlassian.jira.plugin.system.customfieldtypes:multiselect">
                        <customfieldname>Assigned Teams</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="25128"><![CDATA[Replication]]></customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_13552" key="com.go2group.jira.plugin.crm:crm_generic_field">
                        <customfieldname>Case</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[[500A000000VWsHjIAL]]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Wed, 13 May 2015 22:13:59 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        7 years, 23 weeks, 1 day ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>alexander.golin@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            7 years, 23 weeks, 1 day ago
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                    <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>schwerin@mongodb.com</customfieldvalue>
            <customfieldvalue>backlog-server-repl</customfieldvalue>
            <customfieldvalue>milkie@mongodb.com</customfieldvalue>
            <customfieldvalue>henrik.ingo@mongodb.com</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hrikev:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hremzj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_23361" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Requested By</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hsg1dr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>