<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 03:55:23 UTC 2024
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-20820] Arbiter with instant replay</title>
                <link>https://jira.mongodb.org/browse/SERVER-20820</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;h1&gt;&lt;a name=&quot;ArbiterswithInstantReplay&quot;&gt;&lt;/a&gt;Arbiters with Instant Replay&lt;/h1&gt;
&lt;h2&gt;&lt;a name=&quot;Introduction&quot;&gt;&lt;/a&gt;Introduction&lt;/h2&gt;
&lt;p&gt;This document proposes an alternative to traditional 3-node clusters (called &lt;b&gt;replica sets&lt;/b&gt;) used for high-reliability, high-availability MongoDB configurations. The idea has two parts: give the arbiter a log of write operations, and let failed nodes return to a majority-confirmed consistent state without refetching documents. The goal is to improve performance and availability while maintaining sufficient reliability guarantees, particularly in single-data-center configurations.&lt;/p&gt;

&lt;p&gt;A MongoDB replica set consists of a number of nodes, each running a &lt;tt&gt;mongod&lt;/tt&gt; instance that maintains a full copy of the database. A single &lt;tt&gt;mongod&lt;/tt&gt; is elected &lt;b&gt;primary&lt;/b&gt; and is the only node that accepts writes. Replica sets with an even number of data-bearing nodes typically add a non-data-bearing &lt;b&gt;arbiter&lt;/b&gt; node to act as a tiebreaker in elections for primary.&lt;/p&gt;

&lt;p&gt;The &lt;b&gt;secondary&lt;/b&gt; nodes replicate all writes by replaying the write operations recorded in the primary&apos;s &lt;b&gt;oplog&lt;/b&gt;. A database client can attach a &lt;b&gt;write concern&lt;/b&gt; to write operations. A majority write concern requires a write to be recorded by a majority of nodes before it is considered successful, which protects acknowledged writes against loss from any single-node failure.&lt;/p&gt;

&lt;p&gt;When a node recovers from an unexpected shutdown or hardware failure, it first finds the latest common entry between its oplog and that of another node in the cluster (its sync source). It then proceeds as follows:&lt;/p&gt;

&lt;ul&gt;
	&lt;li&gt;roll back all local (orphaned) operations after that common entry by truncating the oplog and fetching fresh copies of every document they referenced, restoring consistency&lt;/li&gt;
	&lt;li&gt;retrieve any new oplog entries from the sync source and apply them locally&lt;/li&gt;
&lt;/ul&gt;
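&lt;p&gt;The two steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not MongoDB&apos;s actual rollback code. Oplog entries are modelled as hypothetical &lt;tt&gt;(timestamp, document id, new value)&lt;/tt&gt; tuples that overwrite the whole document, or delete it when the value is &lt;tt&gt;None&lt;/tt&gt;. The hypothetical &lt;tt&gt;fetch_doc&lt;/tt&gt; callback stands in for a network read from the sync source. The returned count shows the cost this proposal aims to avoid: one fetch per document touched by orphaned operations.&lt;/p&gt;

&lt;pre&gt;
```python
# Minimal sketch of traditional rollback-then-catch-up recovery.
# Assumption: an oplog entry is a (timestamp, doc_id, value) tuple that
# overwrites the whole document, or deletes it when value is None.
from typing import Callable, Dict, List, Optional, Tuple

Entry = Tuple[int, str, Optional[str]]


def apply(data: Dict[str, str], doc_id: str, value: Optional[str]) -> None:
    """Overwrite (or delete) one document; idempotent by construction."""
    if value is None:
        data.pop(doc_id, None)
    else:
        data[doc_id] = value


def common_point(local: List[Entry], source: List[Entry]) -> int:
    """Number of leading entries the two oplogs have in common."""
    n = 0
    for a, b in zip(local, source):
        if a != b:
            break
        n += 1
    return n


def recover(local: List[Entry], data: Dict[str, str], source: List[Entry],
            fetch_doc: Callable[[str], Optional[str]]) -> int:
    """Bring the local oplog and data in line with the sync source.

    Returns the number of documents that had to be refetched.
    """
    n = common_point(local, source)
    # Step 1: roll back orphaned operations. Replace every document they
    # touched with a fresh copy from the sync source, then truncate the oplog.
    touched = {doc_id for _, doc_id, _ in local[n:]}
    for doc_id in touched:
        apply(data, doc_id, fetch_doc(doc_id))
    del local[n:]
    # Step 2: retrieve the sync source's newer entries and apply them locally.
    for ts, doc_id, value in source[n:]:
        apply(data, doc_id, value)
        local.append((ts, doc_id, value))
    return len(touched)
```
&lt;/pre&gt;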


&lt;h2&gt;&lt;a name=&quot;The3nodereplicaset&quot;&gt;&lt;/a&gt;The 3-node replica set&lt;/h2&gt;
&lt;p&gt;In this configuration, three &lt;tt&gt;mongod&lt;/tt&gt; instances each maintain a full copy of the database. The set retains full write availability, including majority writes, in a degraded state with a single node down.&lt;/p&gt;

&lt;p&gt;However, this guarantee comes at the significant cost of tripling the amount of storage required. Additionally, a majority write incurs more than the latency of reaching stable storage on the local node. It also incurs the latency of sending the operation over the network to a secondary, committing it to stable storage there, and receiving the acknowledgement of that commit. (These latencies typically overlap in time, however.)&lt;/p&gt;

&lt;h2&gt;&lt;a name=&quot;The2nodewitharbiterreplicaset&quot;&gt;&lt;/a&gt;The 2-node with arbiter replica set&lt;/h2&gt;
&lt;p&gt;An alternative to the full 3-node setup is to have two data-bearing nodes and an arbiter. The arbiter provides the tie-breaking vote when electing a new primary if the existing one fails.&lt;/p&gt;

&lt;p&gt;This configuration reduces the storage cost by requiring only twice as many data-bearing servers as a standalone &lt;tt&gt;mongod&lt;/tt&gt;. However, in the degraded state (one data-bearing node down), writes can no longer be acknowledged by a majority, because the arbiter holds no data. Such writes risk being rolled back if the remaining primary fails.&lt;/p&gt;

&lt;p&gt;When a failed node comes back online, it must first catch up with the primary before full reliability is restored. Suppose a node goes offline for 10 minutes of scheduled maintenance and then takes an additional 5 minutes to catch up. Majority writes are then unavailable for 15 minutes, and all writes accepted during that period would be lost if the remaining node failed. Contrast this with non-degraded operation, where the window of vulnerability to rollback is typically measured in seconds or even fractions of a second.&lt;/p&gt;

&lt;h2&gt;&lt;a name=&quot;IntroducingArbiterswithInstantReplay&quot;&gt;&lt;/a&gt;Introducing Arbiters with Instant Replay&lt;/h2&gt;
&lt;p&gt;The main issue with arbiters in the previous section is that they hold no data and therefore do not help establish a quorum for majority writes. This section proposes a new kind of arbiter, the arbiter with instant replay. Rather than maintaining a copy of the entire database, this new arbiter maintains only an oplog, and it also acknowledges writes. However, the arbiter will not discard any entries written after the time at which the set became degraded (that is, after a primary was elected without the full number of votes). When its oplog fills up, it will instead stop recording and acknowledging writes.&lt;/p&gt;
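&lt;p&gt;A minimal model of this arbiter behavior is sketched below. Several details are assumptions made for illustration: the class name, the &lt;tt&gt;mark_degraded&lt;/tt&gt; and &lt;tt&gt;mark_healthy&lt;/tt&gt; hooks, and counting capacity in entries rather than bytes. In particular, this sketch does not specify how the arbiter learns that the set has become degraded or healthy again.&lt;/p&gt;

&lt;pre&gt;
```python
# Hypothetical model of an arbiter with instant replay: it keeps only an
# oplog (no documents) and acknowledges writes. Capacity is counted in
# entries for simplicity; a real oplog would be sized in bytes.
from collections import deque
from typing import Deque, Optional, Tuple

Entry = Tuple[int, str, Optional[str]]  # (timestamp, doc_id, value)


class ReplayArbiter:
    def __init__(self, capacity: int) -> None:
        if 1 > capacity:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.oplog: Deque[Entry] = deque()
        # Timestamp at which the set became degraded, or None when healthy.
        self.degraded_since: Optional[int] = None

    def mark_degraded(self, ts: int) -> None:
        """Called when a primary is elected without the full number of votes."""
        self.degraded_since = ts

    def mark_healthy(self) -> None:
        """Called once all members have caught up again (mechanism unspecified)."""
        self.degraded_since = None

    def receive(self, entry: Entry) -> bool:
        """Record an oplog entry; return True if the write is acknowledged."""
        if len(self.oplog) >= self.capacity:
            oldest_ts = self.oplog[0][0]
            if self.degraded_since is not None and oldest_ts >= self.degraded_since:
                # Every stored entry was written while degraded, so refuse the
                # write rather than discard an entry that must be kept.
                return False
            self.oplog.popleft()  # oldest entry predates degradation: safe to drop
        self.oplog.append(entry)
        return True
```
&lt;/pre&gt;

&lt;p&gt;With a capacity of three entries, this model discards old entries freely while the set is healthy. If the set becomes degraded at timestamp 5, the model keeps entries 5, 6 and 7 and refuses entry 8 rather than overwrite any of them.&lt;/p&gt;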

&lt;p&gt;In addition, regular nodes acting as primary need one change: at all times, they keep a checkpoint of a recent majority-committed snapshot. This lets a node perform a local rollback without fetching the documents it modified locally, because it can restore the checkpoint and replay majority-committed oplog entries from there. As a result, a recovering node can use any node as its sync source, including an arbiter.&lt;/p&gt;
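&lt;p&gt;The following is a minimal sketch of such a rollback. It assumes hypothetical &lt;tt&gt;(timestamp, document id, new value)&lt;/tt&gt; oplog entries that overwrite or delete a whole document. The node restores its majority-committed checkpoint and replays newer entries from the sync source&apos;s oplog. It never reads a document from another node, which is why an arbiter holding only an oplog would suffice as the source. The function name and the gap check are illustrative assumptions.&lt;/p&gt;

&lt;pre&gt;
```python
# Hypothetical sketch of "instant replay" recovery from a checkpoint of a
# recent majority-committed snapshot. No document is fetched from another
# node; only oplog entries are needed, so the sync source may be an arbiter.
from typing import Dict, List, Optional, Tuple

Entry = Tuple[int, str, Optional[str]]  # (timestamp, doc_id, value or None to delete)


def instant_replay(checkpoint: Dict[str, str], checkpoint_ts: int,
                   source_oplog: List[Entry]) -> Dict[str, str]:
    """Return the node's data after discarding local changes and catching up."""
    if source_oplog and source_oplog[0][0] > checkpoint_ts:
        # The source may already have dropped entries this node would need.
        raise ValueError("sync source oplog does not reach back to the checkpoint")
    data = dict(checkpoint)  # drop every local change made after the checkpoint
    for ts, doc_id, value in source_oplog:
        if ts > checkpoint_ts:  # replay only entries newer than the snapshot
            if value is None:
                data.pop(doc_id, None)
            else:
                data[doc_id] = value
    return data
```
&lt;/pre&gt;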

&lt;p&gt;With the proposed changes, arbiters will be able to confirm majority writes. If one data-bearing node goes down, any writes the primary accepts in the degraded state will be safe even if the primary then goes down as well, because they are also recorded in the arbiter&apos;s oplog. As long as a majority of nodes (arbiters or regular data-bearing nodes) are up, the replica set can recover and accept majority writes.&lt;/p&gt;
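&lt;p&gt;The effect on majority writes can be checked by counting acknowledging members. In this sketch, the &lt;tt&gt;Member&lt;/tt&gt; type and its &lt;tt&gt;acks_writes&lt;/tt&gt; flag are hypothetical, not MongoDB configuration options, and every member is assumed to have one vote.&lt;/p&gt;

&lt;pre&gt;
```python
# Hypothetical check of whether a majority write can be acknowledged,
# comparing a classic arbiter with an arbiter that keeps an oplog.
from dataclasses import dataclass
from typing import List


@dataclass
class Member:
    name: str
    up: bool
    acks_writes: bool  # True for data-bearing nodes and replay arbiters


def majority(voters: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return voters // 2 + 1


def can_commit_majority(members: List[Member]) -> bool:
    acks = sum(1 for m in members if m.up and m.acks_writes)
    return acks >= majority(len(members))


# Two data-bearing nodes (A, B) plus an arbiter, with B down.
classic = [Member("A", True, True), Member("B", False, True), Member("arbiter", True, False)]
replay = [Member("A", True, True), Member("B", False, True), Member("arbiter", True, True)]
print(can_commit_majority(classic), can_commit_majority(replay))  # False True
```
&lt;/pre&gt;

&lt;p&gt;With node B down, the classic arbiter leaves one acknowledging member out of three, below the majority of two. The replay arbiter supplies the second acknowledgement.&lt;/p&gt;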

&lt;p&gt;Finally, in normal operation, it is expected that the arbiter will be able to confirm writes much faster as it doesn&apos;t need to perform random I/O but rather only sequential writes to a log, speeding up majority writes.&lt;/p&gt;</description>
                <environment></environment>
        <key id="233524">SERVER-20820</key>
            <summary>Arbiter with instant replay</summary>
                <type id="4" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14710&amp;avatarType=issuetype">Improvement</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="10038" iconUrl="https://jira.mongodb.org/images/icons/subtask.gif" description="">Backlog</status>
                    <statusCategory id="2" key="new" colorName="default"/>
                                    <resolution id="-1">Unresolved</resolution>
                                        <assignee username="backlog-server-repl">Backlog - Replication Team</assignee>
                                    <reporter username="geert.bosch@mongodb.com">Geert Bosch</reporter>
                        <labels>
                    </labels>
                <created>Thu, 8 Oct 2015 14:46:23 +0000</created>
                <updated>Tue, 6 Dec 2022 04:42:40 +0000</updated>
                <component>Replication</component>
                                        <votes>1</votes>
                                    <watches>17</watches>
                <comments>
                            <comment id="3309794" author="JIRAUSER1254419" created="Tue, 28 Jul 2020 14:00:39 +0000"  >&lt;p&gt;Note: a classical &quot;capped collection&quot; oplog would also be useful on such an arbiter. It could store a huge, week-long oplog for point-in-time recovery (together with properly implemented snapshotting).&lt;/p&gt;</comment>
                            <comment id="1055716" author="geert.bosch" created="Thu, 8 Oct 2015 18:09:42 +0000"  >&lt;p&gt;One way to formalize this would be for the arbiter to periodically write a fullyWritten marker to the oplog (on the primary, of course) with write concern &lt;tt&gt;{ w: 3 }&lt;/tt&gt; (still assuming the 2-node + arbiter example; in general it should be all nodes), and to never truncate its oplog beyond the last such marker.&lt;/p&gt;

&lt;p&gt;As for the oplog always being full, that is not necessary. As soon as we have a new fullyWritten marker, we can drop all oplog entries older than that write. This gives a very simple way to truncate the oplog, and will allow the oplog to require writes to datafile, in practice. This also works well with merging the journal and oplog.&lt;/p&gt;
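&lt;p&gt;A minimal sketch of this truncation rule follows. It assumes a hypothetical oplog of &lt;tt&gt;(timestamp, kind)&lt;/tt&gt; pairs in which every &lt;tt&gt;fullyWritten&lt;/tt&gt; marker present has already been acknowledged with &lt;tt&gt;{ w: 3 }&lt;/tt&gt;. Everything older than the last marker may be dropped; the marker and all newer entries must be kept.&lt;/p&gt;

&lt;pre&gt;
```python
# Hypothetical sketch: truncate an oplog up to the last fullyWritten marker.
# Assumes every marker in the list was acknowledged by all members.
from typing import List, Tuple

Entry = Tuple[int, str]  # (timestamp, kind), kind is "op" or "fullyWritten"


def truncate_to_last_marker(oplog: List[Entry]) -> List[Entry]:
    last_marker = None
    for i, (_, kind) in enumerate(oplog):
        if kind == "fullyWritten":
            last_marker = i
    if last_marker is None:
        return list(oplog)  # no marker yet: nothing is known to be everywhere
    # Entries before the marker are on all members; keep the marker onward.
    return oplog[last_marker:]


log = [(1, "op"), (2, "fullyWritten"), (3, "op"), (4, "fullyWritten"), (5, "op")]
print(truncate_to_last_marker(log))  # [(4, 'fullyWritten'), (5, 'op')]
```
&lt;/pre&gt;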

&lt;p&gt;I&apos;m dubious about &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-7200&quot; title=&quot;use oplog as op buffer on secondaries&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-7200&quot;&gt;&lt;del&gt;SERVER-7200&lt;/del&gt;&lt;/a&gt;, but will comment there.&lt;/p&gt;</comment>
                            <comment id="1055649" author="milkie" created="Thu, 8 Oct 2015 17:27:25 +0000"  >&lt;p&gt;A few thoughts:&lt;br/&gt;
&lt;cite&gt;However, the arbiter will not discard entries past the time at which the set became degraded (a primary was elected without the full number of votes)&lt;/cite&gt;&lt;br/&gt;
It is not clear how to formalize this. Also, how would you determine when the set is no longer degraded?&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;When the oplog fills up, it will instead stop recording and acknowledging writes.&lt;/cite&gt;&lt;br/&gt;
Technically, the oplog is always full in steady-state replication. We will need some way of knowing what data we are deleting.&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;Finally, in normal operation, it is expected that the arbiter will be able to confirm writes much faster as it doesn&apos;t need to perform random I/O but rather only sequential writes to a log, speeding up majority writes&lt;/cite&gt;&lt;br/&gt;
&lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-7200&quot; title=&quot;use oplog as op buffer on secondaries&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-7200&quot;&gt;&lt;del&gt;SERVER-7200&lt;/del&gt;&lt;/a&gt; will essentially be the same thing, with the same benefit, but for all nodes.&lt;br/&gt;
However, another benefit will be that we can keep a user authorization table up to date, thus solving &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-5479&quot; title=&quot;Arbiter in authenticated replica set should allow and require login/auth for admin-only operations&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-5479&quot;&gt;SERVER-5479&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="1055403" author="geert.bosch" created="Thu, 8 Oct 2015 15:02:31 +0000"  >&lt;p&gt;Note that &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-14539&quot; title=&quot;Full consensus arbiter (i.e. uses an oplog)&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-14539&quot;&gt;SERVER-14539&lt;/a&gt; proposes a similar idea, but misses some aspects.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10012">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="146793">SERVER-14539</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_12751" key="com.atlassian.jira.plugin.system.customfieldtypes:multiselect">
                        <customfieldname>Assigned Teams</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="25128"><![CDATA[Replication]]></customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Thu, 8 Oct 2015 17:27:25 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        3 years, 28 weeks, 1 day ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>alexander.golin@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            3 years, 28 weeks, 1 day ago
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>backlog-server-repl</customfieldvalue>
            <customfieldvalue>milkie@mongodb.com</customfieldvalue>
            <customfieldvalue>geert.bosch@mongodb.com</customfieldvalue>
            <customfieldvalue>y.sokolov@joom.com</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hrksg7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hrfygn:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_23361" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Requested By</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hsfl2v:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>