Core Server / SERVER-20820

Arbiter with instant replay

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None

      Arbiters with Instant Replay

      Introduction

      This document proposes an alternative to traditional 3-node clusters (called replica sets) used for achieving high reliability / high availability MongoDB configurations. The idea consists of two parts: have an arbiter with a log of write operations, and have failed nodes return to a majority-confirmed consistent state without refetching documents. The goal is to improve performance and availability while maintaining sufficient reliability guarantees, in particular in single data-center configurations.

      MongoDB replica sets consist of a number of nodes, each with a mongod instance that maintains a full copy of the database. A single mongod is elected primary and is the only node to accept writes. Replica sets with an even number of data-bearing nodes typically use an additional non-data-bearing arbiter node as a tiebreaker during elections for primary.
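      For concreteness, a minimal sketch of initiating such a set (two data-bearing nodes plus an arbiter) with PyMongo follows; the hostnames and set name are placeholders, not part of this proposal.

        from pymongo import MongoClient

        # Connect directly to one of the future members to run the initiation.
        client = MongoClient("mongodb://node0.example.net:27017/?directConnection=true")
        client.admin.command("replSetInitiate", {
            "_id": "rs0",
            "members": [
                {"_id": 0, "host": "node0.example.net:27017"},
                {"_id": 1, "host": "node1.example.net:27017"},
                # Non-data-bearing tiebreaker for primary elections.
                {"_id": 2, "host": "arbiter.example.net:27017", "arbiterOnly": True},
            ],
        })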

      The secondary nodes replicate all writes by replaying the write operations recorded in the primary's oplog. A database client can attach a write concern to its write operations; a majority write concern requires a write to be recorded by a majority of nodes before it is considered successful, thus giving full protection against data loss due to single-node failures.
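      As a short example, a client can request majority acknowledgement with PyMongo as follows; the URI and namespace are placeholders.

        from pymongo import MongoClient, WriteConcern

        client = MongoClient("mongodb://node0.example.net,node1.example.net/?replicaSet=rs0")
        orders = client.shop.get_collection(
            "orders", write_concern=WriteConcern(w="majority")
        )
        # insert_one() does not report success until a majority of voting,
        # data-bearing members have durably recorded the write.
        orders.insert_one({"item": "widget", "qty": 2})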

      When a node recovers from an unexpected shutdown or hardware failure, it will first find the latest common entry between its oplog and that of another node in the cluster (the sync source). Then it will proceed as follows (a sketch of this sequence appears after the list):

      • roll back all local (orphaned) changes by truncating the oplog and fetching fresh copies of any referenced documents to restore consistency
      • retrieve any new oplog entries from the sync source and apply them locally
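
      The following toy sketch illustrates that control flow, with an oplog modeled as a list of (optime, doc_id) pairs and each node's data as a doc_id-to-document dict; real oplog entries carry full operations, so everything here is a simplification for illustration, not mongod internals.

        def recover(local_oplog, local_docs, source_oplog, source_docs):
            # 1. Find the length of the shared oplog prefix, i.e. the latest
            #    common point between the two oplogs.
            common = 0
            while (common < len(local_oplog) and common < len(source_oplog)
                   and local_oplog[common] == source_oplog[common]):
                common += 1

            # 2. Roll back orphaned local entries: fetch a fresh copy of each
            #    document they touched, then truncate the local oplog.
            for _optime, doc_id in local_oplog[common:]:
                local_docs[doc_id] = source_docs.get(doc_id)
            del local_oplog[common:]

            # 3. Catch up: replay the sync source's newer entries locally (in
            #    this toy model, applying an entry just copies the document).
            for optime, doc_id in source_oplog[common:]:
                local_docs[doc_id] = source_docs.get(doc_id)
                local_oplog.append((optime, doc_id))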

      The 3-node replica set

      In this configuration, three mongod instances each maintain a full copy of the database, and the set retains full write availability in a degraded state with a single node down.

      However, this guarantee comes at the significant cost of tripling the amount of storage required. Additionally, majority writes incur not just the latency of reaching stable storage on the local node, but also the latency of going through the network to a secondary, committing to stable storage there, and receiving the acknowledgement of the successful commit. (Note that these latencies typically overlap in time, however.)

      The 2-node with arbiter replica set

      An alternative to the full 3-node setup is to have two data-bearing nodes and an arbiter. The arbiter serves as a tie-breaking vote when electing a new primary if the existing one fails.

      This configuration addresses the storage cost by requiring only a doubling of the number of full, data-bearing servers compared to a standalone mongod. However, in the degraded state writes can no longer be replicated to a majority; such writes risk being rolled back if the remaining data-bearing node also fails.

      When a failed node comes back online, it must first catch up with the primary before full reliability is restored. If a node goes offline for 10 minutes for scheduled service, and it takes an additional 5 minutes for the node to catch up, the system has been unavailable for majority writes for 15 minutes, and all writes accepted during that period would be lost if the remaining node failed. Contrast this with non-degraded operation, where the typical window of vulnerability to rollback is measured in seconds or even fractions of seconds.

      Introducing Arbiters with Instant Replay

      The main issue with the arbiters of the previous section is that they are data-less and therefore do not help establish a quorum for majority writes. This section proposes a new kind of arbiter, the arbiter with instant replay. Rather than maintaining a copy of the entire database, this new arbiter maintains just an oplog, and it acknowledges writes. Unlike a regular capped oplog, however, the arbiter will not discard entries written after the time at which the set became degraded (that is, after a primary was elected without the full number of votes). When the oplog fills up with such entries, it will instead stop recording and acknowledging writes. A sketch of this write path follows.
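
      The following is a minimal sketch of that write path over a toy bounded oplog of (optime, operation) pairs; the class and method names are illustrative assumptions, not mongod internals.

        from collections import deque

        class InstantReplayArbiter:
            def __init__(self, max_entries):
                self.oplog = deque()
                self.max_entries = max_entries
                self.degraded_since = None  # optime at which the set degraded

            def on_degraded(self, optime):
                # A primary was elected without the full number of votes;
                # entries from this optime on must be retained for replay.
                self.degraded_since = optime

            def on_healthy(self):
                self.degraded_since = None

            def record(self, optime, op):
                """Append one entry; return True iff the write may be acked."""
                while len(self.oplog) >= self.max_entries:
                    if (self.degraded_since is not None
                            and self.oplog[0][0] >= self.degraded_since):
                        # The log is full of entries that must be kept: stop
                        # recording and acknowledging writes rather than
                        # discard them.
                        return False
                    # Otherwise age out the oldest entry, as a capped log does.
                    self.oplog.popleft()
                self.oplog.append((optime, op))
                return True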

      In addition, a change is needed for regular nodes acting as primary: at all times, a checkpoint of a recent majority-committed snapshot is kept. This allows a node to perform a local rollback without fetching fresh copies of locally modified documents, so a recovering node can use any node as a sync source, including arbiters, as the sketch below illustrates.
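
      A sketch of such checkpoint-based recovery, in the same toy data model as above; the Node fields and entry shape are assumptions for illustration only.

        from dataclasses import dataclass

        @dataclass
        class Node:
            docs: dict             # current documents, keyed by doc_id
            oplog: list            # list of (optime, doc_id, doc) entries
            checkpoint_docs: dict  # snapshot at the last majority commit
            checkpoint_len: int    # oplog length at that snapshot

        def recover_from_checkpoint(node: Node, source_oplog: list) -> None:
            # 1. Roll back locally: restore the majority-committed snapshot
            #    and truncate the oplog to it; no documents are refetched.
            node.docs = dict(node.checkpoint_docs)
            del node.oplog[node.checkpoint_len:]

            # 2. Replay newer entries from the sync source's oplog. Since only
            #    an oplog is needed, the source may be an instant-replay
            #    arbiter rather than a full data-bearing node.
            for optime, doc_id, doc in source_oplog[node.checkpoint_len:]:
                node.docs[doc_id] = doc
                node.oplog.append((optime, doc_id, doc))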

      With the proposed changes, arbiters will be able to confirm majority writes. If one node goes down, any writes accepted by the primary in the degraded state will be safe, even if the primary subsequently goes down as well. As long as a majority of nodes, whether arbiters or regular full data-bearing nodes, are up, the replica set can recover and accept majority writes.

      Finally, in normal operation the arbiter is expected to confirm writes much faster, as it performs only sequential writes to a log rather than random I/O, speeding up majority writes.

            Assignee: backlog-server-repl (Backlog - Replication Team)
            Reporter: Geert Bosch (geert.bosch@mongodb.com)
            Votes: 1
            Watchers: 17
