[DRIVERS-1307] Investigate changes in PM-1096: Initial Sync Semantics Created: 25/Jun/20  Updated: 27/May/22  Resolved: 29/Jun/20

Status: Closed
Project: Drivers
Component/s: None
Fix Version/s: None

Type: Epic
Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team
Assignee: Unassigned
Resolution: Won't Fix
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Server Compat: 4.9

 Description   
Downstream Change Summary

Description of Linked Ticket

Epic Summary

Summary

When a node is added to a replica set and goes into initial sync, its addition has an effect on the availability and durability guarantees of the replica set, both while it is in the STARTUP2 state and for some time after it transitions to SECONDARY. Those effects are poorly understood, difficult to reason about, and may not be what people expect. We should modify our behavior such that we no longer break any guarantees as part of initial sync, and generally bring our behavior more in line with user expectations.

Motivation

Initial sync semantics currently have two main problems. The first is that adding a new voting node makes it possible for writes acknowledged with w:majority to roll back. This can happen to new writes that the initial syncing node acknowledges, if the initial sync then fails, and to writes that were acknowledged before the initial syncing node was added to the set (since adding it changes the definition of majority).
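The first failure case can be illustrated with a little quorum arithmetic. The sketch below is a simplified model, not server code: the helper names `majority` and `could_roll_back` and the node labels are hypothetical, and the model ignores election details such as priorities and oplog freshness. It only checks whether the voting members that do not hold a write could form a majority on their own.

```python
def majority(voting_members: int) -> int:
    """Smallest number of voting members that forms a strict majority."""
    return voting_members // 2 + 1


def could_roll_back(holders: set, voters: set) -> bool:
    """Simplified check (assumption): a write is at risk if the voters
    that do NOT hold it are numerous enough to form a majority alone."""
    non_holders = voters - holders
    return len(non_holders) >= majority(len(voters))


# Adding a voting node changes the size of a majority.
size_with_three = majority(3)  # 2
size_with_four = majority(4)   # 3

# Two-node set {A, B}; C is added as a voting node and starts initial sync.
voters = {"A", "B", "C"}

# A write is acknowledged by A and by C (still in initial sync),
# which satisfies w:majority for three voters.
at_risk_before_failure = could_roll_back({"A", "C"}, voters)

# C's initial sync fails and restarts, discarding its copy of the write.
# Now only A holds it, and B plus C can form a majority without it.
at_risk_after_failure = could_roll_back({"A"}, voters)
```

In this model the write is safe while both A and C hold it, but once C discards its data the two remaining members without the write are themselves a majority of three, so the acknowledged write could be rolled back.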

The second problem is that, since the switch to timestamp-based rollback in 4.0, a node that needs to roll back after completing initial sync but before committing a new operation as SECONDARY will fail the rollback with an UnrecoverableRollbackError and need a full resync. In earlier versions, which used rollback via refetch, the rollback would succeed. A full resync can also be required if the node crashes during that same time window. Users may not expect this new behavior, so we would like to return to a world where a node that has exited the STARTUP2 state and transitioned to SECONDARY is stable and able to do anything another secondary in the set can do.

Documentation

Scope Document
Design Document

(ARCHIVED) Scope Document



 Comments   
Comment by Esha Bhargava [ 29/Jun/20 ]

No Drivers changes needed.
