[SERVER-47852] Two primaries can satisfy write concern "majority" after data erased on a node Created: 30/Apr/20  Updated: 27/Oct/23  Resolved: 11/May/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Backlog - Replication Team
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 2_primaries_with_initial_sync_semantics_on.js    
Issue Links:
Related
related to SERVER-47363 Define correct use of MemberIds in At... Closed
is related to SERVER-54746 Two primaries in a replica set can sa... Closed
Assigned Teams:
Replication
Operating System: ALL
Steps To Reproduce:

/*
 * This test demonstrates two primaries existing in the same replica set, with both primaries
 * able to satisfy majority write concern.
 *
 * The test simulates the scenario below.
 * Note: 'P' refers to a primary, 'S' to a secondary.
 * 1) [P, S0, S1, S2] // Start a 4-node replica set.
 * 2) Partition A: [P] Partition B: [S0->P, S1, S2] // Create network partitions A & B.
 * 3) Partition A: [P] Partition B: [P, S1, S2, S3] // Add a new node S3 to partition B.
 * 4) Partition A: [P, S2] Partition B: [P, S1, S3] // Restart/resync S2 and move it back to the partition A pool.
 * 5) Partition A: [P, S2, S4] Partition B: [P, S1, S3] // Add a new node S4 to partition A.
 */
load('jstests/replsets/rslib.js');
(function() {
'use strict';
 
// Start a 4 node replica set.
// [P, S0, S1, S2]
const rst = new ReplSetTest({
    nodes: [{}, {}, {rsConfig: {priority: 0}}, {rsConfig: {priority: 0}}],
    nodeOptions: {setParameter: {enableAutomaticReconfig: false}},
    useBridge: true
});
 
// Disable chaining and prevent automatic elections due to liveness timeouts.
var config = rst.getReplSetConfig();
config.settings = config.settings || {};
config.settings["chainingAllowed"] = false;
config.settings["electionTimeoutMillis"] = ReplSetTest.kForeverMillis;
 
rst.startSet();
rst.initiate(config);
 
const dbName = jsTest.name();
const collName = "coll";
 
let primary1 = rst.getPrimary();
const primaryDB = primary1.getDB(dbName);
const primaryColl = primaryDB[collName];
const secondaries = rst.getSecondaries();
 
jsTestLog("Do a document write");
assert.commandWorked(primaryColl.insert({_id: 1, x: 1}, {"writeConcern": {"w": 4}}));
rst.awaitReplication();
 
// Create a network partition so that we end up in this state: [P] [S0, S1, S2].
jsTestLog("Disconnect primary1 from all secondaries");
primary1.disconnect([secondaries[0], secondaries[1], secondaries[2]]);
 
jsTestLog("Make secondary0 to be become primary");
assert.commandWorked(secondaries[0].adminCommand({"replSetStepUp": 1}));
 
// Now our network topology will be [P] [S0->P, S1, S2].
jsTestLog("Wait for secondary0 to become master");
checkLog.contains(secondaries[0], "Transition to primary complete");
let primary2 = secondaries[0];
 
jsTestLog("Adding a new voting node to the replica set");
const node5 = rst.add({
    rsConfig: {priority: 0, votes: 1},
    setParameter: {
        'numInitialSyncAttempts': 1,
        'enableAutomaticReconfig': false,
    }
});
 
// Simulate this network topology [P] [P, S1, S2, S3].
node5.disconnect([primary1]);
 
// Run a reconfig command on primary2 to add node 5.
config = rst.getReplSetConfigFromNode(1);
var newConfig = rst.getReplSetConfig();
config.members = newConfig.members;
config.version += 1;
assert.adminCommandWorkedAllowingNetworkError(
    primary2, {replSetReconfig: config, maxTimeMS: ReplSetTest.kDefaultTimeoutMS});
 
// Make sure new writes are able to propagate to the newly added node.
jsTestLog("Do a document write on primary2");
assert.commandWorked(
    primary2.getDB(dbName)[collName].insert({_id: 2, x: 2}, {"writeConcern": {"w": 4}}));
 
// Now get into this state: [P, S2] [P, S1, S3].
jsTestLog("Disconnect Secondary2 from primary2 and reconnect to primary1");
secondaries[2].disconnect([secondaries[0], secondaries[1], node5]);
secondaries[2].reconnect([primary1]);
 
jsTestLog("Kill and restart Secondary2");
rst.stop(3, 9, {allowedExitCode: MongoRunner.EXIT_SIGKILL}, {forRestart: true});
jsTestLog("Restarting the node.");
var restartNode = rst.start(3, {startClean: true}, true);
 
jsTestLog("wait for secondary state");
waitForState(restartNode, ReplSetTest.State.SECONDARY);
 
jsTestLog("Adding a new voting node to the replica set");
const node6 = rst.add({
    rsConfig: {priority: 0, votes: 1},
    setParameter: {
        'numInitialSyncAttempts': 1,
        'enableAutomaticReconfig': false,
    }
});
 
// Simulate this network topology [P, S2, S4] [P, S1, S3].
node6.disconnect([secondaries[0], secondaries[1], node5]);
 
// Run a reconfig command on primary1 to add node 6.
config = rst.getReplSetConfigFromNode(0);
newConfig = rst.getReplSetConfig();
// Append only the newly added node (node6) as the fifth member of primary1's config.
config.members[4] = newConfig.members[5];
config.version += 1;
assert.adminCommandWorkedAllowingNetworkError(
    primary1, {replSetReconfig: config, maxTimeMS: ReplSetTest.kDefaultTimeoutMS});
 
jsTestLog(
    "Do some document writes to verify we have 2 primaries and both satisfy write concern majority");
assert.commandWorked(primary1.getDB(dbName)[collName].insert({_id: 3, x: "primary1 Doc"},
                                                             {"writeConcern": {"w": "majority"}}));
assert.commandWorked(primary1.getDB(dbName)[collName].insert({_id: 4, x: "primary1 Doc"},
                                                             {"writeConcern": {"w": 3}}));
assert.commandWorked(primary1.getDB(dbName)[collName].insert({_id: 5, x: "primary1 Doc"},
                                                             {"writeConcern": {"w": "majority"}}));
assert.commandWorked(primary2.getDB(dbName)[collName].insert({_id: 6, x: "primary2 Doc"},
                                                             {"writeConcern": {"w": "majority"}}));
 
jsTestLog("Verify our primary1 can be get re-elected.");
assert.commandWorked(primary1.adminCommand({"replSetStepDown": 1000, "force": true}));
assert.commandWorked(primary1.adminCommand({replSetFreeze: 0}));
assert.commandWorked(primary1.adminCommand({"replSetStepUp": 1}));
 
jsTestLog("Test completed");
rst.stopSet();
}());

Participants:

 Description   

While working on the initial sync semantics upgrade/downgrade piece, I found a scenario that can lead to 2 primaries in a replica set, where both primaries can satisfy write concern "majority". It seems like a safe reconfig bug.
Below is the scenario. Assume 'P' indicates a primary, 'S' indicates a secondary, and all the nodes involved in the scenario are voters (votes: 1).
1) Start a 4-node replica set A, B, C, D ==> [A(P), B(S), C(S), D(S)], write/elect quorum = 3.
2) Create network partitions X & Y. Partition X: [A(P)] Partition Y: [B(S), C(S), D(S)].
3) Step up node B. Partition X: [A(P)] Partition Y: [B(P), C(S), D(S)].
4) Add a new node E to partition Y using a reconfig cmd. Partition X: [A(P)] Partition Y: [B(P), C(S), D(S), E(S)]; the write/elect quorum is still 3.
5) Move node D to the partition X pool and have it restart and resync from node A. Partition X: [A(P), D(S)] Partition Y: [B(P), C(S), E(S)].
6) Add a new node F to partition X using a reconfig cmd. Partition X: [A(P), D(S), F(S)] Partition Y: [B(P), C(S), E(S)].

  • To be noted, the prerequisites for a reconfig cmd are that
    the current config C_i must be majority committed and all committed entries in the previous config C_(i-1) must also be committed in the current config C_i (see the sketch below). For node A, its current config C_i is [A, B, C, D] (commit quorum = 3), which is majority committed, and all committed entries in the previous config C_(i-1) are also committed (plus the quorum check step: it is also able to contact a majority of the nodes A, D, F in the new config). So node A was successfully able to run the reconfig cmd, updating and persisting the new config document, i.e., from [A, B, C, D] -> [A, B, C, D, F], and its write/elect quorum is still 3.
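
As a rough shell-level illustration of that precondition (assuming a 4.4+ primary, where replSetGetConfig supports the commitmentStatus option, and a made-up host/_id for the new member):

// Ask the primary whether its current config is majority committed before
// issuing the next reconfig (the commitmentStatus option assumes MongoDB 4.4+).
const res = db.adminCommand({replSetGetConfig: 1, commitmentStatus: true});
assert.commandWorked(res);
if (res.commitmentStatus) {
    // The current config C_i is committed, so a new config may be installed.
    const cfg = res.config;
    cfg.members.push({_id: 5, host: "newNodeF:27017", priority: 0, votes: 1});  // hypothetical member
    cfg.version += 1;
    assert.commandWorked(db.adminCommand({replSetReconfig: cfg}));
}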

So, at the end of this, partition X thinks its config is [A, B, C, D, F] and partition Y thinks its config is [A, B, C, D, E], with A being the primary in partition X and B being the primary in partition Y.
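
To make the divergence concrete, it could be observed from the shell roughly like this (hostnames are placeholders for nodes A and B):

// Compare the config that each partition's primary is serving.
const connA = new Mongo("nodeA:27017");  // primary of partition X
const connB = new Mongo("nodeB:27017");  // primary of partition Y
const cfgA = connA.getDB("admin").runCommand({replSetGetConfig: 1}).config;
const cfgB = connB.getDB("admin").runCommand({replSetGetConfig: 1}).config;
print("Partition X members: " + tojson(cfgA.members.map(m => m.host)));
print("Partition Y members: " + tojson(cfgB.members.map(m => m.host)));
// Expected: X reports [A, B, C, D, F] and Y reports [A, B, C, D, E], while
// hello/isMaster on both A and B reports a writable primary.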

Note: This problem can also be reproduced with initial sync semantics on. I have attached a jstest that demonstrates the problem.



 Comments   
Comment by Siyuan Zhou [ 12/May/20 ]

Thanks suganthi.mani for closing the ticket. I updated its title to be more accurate.

Comment by Suganthi Mani [ 04/May/20 ]

Thanks everyone for sharing their thoughts.
I totally agree that the scenario I mentioned in this ticket, which requires resyncing an existing member of the replica set by erasing its data directory, is not a fail-stop failure, because the fail-stop failure model's stable storage property requires that the stable storage of the failed node not be affected by the (crash) failure and always remain readable. And our safe reconfig is designed for fail-stop failures.

To conclude this discussion, resyncing existing members of the replica set by erasing the data directory is not safe. It is just like a force reconfig, which can lead to rollback of majority committed writes.
judah.schvimer pointed me to SERVER-47363, which will update the MongoDB documentation to state that the recommended approach to resyncing a node is to remove the resyncing node from the replica set and then reconfigure the replica set to add that node back with a different member id (a rough sketch of that flow is below).
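
A minimal sketch of that recommended flow, assuming the member to be resynced is "nodeD:27017" (the hostname and the exact _id handling here are illustrative, not the final wording of SERVER-47363):

// 1) Reconfig the resyncing member out of the set.
let cfg = rs.conf();
cfg.members = cfg.members.filter(m => m.host !== "nodeD:27017");
cfg.version += 1;
assert.commandWorked(db.adminCommand({replSetReconfig: cfg}));

// 2) Wipe the node's dbpath, restart it, and reconfig it back in under a new,
//    previously unused member _id, so it rejoins as a brand new member rather
//    than as a returning voter whose old acknowledgements still count.
cfg = rs.conf();
const newId = Math.max(...cfg.members.map(m => m._id)) + 1;
cfg.members.push({_id: newId, host: "nodeD:27017"});
cfg.version += 1;
assert.commandWorked(db.adminCommand({replSetReconfig: cfg}));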

So, I am happy to close this ticket.

Comment by Siyuan Zhou [ 01/May/20 ]

suganthi.mani, that's a good thought!

Paxos, Raft and the MongoDB consensus protocol are all designed for the fail-stop failure model. Erasing a node's data and restarting it with the same host and port isn't a fail-stop failure.

In the Raft paper - In Search of an Understandable Consensus Algorithm:

Servers are assumed to fail by stopping; they may later recover from state on stable storage and rejoin the cluster.

In Paxos Made Simple:

Assume that agents can communicate with one another by sending messages. We use the customary asynchronous, non-Byzantine model, in which:

  • Agents operate at arbitrary speed, may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted.
  • Messages can take arbitrarily long to be delivered, can be duplicated, and can be lost, but they are not corrupted.
Comment by Suganthi Mani [ 01/May/20 ]

Snippet  from the companion doc. 

Always re-sync from a primary node
This is easy for us to enable and always is possible. This does not guarantee we don't roll back majority committed writes though. If the chosen primary is a stale primary that is missing writes that the re-syncing node helped commit after the chosen primary was elected, then those missing writes could be rolled back. This is incredibly unlikely though, so most of the time this will prevent majority committed write loss.

Here is the thing.
Note: CT = config term, CV = config version.
Initially I have a config C1: [A, B, C, D], CV = 1, CT = 1. Primary A is not reachable, so B steps up.
In order to step up, B does a reconfig, such that we have config C2: [A, B, C, D], CV = 1, CT = 2.
Now, we try to add node E via a reconfig cmd. In order to install C3: [A, B, C, D, E], we first run the config commitment check:
1) Config C_i is majority committed.
2) All committed entries in the previous config C_(i-1) are committed in the current config C_i.
For these 2 things to happen, node D should have voted for the majority writes (config commitment check).
Now, node B is able to persist the new config C3: [A, B, C, D, E], CV = 2, CT = 2, and is able to replicate it to a majority of nodes, say B, C and E.

Now,

node A --> C1: [A, B, C, D]    CV = 1, CT = 1
node B --> C3: [A, B, C, D, E] CV = 2, CT = 2
node C --> C3: [A, B, C, D, E] CV = 2, CT = 2
node D --> C2: [A, B, C, D]    CV = 1, CT = 2
node E --> C3: [A, B, C, D, E] CV = 2, CT = 2

So, if node D resyncs from the stale primary node A, we will lose majority committed data. Is that expected? Now the current config is C3, whose version is greater than C2's, C3 is majority committed, and node D did not help majority-commit that C3 config; in the past, node D only helped node B majority-commit the config C2. Is the argument that node D is still liable and can't resync from the stale primary A? Am I correct?
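
For reference, the per-node config in the list above can be read off replSetGetStatus roughly like this (the configTerm field in member entries assumes 4.4+, and entries for unreachable members only reflect the last heartbeat seen):

// Print the config version/term each member is running with.
const status = rs.status();
status.members.forEach(function(m) {
    print(m.name + " -> CV=" + m.configVersion + " CT=" + m.configTerm);
});
// In the state above this should show CV=1,CT=1 for node A, CV=1,CT=2 for
// node D, and CV=2,CT=2 for nodes B, C and E.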

Comment by William Schultz (Inactive) [ 01/May/20 ]

1) I am not sure whether the node currently acknowledges the config as durable.

We do explicitly wait for durability of the config document before we write it down during a heartbeat reconfig, so we should not be able to satisfy the config commitment condition until configs are durable on the required nodes.

2) My understanding is that the scenario in the jstest should only be invalid if a majority of the nodes in the replica set are stopped and restarted with a clean data directory.

I'm not sure I entirely follow your point here, but if I understand correctly, it seems like you're saying that as long as we lose ("lose" as in permanent data erasure) at most a minority of nodes, we should be able to guarantee the election safety property for reconfigs. I'm not really sure that's true. Or at least, maybe I have a different perception of the guarantees we aim to provide in this scenario.

For example, the permanent loss (erasure of data) of even a single node in a 3-node replica set can lead to loss of a majority committed write (which seems to be what your example with 4 nodes is demonstrating?). With 3 nodes, n1, n2, and n3, if we commit a write on n1 and n2, and then n2 loses its data, n3 can get elected without the committed write. So, in replica set scenarios where we expect to handle permanent data loss of some number (i.e. a minority) of nodes, it doesn't seem that important to expect that election safety is upheld if we already know that committed writes might be lost. That is, safety of the reconfig protocol doesn't really matter, since we've already given up the essential safety properties of the underlying protocol.
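
For illustration, a rough jstest-style sketch of that 3-node sequence, reusing the same ReplSetTest helpers as the attached repro (untested, and it elides the waits/retries a real test would need):

(function() {
'use strict';
const rst = new ReplSetTest({nodes: 3, useBridge: true});
rst.startSet();
rst.initiate();

const n1 = rst.getPrimary();
const [n2, n3] = rst.getSecondaries();
const n2Id = rst.getNodeId(n2);

// Commit a write on n1 and n2 only; with 3 voters this satisfies w:"majority".
n3.disconnect([n1, n2]);
assert.commandWorked(
    n1.getDB("test").coll.insert({_id: 1}, {writeConcern: {w: "majority"}}));

// n2 "loses its data": kill it and restart it with a clean data directory so
// that it resyncs from n3, which never saw the committed write.
n1.disconnect([n2]);
n3.reconnect([n2]);
rst.stop(n2Id, 9, {allowedExitCode: MongoRunner.EXIT_SIGKILL}, {forRestart: true});
const restartedN2 = rst.start(n2Id, {startClean: true}, true);
rst.waitForState(restartedN2, ReplSetTest.State.SECONDARY);

// n3 can now win an election without {_id: 1}: a w:"majority" write is lost.
assert.commandWorked(n3.adminCommand({replSetStepUp: 1}));
rst.stopSet();
}());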

Comment by Suganthi Mani [ 01/May/20 ]

william.schultz 

The crash and restart violates durability assumptions of reconfig i.e. once a node acknowledges a config as durable you are guaranteed it will always have that config or a newer one.

My response:  
1) I am not sure whether the node currently acknowledges the config as durable. My understanding of the config await step is that a majority of nodes have acknowledged that they have received the new config. But that doesn't mean that a node has persisted the newly received config to disk (i.e., that the new config's journal/log entries were made durable) and will be able to recover the new config on restart after an unexpected node crash.

2) My understanding is that the scenario in the jstest should only be invalid if a majority of the nodes in the replica set are stopped and restarted with a clean data directory. As long as only a minority of nodes are restarted and resynced with a clean data directory, we should be able to guarantee the safety property (no two primaries in the same term).
For example, assume we have a 4-node replica set A, B, C, D, with node A as primary. Assume node A writes W1 with ts(1) and it gets replicated to node B and node C. As a result, the lastApplied of nodes B and C becomes ts(1), which leads node A to advance its lastCommittedTs to ts(1) (i.e., a majority of nodes have applied ts(1)). That doesn't mean that, just because node C previously acknowledged the write W1, restarting node C with a clean data directory can cause the replica set to lose the majority committed data W1.
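
A rough way to inspect that distinction from the shell (field names as reported by replSetGetStatus in recent server versions; optimeDurable may be absent for the self entry):

// Compare the primary's majority commit point with what each member reports
// as applied vs. durable (journaled).
const s = rs.status();
print("majority commit point: " + tojson(s.optimes.lastCommittedOpTime));
s.members.forEach(function(m) {
    print(m.name + " applied: " + tojson(m.optime) +
          " durable: " + (m.optimeDurable ? tojson(m.optimeDurable) : "n/a"));
});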

Let me know if I am missing any crucial detail about config durability.

Comment by William Schultz (Inactive) [ 01/May/20 ]

suganthi.mani My understanding of the scenario outlined in your repro and description is as follows. I find it simpler to illustrate a trace like this without explicitly representing network partitions, since the absence of a message sent between some pair of nodes is equivalent to a network partition between those nodes.

CT="Config Term"
CV="Config Version"
{} means a node just started up and hasn't received any config yet.
Pn means a node is primary in term "n".
 
We start up with n1 as primary and all nodes have the same config.
 
<Init>
n1: {n1,n2,n3,n4}    CT=1,CV=1 P1
n2: {n1,n2,n3,n4}    CT=1,CV=1
n3: {n1,n2,n3,n4}    CT=1,CV=1
n4: {n1,n2,n3,n4}    CT=1,CV=1
n5: {}
n6: {}
 
P2 gets elected primary and commits a new config in term 2.
 
<State 1>
n1: {n1,n2,n3,n4}    CT=1,CV=1 P1
n2: {n1,n2,n3,n4}    CT=2,CV=1 P2
n3: {n1,n2,n3,n4}    CT=2,CV=1
n4: {n1,n2,n3,n4}    CT=2,CV=1
n5: {}
n6: {}
 
P2 reconfigs to add in n5. This is allowed since it committed a config in its own term (2) already.
 
<State 2>
n1: {n1,n2,n3,n4}    CT=1,CV=1 P1
n2: {n1,n2,n3,n4,n5} CT=2,CV=2 P2
n3: {n1,n2,n3,n4}    CT=2,CV=1
n4: {n1,n2,n3,n4}    CT=2,CV=1
n5: {}
n6: {}
 
P1 reconfigs to add in n6, which is allowed since it already committed a config in its own term (1). It also propagates its config to n6.
 
<State 3>
n1: {n1,n2,n3,n4,n6} CT=1,CV=2 P1 
n2: {n1,n2,n3,n4,n5} CT=2,CV=2 P2
n3: {n1,n2,n3,n4}    CT=2,CV=1
n4: {n1,n2,n3,n4}    CT=2,CV=1
n5: {}
n6: {n1,n2,n3,n4,n6} CT=1,CV=2

Now, the question is whether n1 could get elected in term 2, which would cause an election safety violation, i.e., two primaries in term 2. To do so, it needs a quorum of votes in its current config {n1,n2,n3,n4,n6}. It can garner votes from n1 and n6, since they are both on the same config now, but it needs one other voter to form a 3-vote quorum in its 5-node config. None of the other 3 nodes {n2,n3,n4} can cast votes for n1 because they have higher config terms (see SERVER-46387). So the election in term 2 cannot proceed for n1. In your repro, however, one of those 3 nodes (n2, n3, n4) is crashed and its data wiped out, so it ends up being able to cast a vote for n1. To be fair, I did not actually check the logs of the repro to confirm this is exactly what happened, but that is my guess. The crash and restart violates durability assumptions of reconfig i.e. once a node acknowledges a config as durable you are guaranteed it will always have that config or a newer one.
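
The vote counting can be made concrete with a tiny helper (this is just the arithmetic, with the SERVER-46387 rule simplified to a config-term comparison):

// A node refuses to vote for a candidate whose config is staler than its own;
// approximate that here by comparing config terms only.
function canWinElection(candidateConfigTerm, voterConfigTerms, configSize) {
    const votes = voterConfigTerms.filter(t => t <= candidateConfigTerm).length;
    return votes >= Math.floor(configSize / 2) + 1;
}
// Without the data wipe, only n1 (CT=1) and n6 (CT=1) can vote for n1: 2 < 3.
assert(!canWinElection(1, [1 /* n1 */, 2 /* n2 */, 2 /* n3 */, 2 /* n4 */, 1 /* n6 */], 5));
// After n4 is wiped and resynced from n1, it is back on a CT=1 config and votes: 3 >= 3.
assert(canWinElection(1, [1 /* n1 */, 2 /* n2 */, 2 /* n3 */, 1 /* n4 */, 1 /* n6 */], 5));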

It seems to me that the repro depends on the fact that we erase some node's data (step 5 in your description). If you can modify the repro so that the bug still occurs without depending on the deletion of some data, then I think there could be a real bug, but I don't think the repro as written is a bug because of the durability assumptions I mentioned. Let me know if I made an error in any of the above steps.

In the context of initial sync semantics, I would imagine that the "resync" you point out in step 5 should be safe as long as we don't erase our config document. My understanding was that initial sync semantics made sure that even with full resyncs, we still guarantee that w:majority writes are never lost. Clearing the entire data directory, however, seems more aggressive than simply re-syncing our replicated data.
