[SERVER-47852] Two primaries can satisfy write concern "majority" after data erased on a node Created: 30/Apr/20 Updated: 27/Oct/23 Resolved: 11/May/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Suganthi Mani | Assignee: | Backlog - Replication Team |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Steps To Reproduce: | |
| Participants: | |
| Description |
While working on the initial sync semantics upgrade/downgrade piece, I found a scenario that can lead to two primaries in a replica set, with both primaries able to satisfy write concern "majority". It seems like a safe reconfig bug.

So, at the end of this, partition X thinks its config is [A, B, C, D, F] and partition Y thinks it is [A, B, C, D, E], with A being the primary on partition X and B being the primary on partition Y.

Note: This problem can also be reproduced with initial sync semantics on. I have attached a jstest that demonstrates the problem.
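The attached jstest is not included in this export; as an illustrative sketch only (host names, ports, and config versions are assumed, not taken from the test), the diverged end state can be pictured as two config documents:

```javascript
// Illustrative sketch of the diverged end state (not the attached jstest;
// host names, ports, and config versions are placeholders).

// Partition X: primary A believes the config is [A, B, C, D, F].
var cfgX = {
    _id: "rs0", version: 2,
    members: [
        {_id: 0, host: "A:27017"}, {_id: 1, host: "B:27017"},
        {_id: 2, host: "C:27017"}, {_id: 3, host: "D:27017"},
        {_id: 5, host: "F:27017"}
    ]
};

// Partition Y: primary B believes the config is [A, B, C, D, E].
var cfgY = {
    _id: "rs0", version: 2,
    members: [
        {_id: 0, host: "A:27017"}, {_id: 1, host: "B:27017"},
        {_id: 2, host: "C:27017"}, {_id: 3, host: "D:27017"},
        {_id: 4, host: "E:27017"}
    ]
};

// Each primary needs only 3 acks in its own 5-member config, so a
// w:"majority" write can succeed on both sides at once:
db.test.insert({x: 1}, {writeConcern: {w: "majority"}});
```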
| Comments |
| Comment by Siyuan Zhou [ 12/May/20 ] |
Thanks suganthi.mani for closing the ticket. I updated its title to be more accurate.
| Comment by Suganthi Mani [ 04/May/20 ] |
Thanks, everyone, for sharing your thoughts. To conclude this discussion: resyncing existing members of the replica set by erasing the data directory is not safe. It is just like a force reconfig, which can lead to rollback of majority-committed writes. So, I am happy to close this ticket.
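For comparison, a force reconfig in the shell looks like this (rs.reconfig with {force: true} is the documented shell API; the specific member edit is only illustrative):

```javascript
// A force reconfig bypasses the safe-reconfig checks (config commitment
// and oplog commitment), much like erasing a member's data directory
// does, and can therefore roll back majority-committed writes.
var cfg = rs.conf();
// Illustrative edit: drop member E from the config.
cfg.members = cfg.members.filter(function(m) { return m.host !== "E:27017"; });
rs.reconfig(cfg, {force: true}); // unsafe: may lose w:"majority" writes
```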
| Comment by Siyuan Zhou [ 01/May/20 ] |
suganthi.mani, that's a good thought! Paxos, Raft, and the MongoDB consensus protocol are all designed for the fail-stop failure model. Erasing a node's data and restarting it with the same host and port is not a fail-stop failure. From the Raft paper, In Search of an Understandable Consensus Algorithm:
| Comment by Suganthi Mani [ 01/May/20 ] |
Snippet from the companion doc.
Here is the thing:

Node A is on config C1: [A, B, C, D], CV = 1, CT = 1.

So, if node D re-syncs from the stale primary node A, we will lose majority-committed data. Is that the expected behavior?

Now the config is C3, whose version is greater than C2's; C3 is majority committed, and node D did not help majority-commit that C3 config, whereas in the past node D did help node B majority-commit the config C2. Is the argument that node D is still liable and cannot resync from the stale primary A? Am I correct?
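For reference, the CV/CT pair above lives on the replica set config document itself (its version and term fields); a minimal illustrative example of C1, with placeholder hosts:

```javascript
// Illustrative config document for C1: members [A, B, C, D],
// config version (CV) = 1, config term (CT) = 1.
var C1 = {
    _id: "rs0",
    version: 1,  // CV: incremented on every reconfig
    term: 1,     // CT: the term of the primary that installed the config
    members: [
        {_id: 0, host: "A:27017"},
        {_id: 1, host: "B:27017"},
        {_id: 2, host: "C:27017"},
        {_id: 3, host: "D:27017"}
    ]
};
```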
| Comment by William Schultz (Inactive) [ 01/May/20 ] |
We do explicitly wait for durability of the config document when we write it down during a heartbeat reconfig, so we should not be able to satisfy the config commitment condition until configs are durable on the required nodes.

I'm not sure I entirely follow your point here, but if I understand correctly, you're saying that as long as we lose ("lose" as in permanent data erasure) at most a minority of nodes, we should be able to guarantee the election safety property for reconfigs. I'm not really sure that's true. Or at least, maybe I have a different perception of the guarantees we aim to provide in this scenario.

For example, the permanent loss (erasure of data) of even a single node in a 3-node replica set can lead to loss of a majority-committed write (which seems to be what your example with 4 nodes is demonstrating?). With 3 nodes n1, n2, and n3, if we commit a write on n1 and n2, and then n2 loses its data, n3 can get elected without the committed write.

So, in replica set scenarios where we expect to handle permanent data loss on some number (i.e. a minority) of nodes, it doesn't seem that important to expect that election safety is upheld when we already know that committed writes might be lost. That is, safety of the reconfig protocol doesn't really matter once we've given up the essential safety properties of the underlying protocol.
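The 3-node example as a schematic sequence (not a runnable repro; names are illustrative):

```javascript
// 1. n1 is primary; a write commits on the majority {n1, n2}.
db.coll.insert({x: 1}, {writeConcern: {w: "majority"}}); // acked by n1, n2

// 2. n2's data directory is permanently erased and it restarts empty,
//    keeping its host:port identity but losing the committed write.

// 3. n3, which never received the write, can now win an election with
//    votes from itself and the freshly wiped n2 -- a 2/3 quorum that no
//    longer contains the majority-committed write, so the write is lost.
```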
| Comment by Suganthi Mani [ 01/May/20 ] |
My response:

2) My understanding is that the scenario in the jstest would be invalid only if a majority of nodes in the replica set were stopped and restarted with a clean data directory. As long as only a minority of nodes are restarted and resynced with a clean data directory, we should be able to guarantee the safety property (no two primaries in the same term).

Let me know if I am missing any crucial detail about config durability.
| Comment by William Schultz (Inactive) [ 01/May/20 ] |
suganthi.mani, my understanding of the scenario outlined in your repro and description is as follows. I find it simpler to illustrate a trace like this without explicitly representing network partitions, since the absence of a message between some pair of nodes is equivalent to a network partition between those nodes.
Now, the question is whether n1 could get elected in term 2, which would cause an election safety violation, i.e. two primaries in term 2. To do so, it needs a quorum of votes in its current config {n1,n2,n3,n4,n6}. It can garner votes from n1 and n6, since they are both on the same config now, but it needs one other voter to form a 3-vote quorum in its 5-node config. None of the other three nodes {n2,n3,n4} can cast votes for n1 because they have higher config terms.

It seems to me that the repro depends on the fact that we erase some node's data (step 5 in your description). If you can modify the repro so that the bug still occurs without depending on the deletion of some data, then I think there could be a real bug; but I don't think the repro as written demonstrates a bug, because of the durability assumptions I mentioned. Let me know if I made an error in any of the above steps.

In the context of initial sync semantics, I would imagine that the "resync" you point out in step 5 should be safe as long as we don't erase our config document. My understanding was that initial sync semantics made sure that even with full resyncs, we still guarantee that w:majority writes are never lost. Clearing the entire data directory, however, seems more aggressive than simply re-syncing our replicated data.
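The config-staleness rule applied in the vote count above can be sketched as follows (simplified pseudologic, not the actual server implementation): configs are ordered first by config term, then by config version.

```javascript
// Simplified sketch: a voter declines a vote request from a candidate
// whose config is stale. Not the actual server code.
function isConfigStale(voter, candidate) {
    if (candidate.configTerm !== voter.configTerm) {
        return candidate.configTerm < voter.configTerm;
    }
    return candidate.configVersion < voter.configVersion;
}

// In the trace above, n2, n3, and n4 each have a higher config term than
// n1, so isConfigStale(nX, n1) is true and they decline to vote. n1 can
// gather only 2 of the 3 votes it needs in its 5-member config.
```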