[SERVER-47331] Rethink the transition from force reconfig to safe reconfig Created: 03/Apr/20  Updated: 29/Oct/23  Resolved: 13/Apr/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Task Priority: Major - P3
Reporter: Siyuan Zhou Assignee: Siyuan Zhou
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
is documented by DOCS-13582 Investigate changes in SERVER-47331: ... Closed
Related
related to SERVER-47495 Ban force reconfig with "newlyAdded" ... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2020-04-20
Participants:

 Description   

When the current config C0 is installed by a "force" reconfig, the next non-force reconfig with config C1 doesn't prevent config divergence if
1. Reconfig C1 has not propagated to a majority of nodes.
2. A failover happens
3. A new reconfig with a different config C2 runs on the new primary.
4. C1 and C2 propagate to disjoint nodes.

The diverged configs may allow two primaries to be elected in the same term until C2 (which has a higher config term) propagates to a majority of the nodes in C1. A similar issue is shown in SERVER-47119 with a detailed trace.

In the Initial Sync Semantics (ISS) project, we will add new nodes with votes: 0 and later run an automatic reconfig to grant them votes. The reconfig that adds such a node is subject to the unsafe but rare case described above. Once that first reconfig passes the unsafe period and becomes committed, the subsequent automatic reconfigs will be safe.

To avoid the unsafe case, one idea is to run an automatic reconfig after a force reconfig, increasing the config version and giving the config a config term. After this automatic reconfig, subsequent reconfigs will be safe. However, when users run a force reconfig, the replica set is likely unstable and they are willing to risk losing committed data, so it may not be the right time to run such an automatic reconfig.

Even worse, the automatic reconfig may interrupt the propagation of the force reconfig. For example, suppose the current config C0 has 5 nodes, and a force reconfig C1 runs on a secondary to convert that secondary into a single-node replica set. C1 increases the config version but removes the config term, then propagates to other nodes on their next heartbeats; nodes in C0 become REMOVED after learning C1. However, if an automatic reconfig C2 happens on the single-node replset, then since C2 has a config term, that term must be higher than C0's for C2 to propagate, which may not be the case if another election occurs in C0. As a result, C2 may be unable to propagate to nodes still in C0. If the terms are the same, nodes in C0 keep a diverged config: they stay alive and keep sending heartbeats to the single-node replset. Once either C0 or C2 has the higher term, its config propagates to the other side, potentially overriding the force reconfig.
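The scenario above hinges on how configs are ordered during heartbeat propagation. The following is a minimal Python sketch of that ordering, under the assumption that configs are compared by (config term, config version) and that a force reconfig drops the term (modeled here as -1) so that version alone decides; the names and exact tie-breaking are illustrative, not the server's implementation.

```python
from collections import namedtuple

UNINITIALIZED_TERM = -1  # stands in for a force reconfig's missing config term
Config = namedtuple("Config", ["term", "version"])

def is_newer(a, b):
    """Return True if config a should replace config b on a heartbeat (sketch)."""
    if a.term == UNINITIALIZED_TERM or b.term == UNINITIALIZED_TERM:
        # A force reconfig has no config term, so version alone decides.
        return a.version > b.version
    return (a.term, a.version) > (b.term, b.version)

# C0: committed config of the original 5-node set.
c0 = Config(term=5, version=3)
# C1: force reconfig on one secondary; term removed, version jumped.
c1 = Config(term=UNINITIALIZED_TERM, version=20003)
# C2: automatic reconfig on the single-node set, re-adding a config term.
c2 = Config(term=6, version=20004)
# C0's branch after a later election and reconfig raises its config term.
c0b = Config(term=7, version=4)

print(is_newer(c1, c0))   # True: C1 propagates despite having no term
print(is_newer(c2, c0b))  # False: C2 cannot reach nodes still on C0's branch
print(is_newer(c0b, c2))  # True: C0's branch would override the force reconfig
```

Under this ordering, C2 loses to C0's branch exactly when C0 reaches a higher config term, which is the override risk described above.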



 Comments   
Comment by Siyuan Zhou [ 13/Apr/20 ]

I wanted to mark this "Done", but I have to go with "Fixed" to flag it for downstream attention.

Comment by Siyuan Zhou [ 13/Apr/20 ]

Thanks tess.avitabile and judah.schvimer, closing this.

Comment by Judah Schvimer [ 13/Apr/20 ]

I filed SERVER-47495.

Comment by Tess Avitabile (Inactive) [ 13/Apr/20 ]

Yes, that sounds good to me.

Comment by Judah Schvimer [ 09/Apr/20 ]

Thanks for the summary. I will file the ISS tickets once we agree on the above.

Comment by Siyuan Zhou [ 09/Apr/20 ]

Discussed with judah.schvimer and evin.roesle in person. Since automatic reconfig in ISS runs on top of the first user-initiated reconfig command, their safety is guaranteed if the user-initiated reconfig is a safe reconfig. If the user-initiated reconfig is a force reconfig, then we won't add newlyAdded fields nor run automatic reconfig at all.

The only edge case is when the user-initiated reconfig is a force reconfig with "newlyAdded" fields. It will trigger an automatic reconfig that runs on an unsafe config.

There are a few options to solve this issue.

  1. Ban force reconfig with "newlyAdded" fields
  2. Allow force reconfig with "newlyAdded" fields assuming the following automatic reconfig will be safe in most cases.
  3. Make safe reconfig after force reconfig safer. This will need a significant design that evaluates when it's safe to convert the force reconfig.
  4. Leave the force reconfig with "newlyAdded" fields as-is without running automatic reconfig. This leaves an incomplete state.

We agreed to go with option 1 since "newlyAdded" is an internal field anyway. Beyond the behavioral change, we need to document that the transition from force reconfig to safe reconfig isn't safe. I'm adding the downstream change on this ticket. tess.avitabile, does the plan sound good to you?

judah.schvimer, do you mind filing the corresponding ticket in ISS?
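For concreteness, option 1 amounts to a validation check at reconfig time. A minimal sketch, assuming a config is a plain dict with a "members" array; the function name and error message are illustrative, not the server's actual validation code.

```python
def validate_reconfig(new_config, force):
    """Sketch of option 1: reject a force reconfig whose config
    contains the internal 'newlyAdded' field on any member."""
    has_newly_added = any(m.get("newlyAdded")
                          for m in new_config.get("members", []))
    if force and has_newly_added:
        raise ValueError("replSetReconfig with {force: true} may not "
                         "specify the internal 'newlyAdded' field")
    return True

cfg = {"version": 2, "members": [
    {"_id": 0, "host": "a:27017"},
    {"_id": 1, "host": "b:27017", "votes": 0, "newlyAdded": True},
]}

validate_reconfig(cfg, force=False)  # safe reconfig path is unaffected
```

A safe reconfig containing "newlyAdded" still passes, since only the force path is banned.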

Comment by Siyuan Zhou [ 08/Apr/20 ]

If C0 can override the force reconfig, then that seems like a problem even if we don't do an automatic noop reconfig C2.

C0 shouldn't override the force reconfig. The force reconfig should take effect immediately by having a much higher config number.

If this noop automatic reconfig is particularly dangerous, then shouldn't any automatic reconfig be particularly dangerous? If that's the case, then we have to reconsider the ISS project's automatic reconfigs.

I don't think noop automatic reconfig in ISS is dangerous since the reconfig is initiated by a user. After a force reconfig, the first user-initiated safe reconfig to add a node is subject to all the potential issues of force reconfig. Its safety depends on the user as in other cases around force reconfig. In most cases, users would only run reconfig when the system is stable. The following ISS automatic reconfigs will then become safe.

As you mentioned, automatic reconfig won't be safe after force reconfig with "newlyAdded".

If the force reconfig specifies "newlyAdded", then once the primary sees that node is a secondary, the primary will initiate an automatic reconfig to remove "newlyAdded".

I'd suggest banning "newlyAdded" on force reconfig, since "newlyAdded" is supposed to be an internal field and force reconfig is supposed to be used only in emergencies.

Comment by Judah Schvimer [ 08/Apr/20 ]

I don't follow the final paragraph above.

  1. If C0 can override the force reconfig, then that seems like a problem even if we don't do an automatic noop reconfig C2.
  2. If C2 overrides the force reconfig, isn't that fine since that's what we want in the first place? C2 was created with C1 as the "base config" so it's just a safe version of C1.
  3. If this noop automatic reconfig is particularly dangerous, then shouldn't any automatic reconfig be particularly dangerous? If that's the case, then we have to reconsider the ISS project's automatic reconfigs.

I think that doing an automatic reconfig at the next chance we get would be good to narrow the window where the next reconfig will be unsafe, and could allow us to do other automatic reconfigs safely.

Comment by Judah Schvimer [ 03/Apr/20 ]

This behavior is implemented and tested in SERVER-46347, SERVER-46350, and SERVER-46348.

Comment by Judah Schvimer [ 03/Apr/20 ]

In ISS, a force reconfig will replace the current config verbatim and we will not rewrite it at all. Thus if the force reconfig does not specify "newlyAdded" that would remove the "newlyAdded" field from an existing node (if that node currently had "newlyAdded" specified). If the force reconfig specifies "newlyAdded", then once the primary sees that node is a secondary, the primary will initiate an automatic reconfig to remove "newlyAdded".
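The removal step described here can be sketched as a pure function on the config: once a "newlyAdded" member reports SECONDARY, the primary would install a new config with the flag dropped. This is a hypothetical illustration of the behavior described in the comment, assuming a dict-shaped config and that removing "newlyAdded" also restores the member's vote; none of these names come from the server code.

```python
import copy

def remove_newly_added(config, member_states):
    """Sketch: drop 'newlyAdded' (and restore the vote) for every member
    that has reached SECONDARY, bumping the config version if anything changed."""
    new = copy.deepcopy(config)
    changed = False
    for m in new["members"]:
        if m.get("newlyAdded") and member_states.get(m["host"]) == "SECONDARY":
            del m["newlyAdded"]
            m["votes"] = 1  # assumption: the flag implies votes: 0 until removal
            changed = True
    if changed:
        new["version"] += 1
    return new, changed

cfg = {"version": 2, "members": [
    {"_id": 0, "host": "a:27017", "votes": 1},
    {"_id": 1, "host": "b:27017", "votes": 0, "newlyAdded": True},
]}
new_cfg, changed = remove_newly_added(cfg, {"b:27017": "SECONDARY"})
```

Note the edge case from the thread: if the force reconfig itself specified "newlyAdded", this automatic step would run on top of an unsafe config, which is what motivates banning that combination.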

Comment by Siyuan Zhou [ 03/Apr/20 ]

judah.schvimer, what's the current design of Initial Sync Semantics if the current config is from a force reconfig? I don't see any problem in terms of the automatic reconfig.

Generated at Thu Feb 08 05:13:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.