[SERVER-54389] Audit internal uses of force reconfig and consider using safe reconfig Created: 08/Feb/21 Updated: 29/Oct/23 Resolved: 29/Mar/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 5.0.0-rc0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Pavithra Vetriselvan | Assignee: | Jason Chan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Sprint: | Repl 2021-03-22, Repl 2021-04-05 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
When using force reconfig, we will bypass the noop write while also clearing the committed snapshot. Previously, internal users of doReplSetReconfig only changed the config term and version (stepup, upgrade/downgrade). We probably favored force reconfigs in these scenarios because they are faster and still safe. Since the content of the config did not change, and therefore did not affect the quorum condition, we did not drop snapshots. However, in |
| Comments |
| Comment by Githook User [ 29/Mar/21 ] |
|
Author: {'name': 'Jason Chan', 'email': 'jason.chan@mongodb.com', 'username': 'jasonjhchan'}Message: |
| Comment by Jason Chan [ 23/Mar/21 ] |
|
Spoke with pavithra.vetriselvan, we decided that as part of this ticket, we will add an "optimized" safe reconfig code path that is entirely separate from the idea of forced reconfig. The intended use case is for internal reconfigs that do not change the consensus group (eg. bumping the version or config term). The "optimized" safe reconfig code path will have the same behavior as a regular safe reconfig, but will omit the following checks described in my comment above:
|
| Comment by Jason Chan [ 22/Mar/21 ] |
|
Separating it out from doReplSetReconfig is fine too, but I guess that would require a bit more refactoring to pull out the common code being used by doReplSetReconfig and the new "safeButOpTimizedReconfig". For what it's worth, adding a new flag to doReplSetReconfig might make it easier to choose the behavior of the "safe but optimized reconfig" codepath in case requirements change in the future. |
| Comment by Lingzhi Deng [ 22/Mar/21 ] |
|
My understanding is that we don't want to overload doReplSetReconfig for "internal, safe but optimized" reconfigs. So I guess we might want to separate that out from doReplSetReconfig as a whole? |
| Comment by Jason Chan [ 22/Mar/21 ] |
|
Currently, calling doReplSetReconfig with force=true will have the following differing behavior: After My guess is that checks (3) and (7) above are the most expensive parts of the safe reconfig protocol. I propose that we add a skipCommitOpTimeAndQuorumCheck flag to doReplSetReconfig that will skip the checks (3) and (7) when set to true. |
| Comment by Siyuan Zhou [ 08/Feb/21 ] |
|
I believe we abused "force" in the codebase. It could mean We use "force" to skip them for different reasons, so I agree we should audit and rename the arguments when necessary. |
| Comment by Judah Schvimer [ 08/Feb/21 ] |
|
In general, I think we should separate out the concept of "force reconfig" from the concept of "safe, optimized reconfig". Optimizing reconfigs in cases where we know it is safe makes sense, but it is dangerous for us to mix that in with force reconfig. When making changes to force reconfig, it is easy to forget these use cases where we expect the force reconfig behavior to be safe in practice. It is also hard to think through all of the ways force reconfig could be unsafe when using it in safe scenarios. CC siyuan.zhou, in case you have any thoughts. |