[SERVER-47949] Do not install higher safe config via heartbeats in primary Created: 05/May/20 Updated: 29/Oct/23 Resolved: 09/May/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.0-rc7, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Siyuan Zhou | Assignee: | William Schultz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | safe-reconfig-related | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Backport Requested: |
v4.4
|
||||||||||||||||
| Sprint: | Repl 2020-05-18 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Linked BF Score: | 37 | ||||||||||||||||
| Description |
|
A primary runs an automatic reconfig to bump the config term on stepup. If it learned a newer config via heartbeats during catchup mode and drain mode, the auto reconfig may fail, leaving an inconsistent config term with the primary's term. The primary should not fetch the higher safe config in primary state. It also should not install the new config in _scheduleHeartbeatReconfig_inlock. Force config should always be accepted via heartbeats on primary though. Consider making the error check on stepup stricter too. |
| Comments |
| Comment by Githook User [ 14/May/20 ] | ||||||||||||||||||||||||||||||
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}Message: (cherry picked from commit 2546fe1c22b0777ca68e604376900ea11f10ee3a)
(cherry picked from commit 74cc45677605baf25cd57caa497e254883e88567)
(cherry picked from commit 852d7eb6c7e7b34a78bd6f69ec8c1a65fa179c9e) | ||||||||||||||||||||||||||||||
| Comment by Githook User [ 09/May/20 ] | ||||||||||||||||||||||||||||||
|
Author: {'name': 'William Schultz', 'email': 'william.schultz@mongodb.com', 'username': 'will62794'}Message: | ||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 05/May/20 ] | ||||||||||||||||||||||||||||||
|
This is a problem only when the primary runs the automatic reconfig on stepup when it just received and is installing a safe config via heartbeat. If the config via heartbeat is installed earlier, then the automatic reconfig will bump its version and work as expected. If the config via heartbeat comes later, the primary will reject it since its config has a higher term. User initiated safe reconfig is not affected. Best effort is our design before disallowing safe reconfig in catchup mode and drain mode. Now we can have a stricter rule. | ||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 05/May/20 ] | ||||||||||||||||||||||||||||||
|
Actually, the user initiated reconfig case might not be an issue since we don't allow non force reconfigs to run during drain mode. | ||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 05/May/20 ] | ||||||||||||||||||||||||||||||
|
judah.schvimer Good point. Yeah, there might be. For example if an in-progress user initiated reconfig interrupts the step up reconfig, but then the user reconfig fails for some reason (or gets interrupted itself), this might lead to the same issue. I will consider this while I'm fixing this bug. | ||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 05/May/20 ] | ||||||||||||||||||||||||||||||
|
Does a similar bug exist where user initiated reconfigs can interrupt the step-up reconfig, or are user initiated reconfigs not allowed until the step-up reconfig completes? | ||||||||||||||||||||||||||||||
| Comment by William Schultz (Inactive) [ 05/May/20 ] | ||||||||||||||||||||||||||||||
|
siyuan.zhou It seems this bug leads to a protocol safety issue. For example, this behavior:
At the final state, it seems equivalent to if n3 had been elected in config (2,1), so it's equivalent to having been elected and not writing a config in a new term, which is unsafe. We depend on the atomicity of writing the new config on step up (excepting the interruption case mentioned below). To judah.schvimer's point, I think that we expected that a concurrent reconfig command interrupting the step up reconfig would be safe, since that reconfig would write a config in the new term. I don't think we accounted for a concurrent heartbeat reconfig, though, writing a config in the older term. Basically, once you become a writable primary, we need to make sure that the node has written a config in its current term so that it can't execute a reconfig before committing that config. We violate that here. I agree that preventing non force heartbeat reconfigs while we are in primary state seems a correct solution. In hindsight it seems kind of odd that we ever allowed non force configs to be installed via heartbeat while a node is primary. | ||||||||||||||||||||||||||||||
| Comment by Judah Schvimer [ 05/May/20 ] | ||||||||||||||||||||||||||||||
What's the risk here, if the automatic reconfig fails? That sounds problematic, and this ticket is saying it's a problem, but also the code says that the automatic reconfig is "best effort", so we can't be relying too strongly on the automatic reconfig succeeding. | ||||||||||||||||||||||||||||||
| Comment by Siyuan Zhou [ 05/May/20 ] | ||||||||||||||||||||||||||||||
|
william.schultz, does the solution make sense to you? I cannot think of any correctness issue. The TLA+ spec makes election and config term increment in an atomic step, so this ticket proposes a behavior closer to the spec. |