[SERVER-48179] Removing rollback node will crash the node on transition out of rollback Created: 13/May/20 Updated: 29/Oct/23 Resolved: 17/Mar/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0, 4.4.5 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Siyuan Zhou | Assignee: | Wenbin Zhu |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | safe-reconfig-related |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4 |
| Sprint: | Repl 2020-10-19, Repl 2020-11-16, Repl 2020-11-30, Repl 2020-12-14, Repl 2020-12-28, Repl 2021-03-08, Repl 2021-03-22 |
| Participants: | |
| Linked BF Score: | 49 |
| Description |
|
At the end of rollback, the node transitions to SECONDARY on the assumption that its state is still ROLLBACK. However, a reconfig received via heartbeat may already have changed the state to REMOVED, in which case the transition out of rollback crashes the node. |
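A minimal sketch of the race described above, written in Python rather than the server's C++. The `Node` class and the methods `heartbeat_reconfig_removes_node` and `finish_rollback` are hypothetical names for illustration only, not the server's API; the explicit check stands in for the invariant that brings the node down.

```python
from enum import Enum, auto


class MemberState(Enum):
    SECONDARY = auto()
    ROLLBACK = auto()
    REMOVED = auto()


class Node:
    """Toy model with a single member-state field (hypothetical, for illustration)."""

    def __init__(self):
        self.state = MemberState.ROLLBACK

    def heartbeat_reconfig_removes_node(self):
        # A config learned via heartbeat no longer contains this node.
        self.state = MemberState.REMOVED

    def finish_rollback(self):
        # Stand-in for the invariant: rollback assumes nothing else changed the state.
        if self.state != MemberState.ROLLBACK:
            raise RuntimeError(f"expected ROLLBACK, found {self.state.name}")
        self.state = MemberState.SECONDARY


def run_race():
    node = Node()
    node.heartbeat_reconfig_removes_node()  # lands while rollback is still running
    try:
        node.finish_rollback()
    except RuntimeError:
        return "crash"
    return "ok"
```

In this toy model, `run_race()` takes the failing path because the removal arrives before `finish_rollback()`; without the removal, the same call moves the node to SECONDARY.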
| Comments |
| Comment by Githook User [ 18/Mar/21 ] |
|
Author: {'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'} Message: (cherry picked from commit 0ae1138bbfc066c4c7eb9f857cf4e29447743a3c) |
| Comment by Githook User [ 16/Mar/21 ] |
|
Author: {'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'} Message: |
| Comment by Siyuan Zhou [ 08/Jul/20 ] |
|
I investigated concurrent state transitions and reconfig in general. First, let's see what would happen in various states.
When a node is a secondary, it can accept every kind of membership reconfig: removal, vote change and priority change. A node that is stepping down should therefore also accept all reconfigs, since a secondary has fewer constraints than a primary. When a node is primary and a (force) reconfig makes it unelectable, it should accept the reconfig and step down.
The candidate state is mutually exclusive with reconfig: while a reconfig is in progress, an election fails, and while an election is in progress, a reconfig interrupts it and the node transitions to steady-state secondary.
Rollback should behave like the secondary state, since it doesn't rely much on the config. The only problematic case is concurrent removal and rollback. Removal should be accepted during rollback, and we should be able to remove the invariant. The topology coordinator stores the config state and the data RSM state separately, so setFollowerMode(MemberState::RS_SECONDARY) doesn't affect the removed state. When the removed node is added back, it transitions back to its previous state: secondary, or rollback if rollback hasn't finished. The removed node will continue syncing from its sync source while in rollback, which seems fine to me. We could also interrupt rollback on removal, but I don't think it's necessary.
To summarize, we should be able to remove the invariant and add tests. |
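The separation between config state and follower state can be sketched as follows. This is a Python illustration under stated assumptions, not the topology coordinator's code: `TopologySketch`, `apply_reconfig`, `set_follower_mode` and `member_state` are hypothetical names, and only `setFollowerMode` and the state names come from the comment above.

```python
from enum import Enum, auto


class MemberState(Enum):
    SECONDARY = auto()
    ROLLBACK = auto()
    REMOVED = auto()


class TopologySketch:
    """Toy model that keeps config membership and follower mode separately."""

    def __init__(self):
        self.in_config = True                      # is this node in the current config?
        self.follower_mode = MemberState.ROLLBACK  # state of the data replication machinery

    def apply_reconfig(self, node_in_config):
        # A reconfig (for example via heartbeat) only touches config membership.
        self.in_config = node_in_config

    def set_follower_mode(self, mode):
        # Analogous to setFollowerMode(): updates the follower mode only,
        # with no invariant about the externally visible state.
        self.follower_mode = mode

    def member_state(self):
        # REMOVED masks the follower mode while the node is out of the config.
        return self.follower_mode if self.in_config else MemberState.REMOVED


node = TopologySketch()
node.apply_reconfig(node_in_config=False)       # removed while rolling back
node.set_follower_mode(MemberState.SECONDARY)   # rollback finishes; no crash
still_removed = node.member_state()             # MemberState.REMOVED
node.apply_reconfig(node_in_config=True)        # added back
after_readd = node.member_state()               # MemberState.SECONDARY
```

Because the two pieces of state are independent in this model, finishing rollback while removed needs no precondition on the visible state, and a node re-added before rollback completes simply reports ROLLBACK again.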
| Comment by Tess Avitabile (Inactive) [ 13/May/20 ] |
|
Great, thank you for clarifying! |
| Comment by Siyuan Zhou [ 13/May/20 ] |
|
The frequency doesn't change in 4.4. My first attempt of |
| Comment by Tess Avitabile (Inactive) [ 13/May/20 ] |
|
Why did the frequency change in 4.4? Is it just that we have more test coverage now, or is this more likely to happen in 4.4? I also don't follow which current behavior you kept. Could you explain that to me? |
| Comment by Siyuan Zhou [ 13/May/20 ] |
|
I think all versions are affected. Since it was super rare before 4.4, we probably don't have to backport to all versions. I don't think it's a 4.4 blocker either, since for now I have kept the current behavior to avoid concurrent reconfig and rollback. |
| Comment by Tess Avitabile (Inactive) [ 13/May/20 ] |
|
siyuan.zhou, do you know what versions are affected? |