[SERVER-48179] Removing rollback node will crash the node on transition out of rollback Created: 13/May/20  Updated: 29/Oct/23  Resolved: 17/Mar/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.9.0, 4.4.5

Type: Bug Priority: Major - P3
Reporter: Siyuan Zhou Assignee: Wenbin Zhu
Resolution: Fixed Votes: 0
Labels: safe-reconfig-related
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-48102 Update heartbeat state on primary eve... Closed
related to SERVER-49388 Complete TODO listed in SERVER-48178 Closed
is related to SERVER-48178 Finding self in reconfig may be inter... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Repl 2020-10-19, Repl 2020-11-16, Repl 2020-11-30, Repl 2020-12-14, Repl 2020-12-28, Repl 2021-03-08, Repl 2021-03-22
Participants:
Linked BF Score: 49

 Description   

At the end of rollback, the node transitions to SECONDARY, assuming its state is still ROLLBACK. However, a reconfig via heartbeat may have changed the state to REMOVED in the meantime, which breaks that assumption and crashes the node.
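The race described above can be illustrated with a toy model. This is a minimal sketch, not MongoDB code: `ToyNode`, `on_heartbeat_reconfig`, `finish_rollback`, and `rollback_survives` are hypothetical names, and the raised `RuntimeError` only stands in for the server-side invariant.

```python
from enum import Enum


class MemberState(Enum):
    SECONDARY = "SECONDARY"
    ROLLBACK = "ROLLBACK"
    REMOVED = "REMOVED"


class ToyNode:
    """Hypothetical single-state model of a replica set member."""

    def __init__(self):
        self.state = MemberState.ROLLBACK

    def on_heartbeat_reconfig(self, still_in_config):
        # A config learned via heartbeat that no longer lists this node
        # moves it to REMOVED, regardless of what rollback is doing.
        if not still_in_config:
            self.state = MemberState.REMOVED

    def finish_rollback(self):
        # Stand-in for the invariant: the node assumes nothing else
        # changed its state while rollback was running.
        if self.state is not MemberState.ROLLBACK:
            raise RuntimeError("expected ROLLBACK, found " + self.state.value)
        self.state = MemberState.SECONDARY


def rollback_survives(removed_during_rollback):
    """Return False when the end-of-rollback check fails (the crash case)."""
    node = ToyNode()
    node.on_heartbeat_reconfig(still_in_config=not removed_during_rollback)
    try:
        node.finish_rollback()
    except RuntimeError:
        return False
    return True
```

In this model, `rollback_survives(True)` is the failing interleaving: the removal lands before the rollback finishes, so the end-of-rollback check no longer holds.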



 Comments   
Comment by Githook User [ 18/Mar/21 ]

Author:

{'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}

Message: SERVER-48179 Allow transition to SECONDARY at the end of rollback even it was changed to REMOVED.

(cherry picked from commit 0ae1138bbfc066c4c7eb9f857cf4e29447743a3c)
Branch: v4.4
https://github.com/mongodb/mongo/commit/2cd5154b070f58191762979ad0014dfe78297b70

Comment by Githook User [ 16/Mar/21 ]

Author:

{'name': 'Wenbin Zhu', 'email': 'wenbin.zhu@mongodb.com', 'username': 'WenbinZhu'}

Message: SERVER-48179 Allow transition to SECONDARY at the end of rollback even it was changed to REMOVED.
Branch: master
https://github.com/mongodb/mongo/commit/0ae1138bbfc066c4c7eb9f857cf4e29447743a3c

Comment by Siyuan Zhou [ 08/Jul/20 ]

I investigated concurrent state transitions and reconfig in general. First, let's see what would happen in various states.

State                   Remove    Set vote = 0  Set priority = 0
Secondary / Recovering
Primary                 Stepdown  Stepdown      Stepdown
Candidate
Rollback

When a node is a secondary, it can accept all possible membership-change reconfigs: removal, vote change, and priority change. So stepdown should also accept all possible reconfigs, since a secondary has fewer constraints than a primary.

When a node is primary and a (force) reconfig makes it unelectable, it should accept the reconfig and step down.
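The primary row of the table can be read as a single rule: step down whenever the new config leaves this node unelectable. A minimal sketch, assuming "unelectable" means removed, non-voting, or priority 0 (the three reconfigs in the table); `MemberConfig`, `is_electable`, and `primary_must_step_down` are hypothetical names, not server APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MemberConfig:
    """Hypothetical per-member settings taken from the new config."""
    votes: int = 1
    priority: float = 1.0


def is_electable(member: Optional[MemberConfig]) -> bool:
    # None models a config that no longer contains this node (removal).
    return member is not None and member.votes > 0 and member.priority > 0


def primary_must_step_down(new_member: Optional[MemberConfig]) -> bool:
    # The reconfig itself is always accepted; the only question is
    # whether the primary has to step down afterwards.
    return not is_electable(new_member)
```

For example, `primary_must_step_down(MemberConfig(priority=0))` is true, while an unchanged `MemberConfig()` lets the primary stay primary.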

Candidate state is exclusive with reconfig. When a reconfig is in progress, an election fails. When an election is in progress, a reconfig will interrupt it to transition to steady-state secondary.

Rollback should be similar to the secondary state, since it doesn't rely on the config much. The only problematic case is concurrent removal and rollback. Removal should be accepted during rollback, so we should be able to remove the invariant. The topology coordinator stores the state of the config and of the data RSM separately, so setFollowerMode(MemberState::RS_SECONDARY) doesn't affect the removed state. When the removed node is added back, it will transition back to its previous state: secondary, or rollback if the rollback hasn't finished.
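The "stored separately" argument can be sketched as follows. This is a toy model under stated assumptions, not the topology coordinator itself: `ToyTopology`, `set_in_config`, `set_follower_mode`, and `member_state` are hypothetical names, and the visible state is simply derived from two independent fields.

```python
from enum import Enum


class FollowerMode(Enum):
    SECONDARY = "SECONDARY"
    ROLLBACK = "ROLLBACK"


class ToyTopology:
    """Hypothetical model: config membership and follower mode kept apart."""

    def __init__(self):
        self.in_config = True
        self.follower_mode = FollowerMode.ROLLBACK

    def set_in_config(self, in_config):
        # Driven by reconfig (for example, one learned via heartbeat).
        self.in_config = in_config

    def set_follower_mode(self, mode):
        # Driven by the data side (for example, rollback finishing).
        # No check on the current visible state, so removal cannot trip it.
        self.follower_mode = mode

    def member_state(self):
        # Removal masks the follower mode but does not overwrite it.
        return self.follower_mode.value if self.in_config else "REMOVED"
```

With this split, a node removed during rollback still reports `REMOVED` after `set_follower_mode(FollowerMode.SECONDARY)`, and reports `SECONDARY` once it is added back. If it is added back before the rollback finishes, it reports `ROLLBACK` again.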

The removed node will continue syncing from its sync source in its rollback state, which seems fine to me. We could also interrupt rollback on removal, but I don't think it's necessary.

To summarize, we should be able to remove the invariant and add tests.

Comment by Tess Avitabile (Inactive) [ 13/May/20 ]

Great, thank you for clarifying!

Comment by Siyuan Zhou [ 13/May/20 ]

The frequency didn't change in 4.4. My first attempt at SERVER-48102 made concurrent reconfig via heartbeat and rollback more likely, but I changed the approach to keep the current behavior, so the frequency of this bug is unchanged.

Comment by Tess Avitabile (Inactive) [ 13/May/20 ]

Why did the frequency change in 4.4? Is it just that we have more test coverage now, or is this more likely to happen in 4.4?

I don't follow which current behavior you kept. Could you explain that to me?

Comment by Siyuan Zhou [ 13/May/20 ]

I think all versions are affected. Since it was super rare before 4.4, we probably don’t have to backport to all versions. I don’t think it’s a 4.4 blocker either, since I have now kept the current behavior, which avoids concurrent reconfig and rollback.

Comment by Tess Avitabile (Inactive) [ 13/May/20 ]

siyuan.zhou, do you know what versions are affected?
