[SERVER-37574] Force reconfig should kill user operations Created: 11/Oct/18 Updated: 29/Oct/23 Resolved: 20/May/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.12 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Suganthi Mani |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Repl 2018-10-22, Repl 2019-03-11, Repl 2019-03-25, Repl 2019-04-08, Repl 2019-04-22, Repl 2019-05-06, Repl 2019-05-20, Repl 2019-06-03 |
| Participants: | |
| Linked BF Score: | 12 |
| Description |
|
Force reconfig from the command and from a heartbeat can lead to a stepdown and thus needs to kill user operations. |
| Comments |
| Comment by Githook User [ 20/May/19 ] |
|
Author: {'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'} Message: |
| Comment by Suganthi Mani [ 20/May/19 ] |
|
If the node has already stepped down through the reconfig code path, then step down via heartbeat should simply return. While reading the code, I noticed that before returning, it should also signal the waiters waiting on the step down event. https://github.com/mongodb/mongo/commit/8e6ad096f8a8b81e1be01d012920f52332650d6f did not address this, so I reverted the commit. |
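The early-return case described in this comment can be sketched with a toy model. This is only an illustration in Python, not the server's C++ code: `StepDownModel`, `step_down_finished`, and both method names are hypothetical, and a `threading.Event` stands in for the step down event that waiters block on.

```python
import threading


class StepDownModel:
    """Toy model of a node; every name here is hypothetical."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.is_primary = True
        # Stand-in for the step down event that other threads wait on.
        self.step_down_finished = threading.Event()

    def step_down_via_reconfig(self):
        # The reconfig path steps the node down first.
        with self._mutex:
            self.is_primary = False

    def step_down_via_heartbeat(self):
        with self._mutex:
            try:
                if not self.is_primary:
                    # Already stepped down elsewhere: nothing left to do.
                    return "already stepped down"
                self.is_primary = False
                return "stepped down"
            finally:
                # Signal waiters on every path, including the early return.
                self.step_down_finished.set()
```

Putting the signal in a `finally` block is one way to make sure no return path skips it; an explicit `set()` before each `return` would behave the same.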
| Comment by Githook User [ 20/May/19 ] |
|
Author: {'name': 'Suganthi Mani', 'email': 'suganthi.mani@mongodb.com', 'username': 'smani87'} Message: Revert " This reverts commit 8e6ad096f8a8b81e1be01d012920f52332650d6f. |
| Comment by Githook User [ 19/May/19 ] |
|
Author: {'email': 'suganthi.mani@mongodb.com', 'name': 'Suganthi Mani', 'username': 'smani87'} Message: |
| Comment by Suganthi Mani [ 07/May/19 ] |
|
This ticket should handle the 2 things below. 1) After acquiring the mutex, unconditional step down should check whether the member state is still primary and return if not. Otherwise, it can cause this invariant to fail.
2) Reconfig cmd and reconfig via heartbeat should take care of updating the term and resetting _pendingTermUpdateDuringStepDown (as we do in unconditional step down) if they cause a step down and _pendingTermUpdateDuringStepDown is set.
This is a 4.0 bug and the fix must be backported. EDIT: Problem #2 will be addressed by |
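The two points in this comment can be illustrated with a small model. This is a hedged Python sketch, not the replication coordinator's actual C++ logic: `ReplCoordinatorModel`, `MemberState`, and `pending_term_update` (a stand-in for `_pendingTermUpdateDuringStepDown`) are hypothetical names, and the way the pending term is applied is an assumption made for the example.

```python
import threading
from enum import Enum


class MemberState(Enum):
    PRIMARY = "PRIMARY"
    SECONDARY = "SECONDARY"


class ReplCoordinatorModel:
    """Toy model only; not the server's replication coordinator."""

    def __init__(self, term):
        self._mutex = threading.Lock()
        self.state = MemberState.PRIMARY
        self.term = term
        # Hypothetical stand-in for _pendingTermUpdateDuringStepDown.
        self.pending_term_update = None

    def _finish_step_down_locked(self):
        # Caller must hold the mutex.
        self.state = MemberState.SECONDARY
        if self.pending_term_update is not None:
            # Assumption: apply the pending term, then reset the marker.
            self.term = max(self.term, self.pending_term_update)
            self.pending_term_update = None

    def step_down_via_reconfig(self):
        # Point 2: the reconfig path also handles the pending term update.
        with self._mutex:
            if self.state is MemberState.PRIMARY:
                self._finish_step_down_locked()

    def unconditional_step_down(self):
        # Point 1: re-check the state after acquiring the mutex, because
        # another path may have stepped the node down in the meantime.
        with self._mutex:
            if self.state is not MemberState.PRIMARY:
                return False
            self._finish_step_down_locked()
            return True
```

In this model, a reconfig-driven step down that runs first leaves nothing for `unconditional_step_down` to do, so it returns `False` instead of acting on a node that is no longer primary.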
| Comment by Ravind Kumar (Inactive) [ 16/Apr/19 ] |
|
Note for docs - we should also update the following text for replSetReconfig
To be more specific. Via slack:
both of these cases result in retryable errors, and should be transparent to applications. Some of this information is not quite related to |
| Comment by Judah Schvimer [ 19/Mar/19 ] |
|
While I agree that we will not roll back w:majority writes that get acknowledged (since we close the connections), we could still write ordered operations out of order as described in |
| Comment by Suganthi Mani [ 18/Mar/19 ] |
|
judah.schvimer, in 4.0, we don't kill user operations on reconfig via command/heartbeat. But we still close connections on step down, so the result of the current request won't reach the client. I feel it's good to backport |
| Comment by Tess Avitabile (Inactive) [ 18/Mar/19 ] |
|
Got it. In that case, we should backport this work. |
| Comment by Judah Schvimer [ 18/Mar/19 ] |
|
I do think we need to backport it due to Alternatively we could check that the term hasn't changed since starting the operation in the OpObserver in logOp, and backport that which would maybe be easier. |
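The alternative mentioned here, checking the term at oplog-write time, could look roughly like the following. This is a simplified Python sketch under stated assumptions: `Node`, `Operation`, `log_op`, and `NotPrimaryError` are hypothetical names for illustration and do not reflect the server's OpObserver interface.

```python
class NotPrimaryError(Exception):
    """Hypothetical error raised when the term changed mid-operation."""


class Node:
    def __init__(self, term):
        self.term = term
        self.oplog = []


class Operation:
    def __init__(self, node):
        self.node = node
        # Remember the term in effect when the operation started.
        self.start_term = node.term


def log_op(op, entry):
    # Refuse to write the oplog entry if the term has moved on, which
    # would mean a step down or election happened during the operation.
    if op.node.term != op.start_term:
        raise NotPrimaryError(
            f"term changed from {op.start_term} to {op.node.term}"
        )
    op.node.oplog.append((op.start_term, entry))
```

The design trade-off in this sketch is that the operation is not killed eagerly; it simply fails at the point where it would otherwise write under a stale term.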
| Comment by Tess Avitabile (Inactive) [ 18/Mar/19 ] |
|
suganthi.mani, judah.schvimer, do you think we need to backport this change? I'm not certain we need to. On older versions, we do not need to kill operations on stepdown in order to avoid deadlock; it just may take us longer to acquire the global lock. Additionally, it's okay if we don't kill write operations, since they will check if the node can accept writes after their next lock acquisition. I would prefer not to backport this if we don't need to, since the code has changed so much in this area. |
| Comment by Suganthi Mani [ 14/Mar/19 ] |
|
Note: When this ticket is closed, need to file a DOC ticket to update https://docs.mongodb.com/manual/reference/method/rs.reconfig/#availability section
After PM-639, we no longer close all client connections on primary step down. |