[SERVER-46758] setFCV can be interrupted before an FCV change is majority committed and rollback the FCV without running the setFCV server logic Created: 10/Mar/20 Updated: 29/Oct/23 Resolved: 22/Apr/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Upgrade/Downgrade |
| Affects Version/s: | None |
| Fix Version/s: | 4.0.20, 4.2.8, 4.4.0-rc4, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dianna Hohensee (Inactive) | Assignee: | Jason Chan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||
| Backport Requested: | v4.4, v4.2, v4.0 | ||||||||||||||||||||||||
| Sprint: | Repl 2020-04-06, Repl 2020-04-20, Repl 2020-05-04 | ||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Linked BF Score: | 30 | ||||||||||||||||||||||||
| Description |
|
I believe this bug goes all the way back to the beginning of the setFCV framework, so the fix will need to be backported.

A setFCV command changes the FCV value twice: first to put the FCV into upgrading / downgrading, then to put it into fully upgraded / fully downgraded. For each of these FCV writes, we wait for majority confirmation before proceeding. However, setFCV can be interrupted while waiting for majority write concern (by InterruptedDueToReplStateChange, for example), and the FCV value can then be rolled back a step.

This manifested in test failures where the in-memory FCV value was found not to match the persisted FCV value: the persisted value had been rolled back, but the in-memory value was left unchanged by the rollback. Recover to a stable timestamp wipes out writes back to the checkpoint and then plays writes forward from the oplog up to the desired point, so the change in FCV value never even goes through the OpObserver.

I think it's okay if rollback moves the FCV from fully upgraded/downgraded to upgrading/downgrading, because the user can simply rerun setFCV in the right direction and the logic is idempotent. This scenario is the same as the server failing at any point in setFCV and setFCV being retried, which we know works. However, rolling back from upgrading/downgrading to fully downgraded/upgraded requires running the setFCV logic to make sure the rest of the server settings match the new FCV. And then I believe we must finish an upgrading/downgrading before we can move to downgrading/upgrading.

Config servers will be their own special problem because their setFCV logic involves setting the shard servers first or last (I forget). |
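The failure mode described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the server's implementation: `ToyNode`, `write_fcv`, `await_majority`, and the state strings are hypothetical names invented for this illustration, and the real rollback also depends on replication and election details that are not modeled here.

```python
from dataclasses import dataclass


class Interrupted(Exception):
    """Stands in for an error such as InterruptedDueToReplStateChange."""


@dataclass
class ToyNode:
    # Hypothetical, simplified view of one node's FCV state.
    persisted_fcv: str = "fully downgraded"    # value in the on-disk FCV document
    in_memory_fcv: str = "fully downgraded"    # cached value the server consults
    checkpoint_fcv: str = "fully downgraded"   # last majority-committed value

    def write_fcv(self, value: str) -> None:
        # A normal write passes through the op observer, which keeps the
        # in-memory value in sync with the persisted one.
        self.persisted_fcv = value
        self.in_memory_fcv = value

    def await_majority(self, interrupt: bool) -> None:
        if interrupt:
            raise Interrupted()
        # Once majority committed, the write is part of the stable checkpoint.
        self.checkpoint_fcv = self.persisted_fcv

    def set_fcv_upgrade(self, interrupt_first_wait: bool = False) -> None:
        self.write_fcv("upgrading")
        self.await_majority(interrupt_first_wait)
        # The server-side upgrade logic would run here.
        self.write_fcv("fully upgraded")
        self.await_majority(False)

    def recover_to_stable_timestamp(self) -> None:
        # Storage is reset to the checkpoint directly; no op observer runs,
        # so the in-memory value is left untouched.
        self.persisted_fcv = self.checkpoint_fcv


node = ToyNode()
try:
    node.set_fcv_upgrade(interrupt_first_wait=True)
except Interrupted:
    pass
node.recover_to_stable_timestamp()
print("persisted:", node.persisted_fcv, "| in-memory:", node.in_memory_fcv)
```

In this model the node ends with a persisted value of "fully downgraded" but an in-memory value of "upgrading", which is the kind of mismatch the failing tests observed.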
| Comments |
| Comment by Githook User [ 02/Jun/20 ] |
|
Author: Jason Chan (jasonjhchan) <jason.chan@10gen.com> Message: (cherry picked from commit aa527109a28bec0b6fe2763fce8a447ead0c02dd) |
| Comment by Githook User [ 01/Jun/20 ] |
|
Author: Jason Chan (jasonjhchan) <jason.chan@10gen.com> Message: (cherry picked from commit aa527109a28bec0b6fe2763fce8a447ead0c02dd) |
| Comment by Githook User [ 04/May/20 ] |
|
Author: Jason Chan (jasonjhchan) <jason.chan@10gen.com> Message: (cherry picked from commit aa527109a28bec0b6fe2763fce8a447ead0c02dd) |
| Comment by Githook User [ 22/Apr/20 ] |
|
Author: Jason Chan (jasonjhchan) <jason.chan@10gen.com> Message: |
| Comment by Jason Chan [ 17/Apr/20 ] |
|
I don't believe this bug exists in 3.6, because this behaviour looks specific to recover to a stable timestamp (RTT). In rollback via refetch, we end up refetching the FCV document and doing an update, which triggers onInsertOrUpdate and updates the in-memory FCV on storage commit. I think this needs to be backported as far back as v4.0, which is when we started using RTT. |
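The difference between the two rollback methods in the comment above can be sketched as follows. This is a hedged illustration only: `ToyFcvState` and its methods are hypothetical stand-ins, and `on_insert_or_update` merely mimics the role the comment attributes to onInsertOrUpdate.

```python
class ToyFcvState:
    """Hypothetical model of the persisted and in-memory FCV values."""

    def __init__(self, value: str) -> None:
        self.persisted = value
        self.in_memory = value

    def on_insert_or_update(self, value: str) -> None:
        # Stand-in for the observer hook: refresh the in-memory value.
        self.in_memory = value

    def rollback_via_refetch(self, sync_source_value: str) -> None:
        # The FCV document is refetched and applied as an ordinary update,
        # so the observer hook fires and both values stay in sync.
        self.persisted = sync_source_value
        self.on_insert_or_update(sync_source_value)

    def rollback_via_rtt(self, checkpoint_value: str) -> None:
        # Storage jumps back to the checkpoint without any observer call.
        self.persisted = checkpoint_value


refetch = ToyFcvState("upgrading")
refetch.rollback_via_refetch("fully downgraded")

rtt = ToyFcvState("upgrading")
rtt.rollback_via_rtt("fully downgraded")

print("refetch in sync:", refetch.persisted == refetch.in_memory)
print("RTT in sync:", rtt.persisted == rtt.in_memory)
```

Under these assumptions only the RTT path leaves the two values out of sync, which matches the reasoning for limiting the backport to versions that use RTT.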