[SERVER-31805] rollbackViaRefetchNoUUID fails if rollback occurs during upgrade
Created: 02/Nov/17  Updated: 30/Oct/23  Resolved: 14/Nov/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.0-rc5, 3.7.1

Type: Improvement
Priority: Major - P3
Reporter: Judah Schvimer
Assignee: Judah Schvimer
Resolution: Fixed
Votes: 0
Labels: bkp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-31881 Safe rollback does not properly remov... Closed
related to SERVER-31988 RollbackViaRefetch makes CollectionIm... Closed
is related to SERVER-31146 Rollback via refetch should only set ... Closed
is related to SERVER-30413 Add function to set options.temp when... Closed
is related to SERVER-31189 fassert if feature compatibility vers... Closed
is related to SERVER-31599 Only allow rollbackViaRefetchNoUUID m... Closed
is related to SERVER-31799 Run rollback fuzzer with upgrades Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.6
Sprint: Repl 2017-11-13, Repl 2017-12-04
Participants:
Linked BF Score: 0

 Description   

We allow rollbacks during upgrade since they run with the old rollback algorithm. rollbackViaRefetchNoUUID cannot resync collections with UUIDs (see SERVER-31599), so this failure should occur during upgrade as well (though we do not have a repro of this exact case).



 Comments   
Comment by Githook User [ 14/Nov/17 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-31805 rollbackViaRefetchNoUUID resyncs uuids correctly

(cherry picked from commit aa8b6f7657450d537cc14a77371dcd8742018a28)
Branch: v3.6
https://github.com/mongodb/mongo/commit/a31a62ebd947b338c101aaaac78f185bf9d4153e

Comment by Githook User [ 14/Nov/17 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-31805 rollbackViaRefetchNoUUID resyncs uuids correctly
Branch: master
https://github.com/mongodb/mongo/commit/aa8b6f7657450d537cc14a77371dcd8742018a28

Comment by Githook User [ 10/Nov/17 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-31805 set collection temp status correctly in rollbackViaRefetchNoUUID
Branch: master
https://github.com/mongodb/mongo/commit/743c2767692b98889d1ff8594ea1d75ee5a115db

Comment by Judah Schvimer [ 10/Nov/17 ]

To summarize a discussion with schwerin: the above plan does not solve the case of rolling back during a downgrade. In that case the rollback node will be rolling back a collMod that removes a UUID. The rollback node will not have a UUID and the sync source will, but there will be no collMod during RECOVERING to add the UUID back. To fix this, if a rollback node sees that the sync source has a UUID but it does not, it will check whether it is in the process of downgrading. If so, it will take the UUID from the sync source.

There are three ways the UUID can be mismatched:
1. The rollback node has a UUID and the sync source does not. Either the rollback node is in the middle of upgrade, and it can just remove its UUID, or the sync source is in the middle of downgrade, and the rollback node will crash in RECOVERING. This is handled by the original plan.
2. The rollback node does not have a UUID and the sync source does. Either the sync source is in upgrade and there will be a collMod during RECOVERING to add the UUID, or the rollback node is in downgrade (this case). Since we fail if the sync source downgrades during rollback, every collection can only have 1 UUID for the entirety of rollback. The rollback will be conducted by namespace, so assigning the remote UUID is safe during rollback. The ops in RECOVERING will either be applied by namespace or by UUID using the UUID we'll be assigning, so that should be safe as well.
3. The rollback node and the sync source have different UUIDs. This can only happen if the partition occurred mid-upgrade and the rollback node finished the upgrade and the sync source conducted its own separate upgrade. Simply removing the UUID is the safest way to get back to the common point, and we will definitely see a collMod during RECOVERING to fix it.
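
The three cases above can be sketched as a single decision function. This is only an illustration of the reasoning in this comment, not the server's implementation: the function name, its parameters, and the idea of passing the downgrade state as a boolean are all assumptions made for the example.

```python
from typing import Optional
from uuid import UUID


def resolve_collection_uuid(
    local_uuid: Optional[UUID],
    remote_uuid: Optional[UUID],
    local_is_downgrading: bool,
) -> Optional[UUID]:
    """Return the UUID the rollback node should keep for one namespace.

    Hypothetical helper: all names here are illustrative only.
    """
    if local_uuid == remote_uuid:
        # No mismatch (both absent, or both equal): nothing to fix.
        return local_uuid
    if local_uuid is not None and remote_uuid is None:
        # Case 1: only the rollback node has a UUID, so remove it.
        return None
    if local_uuid is None and remote_uuid is not None:
        # Case 2: only the sync source has a UUID. Take it if the rollback
        # node is downgrading; otherwise leave it unset and rely on a
        # collMod during RECOVERING to add it.
        return remote_uuid if local_is_downgrading else None
    # Case 3: both have UUIDs and they differ. Remove the local UUID and
    # rely on a collMod during RECOVERING to set the right one.
    return None
```

Resolving by namespace and returning a single value per collection reflects the point made in case 2 that a collection can have only one UUID for the entirety of rollback.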

Comment by Githook User [ 09/Nov/17 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-31805 Set local collection validation options correctly in rollbackViaRefetchNoUUID
Branch: master
https://github.com/mongodb/mongo/commit/54a3f40dcfbff57f46859ba0b88249e6008597ea

Comment by Githook User [ 08/Nov/17 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-31805 provide option to Cloner to preserve UUIDs
Branch: master
https://github.com/mongodb/mongo/commit/26a5d0070ba4a4b9f7e9a5b68f58a0fa75f9e6d3

Comment by Gregory McKeon (Inactive) [ 07/Nov/17 ]

judah.schvimer can this be brought into sprint?

Comment by Judah Schvimer [ 02/Nov/17 ]

To summarize our discussion, one proposed solution is to fix rollbackViaRefetchNoUUID in two ways:
1. when rollback needs to clone a collection, clone it with the UUID from the sync source.
2. when rollback needs to resync metadata, resync it by namespace, and if the UUIDs do not match, remove the local UUID since a collMod or rename operation must exist on the other branch of history to account for it.
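
A minimal sketch of this two-part proposal, using plain dictionaries keyed by namespace in place of real collection catalogs. The catalog layout (`"uuid"` and `"options"` keys) and both function names are assumptions made for illustration; they are not the server's data structures or APIs.

```python
from copy import deepcopy


def clone_collection(local_catalog: dict, remote_catalog: dict, ns: str) -> None:
    """Part 1: clone a collection, keeping the sync source's UUID."""
    # The remote entry is copied whole, so its UUID comes along with it.
    local_catalog[ns] = deepcopy(remote_catalog[ns])


def resync_metadata(local_catalog: dict, remote_catalog: dict, ns: str) -> None:
    """Part 2: resync metadata by namespace; drop the local UUID on mismatch."""
    local = local_catalog[ns]
    remote = remote_catalog[ns]
    # Options are looked up by namespace, never by UUID.
    local["options"] = deepcopy(remote["options"])
    if local.get("uuid") != remote.get("uuid"):
        # A collMod or rename on the other branch of history is expected
        # to account for the UUID later, so the local one is removed.
        local["uuid"] = None
```

For example, calling `resync_metadata` on a namespace whose local entry has `"uuid": "A"` while the sync source has `"uuid": "B"` leaves the local entry with the sync source's options and `"uuid": None`.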

Comment by Andy Schwerin [ 02/Nov/17 ]

We'll need to spend some time on a solution that doesn't require rollback to fail during fCV upgrade. Upgrade could run for a while, and some node is going to lose an election during one somewhere.

Comment by Judah Schvimer [ 02/Nov/17 ]

After discussion with william.schultz, this looks serious. It seems like we should fail rollback if we are in a targetVersion when we begin rollback, possibly if we roll back a change to the fCV document (though there will be no UUID changes in the rollback so it's probably safe), and also if the sync source does any operation on the fCV document (if the sync source is mid-upgrade/downgrade at the common point, then the rolling back node will be too and fail earlier). While this is potentially a coarser grained solution than is required, it is difficult to think through all of the different cases we could be in and we have little test coverage of it.
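
The coarse-grained check described in this comment could be sketched as a predicate like the one below. It only restates the conditions listed above; the function and parameter names are hypothetical, and the "possibly" condition is modeled as an opt-in flag because the comment leaves it undecided.

```python
def should_fail_rollback(
    local_has_target_version: bool,
    rolls_back_fcv_change: bool,
    sync_source_touched_fcv: bool,
    fail_on_local_fcv_change: bool = False,
) -> bool:
    """Return True if rollback should be refused under this proposal.

    Hypothetical helper: every name here is illustrative only.
    """
    if local_has_target_version:
        # The rollback node is mid-upgrade or mid-downgrade at the start.
        return True
    if sync_source_touched_fcv:
        # The sync source performed some operation on the fCV document.
        return True
    # Optional stricter rule: also fail when the rollback itself would
    # undo a change to the fCV document.
    return fail_on_local_fcv_change and rolls_back_fcv_change
```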

Comment by Judah Schvimer [ 02/Nov/17 ]

rollbackViaRefetchNoUUID also appears not to handle resyncing UUIDs when it resyncs collection metadata. I think this could lead to bugs where UUIDs are mismatched between nodes in a replica set.
If both nodes involved in the rollback upgrade on their own branches of history with different UUIDs, then we will not bring them back into sync properly.
