[SERVER-61262] 5.0/5.1 binary might receive tenant migration state document of 5.2 FCV format, leading to crash. Created: 04/Nov/21  Updated: 27/Oct/23  Resolved: 03/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Minor - P4
Reporter: Suganthi Mani Assignee: [DO NOT USE] Backlog - Server Serverless (Inactive)
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-57991 Architecture Guide updates for PM-2353 Closed
Assigned Teams:
Serverless
Participants:

 Description   

Here is the scenario I have in mind.
(Assume the recipient (R) replica set is currently running the 5.2 binary with FCV 5.2.)
1) R primary receives a recipientSyncData cmd with protocol 'Merge'.
2) The R POS (PrimaryOnlyService) instance starts and persists the initial state doc with the 'Merge' protocol.
3) R primary then receives a 'setFeatureCompatibilityVersion' cmd to downgrade to 5.0.
4) R primary enters the FCV downgrading state.
5) The FCV code signals all active tenant migrations to abort (but it doesn't wait for them to abort or for their state docs to be marked as garbage collectable).
6) R primary successfully downgrades to 5.0.
7) The R POS instance now receives the abort signal and aborts the current tenant migration before we persist the 'RecipientPrimaryStartingFCV' info in the state doc (and before the check that compares the D (donor) and R FCVs).
8) R primary steps down.
9) At this point, we have a recipient tenant migration state doc on disk that is not marked as garbage collectable. So, we consider the migration active, and it can resume on the new primary.

Since the replica set is already downgraded to 5.0, we are free to replace the recipient binaries from 5.2 with 5.0. If the new primary that steps up is running the 5.0 binary, a POS instance will be started on that 5.0 binary for a state doc written in the 5.2 format.
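
To make the failure mode concrete, here is a minimal Python sketch (not server code). It assumes the 5.0 state doc parser rejects fields it does not know about, which is what the 'protocol' field would be on a 5.0 binary. The field set, the `parse_state_doc_strict` and `is_active` helpers, and the doc contents are all hypothetical and only model the scenario above.

```python
# Hypothetical model of a 5.0 binary reading a 5.2-format state doc.
# The field names below are illustrative assumptions, not the real schema.
KNOWN_FIELDS_5_0 = {"_id", "tenantId", "state", "expireAt"}


class UnknownFieldError(Exception):
    """Raised when a strict parser meets a field it does not know."""


def parse_state_doc_strict(doc, known_fields):
    """Reject any document that carries a field outside known_fields."""
    unknown = set(doc) - known_fields
    if unknown:
        raise UnknownFieldError("unknown field(s): " + ", ".join(sorted(unknown)))
    return dict(doc)


def is_active(doc):
    """A doc without expireAt is not garbage collectable, so it counts as active."""
    return "expireAt" not in doc


# State doc persisted in step 2 by the 5.2 binary; never marked garbage collectable.
state_doc = {
    "_id": "migration-1",
    "tenantId": "tenantA",
    "state": "started",
    "protocol": "Merge",
}

if is_active(state_doc):
    try:
        # Step-up on the 5.0 binary tries to rebuild a POS instance from the doc.
        parse_state_doc_strict(state_doc, KNOWN_FIELDS_5_0)
    except UnknownFieldError as exc:
        print("5.0 binary cannot load the state doc:", exc)
```

In this model the doc is still considered active after the downgrade, so the step-up path attempts to parse it and fails on the field that only 5.2 understands.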



 Comments   
Comment by A. Jesse Jiryu Davis [ 17/Feb/22 ]

Maybe this has been fixed? If we released all state doc format changes in 5.2, and we start using Shard Merge in 5.3 or later in production, then this isn't a bug. We won't downgrade a Serverless shard to 5.0. We'll only downgrade to the last-continuous release. But let's make sure last-continuous has all the format changes.

Comment by A. Jesse Jiryu Davis [ 10/Nov/21 ]

Note, this is low priority and we can put it off until the end of Shard Merge if desired.

Comment by Suganthi Mani [ 08/Nov/21 ]

jesse done!

Proposed Solution:
1) Persist the RecipientPrimaryStartingFCV info during state doc initialization. On step up, the POS instance is started only if the current FCV and the on-disk state doc FCV match. (Backport the fix to 5.0 & 5.1.)
2) Mark the protocol field as ignore:true in the lower binary versions (5.1 & 5.0).
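
The two points above can be sketched in Python as follows. This is an illustrative model under stated assumptions, not the server implementation: `should_start_instance_on_step_up` and `parse_state_doc_lenient` are hypothetical helpers, and the lowercase `recipientPrimaryStartingFCV` key and the other field names are assumed for the example.

```python
def should_start_instance_on_step_up(state_doc, current_fcv):
    """Point 1: resume a migration only if the FCV recorded at state doc
    initialization matches the FCV the node is running with now."""
    if "expireAt" in state_doc:
        # Already garbage collectable; nothing to resume.
        return False
    return state_doc.get("recipientPrimaryStartingFCV") == current_fcv


def parse_state_doc_lenient(doc, known_fields, ignored_fields):
    """Point 2: tolerate fields listed in ignored_fields instead of failing,
    while still rejecting anything else that is unknown."""
    unknown = set(doc) - known_fields - ignored_fields
    if unknown:
        raise ValueError("unknown field(s): " + ", ".join(sorted(unknown)))
    # Ignored fields are dropped, not interpreted.
    return {key: value for key, value in doc.items() if key in known_fields}


# Hypothetical doc written by a 5.2 binary at FCV 5.2.
doc = {
    "_id": "migration-1",
    "state": "started",
    "protocol": "Merge",
    "recipientPrimaryStartingFCV": "5.2",
}

print(should_start_instance_on_step_up(doc, "5.0"))  # FCV mismatch after downgrade
print(parse_state_doc_lenient(
    doc,
    known_fields={"_id", "state", "recipientPrimaryStartingFCV"},
    ignored_fields={"protocol"},
))
```

What should happen to a doc that is skipped because of an FCV mismatch is not specified in this ticket, so the sketch only models the decision itself.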

Abort each active migration and wait for it to be marked as garbage collectable (with expireAt=now) before the FCV is set to the "downgrading" state. This fix needs to be done on both the donor and recipient sides.
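
The ordering described here (abort, wait, then change the FCV) can be modeled with a small Python sketch. Everything in it is hypothetical: the `Migration` class, the `begin_downgrade` function, the "aborted" state name and the timeout are assumptions made for illustration, with threads standing in for POS instances.

```python
import threading
from datetime import datetime, timezone


class Migration:
    """Hypothetical stand-in for a tenant migration POS instance."""

    def __init__(self, migration_id):
        self.doc = {"_id": migration_id, "state": "started"}
        self.abort_requested = threading.Event()
        self.garbage_collectable = threading.Event()

    def run(self):
        # Block until the FCV code asks for an abort, then mark the doc.
        self.abort_requested.wait()
        self.doc["state"] = "aborted"
        self.doc["expireAt"] = datetime.now(timezone.utc)
        self.garbage_collectable.set()


def begin_downgrade(server, migrations, timeout=5.0):
    """Signal every migration to abort and wait for each state doc to become
    garbage collectable before moving the FCV to 'downgrading'."""
    for migration in migrations:
        migration.abort_requested.set()
    for migration in migrations:
        if not migration.garbage_collectable.wait(timeout):
            raise TimeoutError("migration %s did not abort" % migration.doc["_id"])
    server["fcv"] = "downgrading"


migration = Migration("migration-1")
worker = threading.Thread(target=migration.run)
worker.start()

server = {"fcv": "5.2"}
begin_downgrade(server, [migration])
worker.join()
print(server["fcv"], "expireAt" in migration.doc)
```

The point of the ordering is that no active (non-garbage-collectable) state doc can exist once the FCV leaves 5.2, so a later step-up on a lower binary has nothing to resume.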
