[SERVER-36119] addShard should fail if the added shard's FCV is higher than that of the cluster Created: 13/Jul/18 Updated: 26/Oct/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Sharding, Upgrade/Downgrade |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tess Avitabile (Inactive) | Assignee: | Backlog - Catalog and Routing |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | ShardingRoughEdges, oldshardingemea | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Assigned Teams: |
Catalog and Routing
|
||||
| Operating System: | ALL | ||||
| Participants: | |||||
| Linked BF Score: | 26 | ||||
| Description |
|
In the addShard command, we run setFeatureCompatibilityVersion on the replica set to ensure it has the same featureCompatibilityVersion as the config server. Once this succeeds, we add the shard to config.shards. However, setFeatureCompatibilityVersion only requires that the update to admin.system.version reach a majority of nodes in order to return success. If there are any lower-version mongoses in the cluster, then when they observe the new shard, they will connect to it and crash if they encounter a node with a higher feature compatibility version. We should make the setFeatureCompatibilityVersion command use a w:all writeConcern, so that it waits for the update to reach all members of the new shard (in addition to the w:majority wait that ensures the update is committed). |
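The gap between a majority acknowledgement and an acknowledgement from every member can be illustrated with a small model. This is only a sketch, not server code: the three-node replica set, the ordering in which nodes apply the update, and all function names are assumptions made for illustration.

```python
# Illustrative model only; none of this is MongoDB server code.
# Each node of the new shard is represented by the FCV string it currently holds.

def parse_fcv(fcv):
    """Turn an FCV string such as "3.6" into a comparable tuple (3, 6)."""
    return tuple(int(part) for part in fcv.split("."))

def majority(node_count):
    """Smallest number of nodes that forms a majority."""
    return node_count // 2 + 1

def apply_fcv_update(node_fcvs, target_fcv, acked):
    """Return the per-node FCVs once the first `acked` nodes have applied the update."""
    return [target_fcv if i < acked else fcv for i, fcv in enumerate(node_fcvs)]

def nodes_above_cluster_fcv(node_fcvs, cluster_fcv):
    """Indexes of nodes whose FCV is still higher than the cluster's."""
    limit = parse_fcv(cluster_fcv)
    return [i for i, fcv in enumerate(node_fcvs) if parse_fcv(fcv) > limit]

# Hypothetical three-node shard at FCV 4.0 being added to an FCV 3.6 cluster.
new_shard = ["4.0", "4.0", "4.0"]
after_majority = apply_fcv_update(new_shard, "3.6", majority(len(new_shard)))
after_all = apply_fcv_update(new_shard, "3.6", len(new_shard))

print(nodes_above_cluster_fcv(after_majority, "3.6"))  # [2]: one node can still crash a 3.6 mongos
print(nodes_above_cluster_fcv(after_all, "3.6"))       # []: no node is above the cluster FCV
```

In this model, returning success after the majority wait leaves one member at FCV 4.0, which is the window in which a lower-version mongos could encounter it.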
| Comments |
| Comment by Githook User [ 02/Aug/18 ] |
|
Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>
Message: SERVER-36119 Explicitly downgrade new shard's FCV in the mixed version convert_to_and_from_sharded.js |
| Comment by Tess Avitabile (Inactive) [ 01/Aug/18 ] |
|
Yes |
| Comment by Kaloian Manassiev [ 01/Aug/18 ] |
|
I agree that we can at least fix the test for now by explicitly setting the FCV on the replica set being added to be the 'last-stable' FCV. tess.avitabile, this is what you had in mind, right? |
| Comment by Tess Avitabile (Inactive) [ 01/Aug/18 ] |
|
We could fix the test by setting FCV on the replica set to the downgrade version before adding it to the cluster. If we implement the solution schwerin suggests, we would need to make that change to the test anyway. |
| Comment by Ian Whalen (Inactive) [ 01/Aug/18 ] |
|
greg.mckeon kaloian.manassiev: can you please consider pulling this forward? convert_to_and_from_sharded.js is just a total mess right now: |
| Comment by Andy Schwerin [ 24/Jul/18 ] |
|
Ah. OK. To summarize an offline conversation, if a replica set started with --shardsvr w/ 4.0 binaries wasn't previously used as a standalone replica set, it will report fcv 3.6. As such, this problem only occurs when trying to add a shard that contains some user data already. I think in that case, we should not downgrade the fcv on the target shard automatically, but instead refuse to add the shard if its fcv is higher than the cluster's fcv. If a user wants to add an fcv 4.0 shard to an fcv 3.6 cluster, they should first need to lower the fcv on that shard to 3.6 and remove any 4.0-specific data and indexes. |
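The refusal proposed above amounts to a version comparison made before the shard is added. A minimal sketch follows, assuming FCVs are dotted numeric strings; the exception type and function names are hypothetical and do not correspond to anything in the server.

```python
# Illustrative sketch only; names are hypothetical, not MongoDB server code.

class ShardFcvTooHigh(Exception):
    """Raised when a candidate shard reports a higher FCV than the cluster."""

def parse_fcv(fcv):
    """Turn an FCV string such as "3.6" into a comparable tuple (3, 6)."""
    return tuple(int(part) for part in fcv.split("."))

def check_can_add_shard(shard_fcv, cluster_fcv):
    """Refuse the shard instead of downgrading its FCV automatically."""
    if parse_fcv(shard_fcv) > parse_fcv(cluster_fcv):
        raise ShardFcvTooHigh(
            f"shard FCV {shard_fcv} is higher than cluster FCV {cluster_fcv}; "
            f"lower the shard's FCV to {cluster_fcv} before adding it"
        )

# Usage: a shard at FCV 4.0 is rejected by an FCV 3.6 cluster.
try:
    check_can_add_shard("4.0", "3.6")
except ShardFcvTooHigh as err:
    print(f"addShard refused: {err}")
```

Comparing parsed tuples rather than raw strings keeps the check correct if a version component ever reaches two digits.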
| Comment by Tess Avitabile (Inactive) [ 23/Jul/18 ] |
|
Yes, we have always let you do this. |
| Comment by Andy Schwerin [ 23/Jul/18 ] |
|
Oh, I'm surprised we let you add shards in fCV 4.0 to a fCV 3.6 cluster. I'm still hesitant to require a successful "writeConcern: all" write to addShard, though if we have to allow fCV 4.0 shards to be added to fCV 3.6 replica sets, we may have no choice. |
| Comment by Tess Avitabile (Inactive) [ 23/Jul/18 ] |
|
This is to address the case where the cluster has lower-version FCV and a lower binary version mongos. To be concrete, let's say the mongods all have binary version 4.0 and FCV 3.6, and the mongoses have binary version 3.6. If we add a shard that has FCV 4.0, the config server sends {setFeatureCompatibilityVersion: "3.6"} as part of the addShard command. This will succeed as soon as it reaches a majority of the set. But if there is still a node in the set with FCV 4.0, it will cause the 3.6 mongoses in the cluster to crash. This seems like poor behavior: the addShard succeeds, but the mongoses in the cluster can crash. I think it would be better to wait for the FCV to reach all nodes in the set before successfully adding the shard. |
| Comment by Kaloian Manassiev [ 16/Jul/18 ] |
|
If the config server has the newer FCV this means that all the existing shards should be at the newer FCV already, doesn't it? In which case it would have been just a matter of time before the old mongos instances crash anyways. |
| Comment by Andy Schwerin [ 14/Jul/18 ] |
|
But if one of those nodes is a low version, it's still going to crash? Why is addShard special in this regard? |