[SERVER-31631] Bump minimum outgoing wire version for mongod when featureCompatibilityVersion is 3.6 Created: 18/Oct/17  Updated: 08/Jan/24  Resolved: 09/Nov/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.6.0-rc4

Type: Task
Priority: Major - P3
Reporter: Tess Avitabile (Inactive)
Assignee: Tess Avitabile (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-30561 Make migration fail if source is fcv ... Closed
Backwards Compatibility: Fully Compatible
Sprint: Query 2017-11-13
Participants:

 Description   

A mongod with featureCompatibilityVersion 3.6 should have minimum outgoing wire version equal to its maximum outgoing wire version. This means that when it receives an isMaster response, it will close the connection if the response is not from a 3.6 (or higher) node.

A 3.6 mongod also needs to close connections to 3.4 mongods that it initiated when it goes into upgrading state.

This will cause heartbeats to 3.4 replica set members to fail, so that the 3.4 member is not considered healthy. This will also prevent FCV=3.6 shards from establishing connections with 3.4 shards.
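The connection check described above can be sketched as follows. This is a minimal illustration, not the server's implementation: `WireRange` and `should_close_connection` are hypothetical names, and the numeric values 5 and 6 for the 3.4 and 3.6 wire versions are assumptions based on the enum names mentioned in the comments on this ticket. The `minWireVersion` and `maxWireVersion` fields are the ones an isMaster response advertises.

```python
from dataclasses import dataclass

# Assumed numeric values for the wire versions named in this ticket.
WIRE_VERSION_3_4 = 5  # COMMANDS_ACCEPT_WRITE_CONCERN
WIRE_VERSION_3_6 = 6  # SUPPORTS_OP_MSG


@dataclass(frozen=True)
class WireRange:
    """Hypothetical holder for a node's outgoing wire version range."""
    min_version: int
    max_version: int


def should_close_connection(outgoing: WireRange, is_master_reply: dict) -> bool:
    """Return True when the remote's advertised range does not overlap ours."""
    remote_min = is_master_reply.get("minWireVersion", 0)
    remote_max = is_master_reply.get("maxWireVersion", 0)
    return remote_max < outgoing.min_version or remote_min > outgoing.max_version


# With FCV 3.6, minimum outgoing equals maximum outgoing.
fcv_36 = WireRange(WIRE_VERSION_3_6, WIRE_VERSION_3_6)
# With FCV 3.4, a 3.6 binary still accepts 3.4 peers.
fcv_34 = WireRange(WIRE_VERSION_3_4, WIRE_VERSION_3_6)

reply_from_34 = {"minWireVersion": 0, "maxWireVersion": WIRE_VERSION_3_4}
reply_from_36 = {"minWireVersion": 0, "maxWireVersion": WIRE_VERSION_3_6}
```

Under these assumptions, `should_close_connection(fcv_36, reply_from_34)` is true, while the same reply is accepted when the outgoing range is `fcv_34`.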



 Comments   
Comment by Githook User [ 09/Nov/17 ]

Author: Tess Avitabile (tessavitabile) <tess.avitabile@mongodb.com>

Message: SERVER-31631 Bump minimum outgoing wire version for mongod when featureCompatibilityVersion is 3.6
Branch: master
https://github.com/mongodb/mongo/commit/271879b7a67c9d9b36778692f2a77e04c6403a1f

Comment by Tess Avitabile (Inactive) [ 26/Oct/17 ]

Per in-person discussion, we will add an extra field to replSetHeartbeat, and we will not attempt to close outgoing connections to downgrade shards on FCV bump. In 3.8, it would be desirable to be able to tag and close outgoing connections to downgrade nodes on FCV bump.
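The heartbeat approach relies on 3.4 nodes rejecting heartbeats that contain fields they do not recognize. A rough sketch of that idea follows. The field lists are an illustrative subset rather than the real replSetHeartbeat schema, and `exampleNewField` is a hypothetical placeholder, since this ticket does not name the field that was added.

```python
# Illustrative subset of heartbeat fields; NOT the real 3.4 schema.
KNOWN_FIELDS_3_4 = frozenset({"replSetHeartbeat", "configVersion", "from", "fromId", "term"})

# Hypothetical name for the extra field; the ticket does not say what it is called.
EXTRA_FIELD = "exampleNewField"
KNOWN_FIELDS_3_6 = KNOWN_FIELDS_3_4 | {EXTRA_FIELD}


class UnexpectedFieldError(ValueError):
    """Raised when a strict parser sees a field it does not know."""


def parse_heartbeat_strict(request: dict, known_fields: frozenset) -> dict:
    """Reject any request carrying a field outside known_fields."""
    unexpected = sorted(set(request) - known_fields)
    if unexpected:
        raise UnexpectedFieldError("unexpected field(s): " + ", ".join(unexpected))
    return dict(request)


def heartbeat_succeeds(request: dict, known_fields: frozenset) -> bool:
    """Convenience wrapper: True if the receiver would accept the heartbeat."""
    try:
        parse_heartbeat_strict(request, known_fields)
    except UnexpectedFieldError:
        return False
    return True


# A heartbeat as an FCV 3.6 node might send it under this scheme.
heartbeat = {"replSetHeartbeat": "rs0", "configVersion": 1, "from": "a:27017",
             "fromId": 0, "term": 3, EXTRA_FIELD: 1}
```

In this sketch a receiver using `KNOWN_FIELDS_3_4` fails the heartbeat, so the sender stops treating it as healthy, while a receiver using `KNOWN_FIELDS_3_6` accepts it.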

Comment by Spencer Brody (Inactive) [ 25/Oct/17 ]

Bummer.
It does look like heartbeats in 3.4 validate that there are no unexpected fields, so that approach would work for heartbeats. I'm still a bit concerned that there'll be another case we're not thinking of, but not overwhelmingly so.

The alternative would be to just close all outgoing connections on FCV change, which would have an impact on any ongoing cross-mongod operations (chunk migration, mapReduce/agg maybe?), as well as adding a cost to re-establish all the pooled connections, but would be definitively safer, and wouldn't require us to add some random meaningless field to heartbeats that we'd need to figure out what to do with in 3.8.

I think tagging the outgoing connections with wireVersion is probably the best way, and also the most future-proof, but I don't know how hard it would be to implement.

mira.carey@mongodb.com
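To make the tagging idea concrete, here is a hypothetical sketch of a pool that records each outgoing connection's remote wire version and closes the ones that fall below a new minimum. None of these names exist in the server; `TaggedConnectionPool`, `OutgoingConnection`, and `close_below` are invented for illustration, and a real implementation would also have to deal with in-flight operations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OutgoingConnection:
    """Hypothetical outgoing connection tagged with the remote's max wire version."""
    host: str
    remote_max_wire_version: int
    is_open: bool = True

    def close(self) -> None:
        self.is_open = False


class TaggedConnectionPool:
    """Hypothetical pool that can close connections by wire version tag."""

    def __init__(self) -> None:
        self._connections: List[OutgoingConnection] = []

    def add(self, host: str, remote_max_wire_version: int) -> OutgoingConnection:
        conn = OutgoingConnection(host, remote_max_wire_version)
        self._connections.append(conn)
        return conn

    def close_below(self, min_wire_version: int) -> List[str]:
        """Close open connections whose remote cannot speak min_wire_version."""
        closed = []
        for conn in self._connections:
            if conn.is_open and conn.remote_max_wire_version < min_wire_version:
                conn.close()
                closed.append(conn.host)
        return closed

    def open_hosts(self) -> List[str]:
        return [c.host for c in self._connections if c.is_open]


pool = TaggedConnectionPool()
pool.add("shard-a:27017", 5)  # assumed wire version of a 3.4 node
pool.add("shard-b:27017", 6)  # assumed wire version of a 3.6 node
closed_on_fcv_bump = pool.close_below(6)
```

Compared with closing every outgoing connection on FCV change, this would only drop the connections to older nodes and leave the rest of the pool intact.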

Comment by Esha Maharishi (Inactive) [ 25/Oct/17 ]

For inter-shard, I don't see a lot of risk beyond existing bugs in 3.6 (e.g., drop/recreate + migrations bugs), since the only persisted state transferred across shards is indexes, collection options, and UUIDs on migrations. For this, if the recipient shard is v3.4, it will at least correctly propagate the UUID if/when it is upgraded. If the donor shard is v3.4, an fcv>3.4 recipient shard will fail the migration if it's receiving its first chunk for the collection and the donor doesn't return a UUID.

I agree that adding a feature in the networking layer to tag connections with the client's and server's binary version and FCV is an interesting idea.

Comment by Tess Avitabile (Inactive) [ 25/Oct/17 ]

We do not have an easy way to tag outgoing connections to be closed when the FCV is bumped, like we do for incoming connections. We can use the outgoing minWireVersion to prevent new connections to older-version nodes, but there is no good way to close existing connections.

spencer: This means that an FCV 3.6 primary will still incorrectly think that a 3.4 secondary is healthy, so we need another way to prevent that. I think we had a backup plan of adding a field to the replSetHeartbeat command that 3.4 nodes wouldn't recognize. What do you think about that plan?

esha.maharishi, schwerin: This means we do not have a way to stop inter-shard communications that are in progress when the FCV is bumped. I still think we should use the outgoing minWireVersion to prevent new inter-shard connections, but I do not have an idea for handling existing connections. Do you have an idea for how to do this, or can you assess the risk? This seems less risky than the replica set case, since inter-shard connections are transient.

Comment by Tess Avitabile (Inactive) [ 18/Oct/17 ]

Yes, that sounds correct, esha.maharishi.

No, I do not think we need to backport this work. We do not have the problem in a mixed 3.4/3.2 replica set that the FCV 3.4 primary can think the 3.2 secondary is healthy. And I do not know of a danger in FCV 3.4 shards communicating with 3.2 shards.

Comment by Tess Avitabile (Inactive) [ 18/Oct/17 ]

A 3.6 mongod also needs to close connections to 3.4 mongods that it initiated when it goes into upgrading state.

Comment by Esha Maharishi (Inactive) [ 18/Oct/17 ]

Cool!

So as part of this, we should do the following for 3.6?

  • initialize the minimum outgoing version to LATEST_WIRE_VERSION - 1, which on v3.6 nodes is the 3.4-equivalent (COMMANDS_ACCEPT_WRITE_CONCERN)
  • on seeing the FCV document enter the upgrading or fully upgraded state, set the minimum outgoing version to LATEST_WIRE_VERSION, which on v3.6 nodes is the 3.6-equivalent (SUPPORTS_OP_MSG)
  • on seeing the FCV document enter the fully downgraded state, reset the minimum outgoing version to LATEST_WIRE_VERSION - 1
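The three steps above amount to a small state mapping, sketched below. The class and enum names are hypothetical, and the numeric value 6 for LATEST_WIRE_VERSION on v3.6 is an assumption. The downgrading state is left out because the list does not cover it.

```python
from enum import Enum

LATEST_WIRE_VERSION = 6  # SUPPORTS_OP_MSG on v3.6 (numeric value assumed)


class FcvState(Enum):
    """Hypothetical names for the FCV document states listed above."""
    FULLY_DOWNGRADED = "fully downgraded"
    UPGRADING = "upgrading"
    FULLY_UPGRADED = "fully upgraded"


class OutgoingWireSpec:
    """Hypothetical tracker for the minimum outgoing wire version."""

    def __init__(self) -> None:
        # Step 1: start at the 3.4-equivalent version.
        self.min_outgoing = LATEST_WIRE_VERSION - 1

    def on_fcv_change(self, state: FcvState) -> int:
        if state in (FcvState.UPGRADING, FcvState.FULLY_UPGRADED):
            # Step 2: only talk to 3.6-equivalent nodes.
            self.min_outgoing = LATEST_WIRE_VERSION
        elif state is FcvState.FULLY_DOWNGRADED:
            # Step 3: allow 3.4-equivalent nodes again.
            self.min_outgoing = LATEST_WIRE_VERSION - 1
        return self.min_outgoing


spec = OutgoingWireSpec()
initial = spec.min_outgoing
after_upgrading = spec.on_fcv_change(FcvState.UPGRADING)
after_upgraded = spec.on_fcv_change(FcvState.FULLY_UPGRADED)
after_downgraded = spec.on_fcv_change(FcvState.FULLY_DOWNGRADED)
```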

Also, should we backport this behavior to 3.4? Unlike for v3.6, in v3.4, we will only be able to bump and reset the minimum wire version on upgrade end, rather than on upgrade start.

Comment by Tess Avitabile (Inactive) [ 18/Oct/17 ]

In fact, this should probably also be done when the mongod has a targetVersion, to prevent connections with 3.4 nodes during upgrade or downgrade.

Generated at Thu Feb 08 04:27:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.