[SERVER-32633] Provide a framework to ensure a last-stable binary version node in a latest featureCompatibilityVersion replica set detects it has become isolated from a majority for all last stable binary versions >= 3.8 and crashes upon this detection Created: 10/Jan/18  Updated: 16/Jan/18  Resolved: 16/Jan/18

Status: Closed
Project: Core Server
Component/s: Upgrade/Downgrade
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Maria van Keulen Assignee: Maria van Keulen
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-29428 Make 3.4 mongod fail gracefully in fe... Closed
Participants:

 Description   

As of SERVER-29350, a last-stable binary version secondary in a replica set cannot replicate from a latest binary version primary once the primary's featureCompatibilityVersion has been set to the latest version. Instead, its attempts to send heartbeats to the primary will fail with IncompatibleServerVersion errors. Rather than continuously sending these failing heartbeats, a last-stable binary version secondary in this state should gracefully fail. This framework should be available for all last-stable binary versions >= 3.8.
A proposed implementation to detect this scenario and crash upon detection follows:

  • If a node detects IncompatibleServerVersion errors upon sending heartbeats to a majority of the nodes in a replica set, it should crash. It is necessary to detect these errors from a majority of the nodes to make sure node availability is not hindered in the event that there are multiple last-stable binary version nodes.
  • This detection will be performed by the replication coordinator. The replication coordinator will conclude that a node has become isolated when the heartbeats the node sends fail with IncompatibleServerVersion errors from a majority. This high-level implementation is justifiable because we only want a last-stable binary version mongod to crash in the event that it is part of a latest featureCompatibilityVersion replica set.
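The majority check described in the two bullets above can be sketched roughly as follows. This is an illustration only, not the replication coordinator's code: the names HeartbeatStatus and is_isolated_by_version are hypothetical, and the sketch assumes the majority is taken over all configured members, including the node itself, with every member counted equally.

```python
from enum import Enum, auto


class HeartbeatStatus(Enum):
    """Hypothetical outcome of the most recent heartbeat sent to one remote member."""
    OK = auto()
    INCOMPATIBLE_SERVER_VERSION = auto()
    UNREACHABLE = auto()


def is_isolated_by_version(heartbeat_statuses, member_count):
    """Return True when a strict majority of the set rejected our heartbeats.

    heartbeat_statuses: dict mapping remote member id -> latest HeartbeatStatus.
    member_count: total configured members, including this node (an assumption
    of this sketch; the ticket does not specify which members are counted).
    """
    incompatible = sum(
        1
        for status in heartbeat_statuses.values()
        if status is HeartbeatStatus.INCOMPATIBLE_SERVER_VERSION
    )
    # Strict majority: more than half of all configured members.
    return incompatible > member_count // 2


# Five-member set: this node and member "e" run the last-stable binary,
# members "b", "c" and "d" run the latest binary and reject our heartbeats.
five_member_view = {
    "b": HeartbeatStatus.INCOMPATIBLE_SERVER_VERSION,
    "c": HeartbeatStatus.INCOMPATIBLE_SERVER_VERSION,
    "d": HeartbeatStatus.INCOMPATIBLE_SERVER_VERSION,
    "e": HeartbeatStatus.OK,
}

# Three-member set: only member "b" rejects our heartbeats, so there is no majority.
three_member_view = {
    "b": HeartbeatStatus.INCOMPATIBLE_SERVER_VERSION,
    "c": HeartbeatStatus.OK,
}
```

Under these assumptions, the five-member view counts as isolated (three rejections out of five members), while the three-member view does not (one rejection out of three), which matches the intent of keeping multiple last-stable nodes available. Unreachable members are deliberately not counted as rejections in this sketch.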


 Comments   
Comment by Andy Schwerin [ 10/Jan/18 ]

The proposed implementation is error-prone. You'll have to decide whether to count hidden nodes, arbiters, etc. Any mistake could crash N-1 nodes in the set.

I usually avoid crashing if waiting might change my situation. mongo, for example, doesn't crash if it cannot reach a config server, because that saves operators from having to start nodes in a certain order.

It seems risky, it seems like it needs a lot of testing, and so I'd like to know if we have to.

Comment by Maria van Keulen [ 10/Jan/18 ]

schwerin I believe it is a usability improvement to cause an isolated node to crash rather than letting it continuously spin on the IncompatibleServerVersion error, since crashing the node makes the error more evident and the user would have to shut down the node anyway to swap out the binary. What sorts of unintended consequences did you have in mind?

Comment by Andy Schwerin [ 10/Jan/18 ]

I think this could have unintended consequences. Is it really necessary?
