[SERVER-60344] Action plan on lagging setFCV replicas breaking tests Created: 30/Sep/21  Updated: 29/Oct/23  Resolved: 13/Oct/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.0.28

Type: Bug Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Andrew Shuvalov (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Problem/Incident
is caused by SERVER-44152 Pre-warm connection pools in mongos Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2021-10-04, Sharding 2021-10-18
Participants:
Linked BF Score: 145

 Description   

This ticket is to discuss and decide on the proper actions for the test breakages investigated by cheahuychou.mao:

  1. The setFeatureCompatibilityVersion() command replicates with majority write concern. This means that when the command succeeds, a minority of replicas may still lag behind on the desired version.
  2. When mongos detects an incompatible version, it check-fails (crashes with an fassert).
  3. The mongos connection pre-warm feature amplifies this race: mongos now pre-warms connections at startup, including to non-primaries. A replica that still lags behind on setFCV() causes mongos to crash.
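The race in the three points above can be sketched as follows. This is a minimal, self-contained illustration in Python; all names (`set_fcv`, `prewarm`, the node dicts) are hypothetical and none of this is actual server code:

```python
# Sketch of the race: a majority-acknowledged FCV change can leave a
# lagging minority, and a pre-warm pass that hard-fails on any
# incompatible host will then crash even though the command "succeeded".

def set_fcv(nodes, new_version, majority):
    """Apply the new FCV and return as soon as a majority has replicated it."""
    acked = 0
    for node in nodes:
        if node["replicating"]:           # a lagging node does not apply yet
            node["fcv"] = new_version
            acked += 1
        if acked >= majority:
            return True                   # command succeeds here...
    return acked >= majority

def prewarm(nodes, required_version):
    """Mongos-style pre-warm: touch every host, fail hard on any mismatch."""
    for node in nodes:
        if node["fcv"] != required_version:
            raise RuntimeError("incompatible version on " + node["name"])

nodes = [
    {"name": "rs0", "fcv": "4.0", "replicating": True},
    {"name": "rs1", "fcv": "4.0", "replicating": True},
    {"name": "rs2", "fcv": "4.0", "replicating": False},  # lagging secondary
]
set_fcv(nodes, "4.2", majority=2)   # succeeds: 2 of 3 nodes acknowledged
try:
    prewarm(nodes, "4.2")           # ...but pre-warm still sees the laggard
    crashed = False
except RuntimeError:
    crashed = True
```

With a strict "every host must match" check, the crash is guaranteed whenever pre-warm reaches the laggard before it catches up.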

There are several questions we need to resolve before addressing this.

Is setFCV()'s majority commit correct, or do we want to replicate to all replicas first?

  • My opinion: it is correct. What an SRE wants from the system is that changing the FCV is fast and durable. An upgrade or downgrade is usually not a good time for an SRE to reflect on complications such as setFCV not returning or timing out; what the SRE wants is to get the confirmation and move forward with the next actions.

Is it correct that mongos pre-warms connections to non-primaries?

  • My opinion: it might be tricky. Mongos doesn't know which node is the primary; it gets the information from Grid::get() and iterates over `shard.getHost()... getServers()` to pre-warm. Should it spend additional wait time to discover just the primary, making this a two-step procedure? Do we want to make the code that cumbersome?
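The two-step procedure in question could look roughly like this (a hypothetical sketch; `discover_primary`, `prewarm_primary_only`, and the `isWritablePrimary` responses are illustrative stand-ins, not the actual Grid/Shard API):

```python
# Step 1: ask each host who it is; step 2: pre-warm only the primary.
# The extra round of hello-style probes is the added latency the comment
# above is worried about.

def discover_primary(hosts, hello):
    """Return the first host that self-reports as the writable primary."""
    for host in hosts:
        if hello(host).get("isWritablePrimary"):
            return host
    return None

def prewarm_primary_only(hosts, hello, connect):
    """Open warm connections only to the discovered primary."""
    primary = discover_primary(hosts, hello)
    return connect(primary) if primary is not None else None

responses = {
    "rs0:27017": {"isWritablePrimary": True},
    "rs1:27017": {"isWritablePrimary": False},
    "rs2:27017": {"isWritablePrimary": False},
}
conn = prewarm_primary_only(
    list(responses), lambda h: responses[h], lambda h: "conn:" + h)
```

The trade-off is visible even in the sketch: one extra probe round per shard, plus handling the case where no primary is found.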

 

The proposed action:

  1. Make a best effort to find a simple solution that pre-warms connections to the primary only.
  2. If that gets unclean and/or requires extra overhead compared to the existing code, just fix the tests.


 Comments   
Comment by Githook User [ 13/Oct/21 ]

Author:

{'name': 'Andrew Shuvalov', 'email': 'andrew.shuvalov@mongodb.com', 'username': 'shuvalov-mdb'}

Message: SERVER-60344 Mongos crash on incompatible mongod suspended during warm-up
Branch: v4.2
https://github.com/mongodb/mongo/commit/30dc4f5d4bd98c4d2eb3d37644169b83dfcb5ede

Comment by Andrew Shuvalov (Inactive) [ 30/Sep/21 ]

Based on the conversation above I've made a change to simply temporarily suspend the crash during initial warm-up. Please review: https://github.com/10gen/mongo/pull/1001
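The approach can be sketched roughly like this (an illustrative Python sketch, not the actual patch; `PoolWarmer` and its members are hypothetical names): gate the hard crash behind a flag that is only set during initial warm-up.

```python
# Sketch: tolerate incompatible hosts while the initial warm-up runs,
# and restore the hard-fail behavior once warm-up completes.

class PoolWarmer:
    def __init__(self, fasserter):
        self.warming_up = True            # set only during initial warm-up
        self.fassert = fasserter

    def on_incompatible_host(self, host):
        if self.warming_up:
            return "skipped:" + host      # tolerate laggards at startup
        self.fassert(host)                # after warm-up, keep crashing

    def finish_warmup(self):
        self.warming_up = False

def _fassert(host):
    raise RuntimeError("fassert on incompatible host " + host)

warmer = PoolWarmer(fasserter=_fassert)
during_warmup = warmer.on_incompatible_host("rs2:27017")  # tolerated
warmer.finish_warmup()
```

After `finish_warmup()`, an incompatible host fails hard again, so the existing safety check is only relaxed for the startup window where lagging setFCV replicas are expected.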

Comment by Randolph Tan [ 30/Sep/21 ]

I wasn't part of the discussion when the project was laid out, but I certainly agree with Lamont that it is likely that users would be using secondary connections. I checked the scope documents, and they didn't explicitly say to pre-warm connections only to the primaries.

Comment by Andrew Shuvalov (Inactive) [ 30/Sep/21 ]

anton.oyung, renctan - do you remember why it was decided to pre-warm connections to all replica set servers found in the connection string instead of finding just the primary? I spoke with lamont.nelson, and his opinion was that I should not touch this logic, because a latency-sensitive user might read from secondaries for performance reasons, and limiting pre-warm to just the primary would defeat the purpose of this feature.

Comment by Vishnu Kaushik [ 30/Sep/21 ]

Just some thoughts I had when I initially diagnosed the BF on replication (copying my comment from there): mongos should be alright with seeing a minority of nodes that haven't reached a compatible wire version yet. I think the current method of fasserting immediately is a bit extreme. For the sake of fault tolerance, it must be possible for mongos to connect to a cluster where some minority of nodes are down. A minority of nodes being on the wrong FCV version is a less severe crime than them being down, so the fassert seems a little harsh.

I think the idea of finding the primary first is nice. The mongos should be able to function fine so long as a majority of nodes in the shard are behaving as expected.
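The quorum idea above could look roughly like this (a hypothetical sketch, not server code): treat an incompatible minority like a down minority, and only fail hard when a strict majority of the hosts report an incompatible version.

```python
# Sketch: fassert only when a strict majority of a shard's hosts report
# an incompatible FCV, mirroring the "majority must behave" argument.

def should_fassert(versions, required):
    incompatible = sum(1 for v in versions if v != required)
    return incompatible > len(versions) // 2   # strict majority incompatible
```

Under this rule, one lagging node out of three is tolerated, while a shard where most nodes are on the wrong version still fails hard.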

Generated at Thu Feb 08 05:49:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.