[SERVER-60344] Action plan on lagging setFCV replicas breaking tests Created: 30/Sep/21 Updated: 29/Oct/23 Resolved: 13/Oct/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.0.28 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Andrew Shuvalov (Inactive) | Assignee: | Andrew Shuvalov (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Sharding 2021-10-04, Sharding 2021-10-18 |
| Participants: | |
| Linked BF Score: | 145 |
| Description |
|
This ticket is to discuss and decide on proper actions on test breakages investigated by cheahuychou.mao:
There are several questions we need to resolve before addressing this. Is it correct for setFCV() to wait only for a majority commit, or do we want it to replicate to all replicas first?
Is it correct that mongos pre-warms connections to non-primaries?
The proposed action:
|
| Comments |
| Comment by Githook User [ 13/Oct/21 ] |
|
Author: Andrew Shuvalov <andrew.shuvalov@mongodb.com> (shuvalov-mdb) Message: |
| Comment by Andrew Shuvalov (Inactive) [ 30/Sep/21 ] |
|
Based on the conversation above, I've made a change that temporarily suspends the crash during initial warm-up. Please review: https://github.com/10gen/mongo/pull/1001 |
| Comment by Randolph Tan [ 30/Sep/21 ] |
|
I wasn't part of the discussion when the project was laid out, but I certainly agree with Lamont that users are likely to be using secondary connections. I checked the scope documents and they didn't explicitly say to pre-warm connections only to the primaries. |
| Comment by Andrew Shuvalov (Inactive) [ 30/Sep/21 ] |
|
anton.oyung, renctan - do you remember why it was decided to pre-warm connections to all replica set servers found in the connection string instead of finding just the primary? I spoke with lamont.nelson, and his opinion was that I should not touch this logic, because a latency-sensitive user might read from secondaries for performance reasons, and limiting pre-warm to just the primary would defeat the purpose of this feature. |
| Comment by Vishnu Kaushik [ 30/Sep/21 ] |
|
Just some thoughts I had when I initially diagnosed the BF on replication (copying my comment from there): The mongos should be alright with viewing a minority of nodes that haven't reached a compatible wire version yet. I think the current method of fasserting immediately is a little bit extreme. The reason I say this is, for the sake of fault tolerance it is possible for the mongos to try to connect to a cluster where some minority of nodes are down. A minority of nodes being on the wrong FCV version is a less severe crime than them being down, so maybe the fassert is a little harsh. I think the idea of finding the primary first is nice. The mongos should be able to function fine so long as a majority of nodes in the shard are behaving as expected. |
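The tolerance rule suggested in the comment above can be illustrated with a small sketch. This is not the mongos implementation; `NodeStatus`, `shard_is_usable`, and the host names are hypothetical, and the only assumption taken from the comment is that a shard counts as usable while a strict majority of its nodes are reachable and wire-version compatible.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Hypothetical view of one replica set member as seen by a router."""
    host: str
    reachable: bool
    wire_compatible: bool


def shard_is_usable(nodes):
    """Return True when a strict majority of nodes is reachable and
    reports a compatible wire version.

    A lagging or down minority is tolerated rather than treated as fatal.
    """
    healthy = sum(1 for n in nodes if n.reachable and n.wire_compatible)
    return healthy > len(nodes) // 2


# Example: one secondary has not yet replicated the new FCV.
replica_set = [
    NodeStatus("node-a:27017", reachable=True, wire_compatible=True),
    NodeStatus("node-b:27017", reachable=True, wire_compatible=True),
    NodeStatus("node-c:27017", reachable=True, wire_compatible=False),
]
usable = shard_is_usable(replica_set)
```

Under this rule a node on the wrong FCV is handled the same way as a node that is down: it only matters once such nodes stop being a minority.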