[SERVER-72980] Configurable replSetReconfig isSelf socket timeouts Created: 18/Jan/23  Updated: 30/Jan/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Jack Wearden Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-66050 findSelfInConfig should attempt fast ... Closed
related to SERVER-16824 Run isSelf concurrently for all members Backlog
Assigned Teams:
Replication
Participants:

 Description   

During testing we learned that a forced replica set reconfig takes substantially longer to complete than we first anticipated when nodes are down.

We'd like to tune the 30-second isSelf socket timeout by making it configurable through the replSetReconfig command.

In our use, this command is run when we are certain the affected nodes are in a network partition, so timing out after 2-5 seconds is more reasonable. In large clusters, the 30-second timeout could extend the reconfig to several minutes (perhaps several tens of minutes for clusters with many non-voting secondaries).
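A rough sketch of the arithmetic behind that estimate. It assumes the isSelf checks run one after another (SERVER-16824, linked above, tracks running them concurrently) and that every unreachable member costs the full socket timeout. The function name and the member counts are made up for illustration.

```python
def worst_case_wait_seconds(unreachable_members: int,
                            socket_timeout_s: float = 30.0) -> float:
    """Upper bound on time spent waiting on unreachable members.

    Assumption: checks are sequential and each unreachable member
    blocks for the full socket timeout before the next one is tried.
    """
    return unreachable_members * socket_timeout_s


# Six unreachable members at the current 30-second timeout.
print(worst_case_wait_seconds(6))        # 180.0 seconds, i.e. 3 minutes
# The same six members with a 3-second timeout.
print(worst_case_wait_seconds(6, 3.0))   # 18.0 seconds
```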

Perhaps alongside the maxTimeMS parameter, there could be a connectTimeoutMs parameter?
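As an illustration only, the command document might look roughly like the one built below. The connectTimeoutMs field is the proposal in this ticket and does not exist in the server today; the helper name and the sample config are likewise hypothetical.

```python
from typing import Any, Dict, Optional


def build_force_reconfig_cmd(config: Dict[str, Any],
                             max_time_ms: int,
                             connect_timeout_ms: Optional[int] = None) -> Dict[str, Any]:
    """Build a forced replSetReconfig command document.

    The command name must be the first key; Python dicts keep
    insertion order, so it is added first.
    """
    cmd: Dict[str, Any] = {
        "replSetReconfig": config,
        "force": True,
        "maxTimeMS": max_time_ms,
    }
    if connect_timeout_ms is not None:
        # HYPOTHETICAL: proposed in this ticket, not an existing parameter.
        cmd["connectTimeoutMs"] = connect_timeout_ms
    return cmd


# Placeholder config; a real one would come from the current replica set config.
example_config = {"_id": "rs0", "version": 2, "members": []}
print(build_force_reconfig_cmd(example_config, max_time_ms=60000, connect_timeout_ms=3000))
```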



 Comments   
Comment by Opal Hoyt [ 30/Jan/23 ]

Backlogging this in favor of BACKPORT-14562

Comment by Xuerui Fa [ 26/Jan/23 ]

Sounds good. We originally did not want to do those backports since EOL was coming soon, but I've requested them so that we can discuss further.

Comment by Jack Wearden [ 26/Jan/23 ]

xuerui.fa@mongodb.com could we also request a backport of SERVER-66050 to 4.4? (And possibly 4.2, considering there will be customers using that for a short while on Atlas past EOL?)

Comment by Xuerui Fa [ 23/Jan/23 ]

Also requesting backports for SERVER-66050

Comment by Xuerui Fa [ 23/Jan/23 ]

We recently completed SERVER-66050, which allows the node to try the fast path for every member in the replica set before attempting the slow path (the path that can take 30 seconds and result in a socket timeout). We hope that SERVER-66050 largely addresses instances of slow isSelf commands and improves the overall speed.
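The ordering described above can be sketched as two passes over the member list. This is not the server's implementation (which is C++); the function and callback names are hypothetical and only show why the slow path is skipped when any member matches on the fast path.

```python
from typing import Callable, Optional, Sequence


def find_self_index(members: Sequence[str],
                    fast_check: Callable[[str], bool],
                    slow_check: Callable[[str], bool]) -> Optional[int]:
    """Return the index of the member that is this node, or None."""
    # Pass 1: run the cheap check against every member first.
    for i, host in enumerate(members):
        if fast_check(host):
            return i
    # Pass 2: only if no member matched, fall back to the slow check,
    # which is the one that may block on a socket timeout.
    for i, host in enumerate(members):
        if slow_check(host):
            return i
    return None


# Stub checks: the node is the last member, and slow calls are recorded.
slow_calls = []
members = ["a:27017", "b:27017", "c:27017"]
idx = find_self_index(
    members,
    fast_check=lambda h: h == "c:27017",
    slow_check=lambda h: slow_calls.append(h) or False,
)
print(idx, slow_calls)  # 2 []
```

With a single interleaved pass, the slow check would have run for the first two members before the third was ever tried.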

Making the parameter tunable is an interesting idea that we could consider, but if the isSelf command has significantly improved after SERVER-66050, it may not be necessary.

 
