[SERVER-41786] Set a smaller connection timeout for _await_cmd_all_nodes Created: 16/Jun/19  Updated: 20/Jun/19  Resolved: 20/Jun/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.3 Required
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Lingzhi Deng Assignee: Lingzhi Deng
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-41608 Re-enable jstestfuzz_concurrent_repli... Closed
Operating System: ALL
Backport Requested:
v4.2, v4.0, v3.6
Sprint: Repl 2019-07-01
Participants:
Linked BF Score: 20

 Description   

The problem was that the timeout for get_primary() is 30s, and the connection timeout for mongo_client also defaults to 30s in this test. So if connecting to one node blocks, the loop that searches for the primary is stuck on that node too, and the overall 30s timeout expires before isMaster has been checked on all nodes. I think we should set a smaller connection timeout (timeout_millis) when looping through the nodes, so that a single bad node cannot use up the entire time budget.
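To make the timing argument concrete, here is a minimal Python sketch on a virtual clock, with no real networking. SimNode and find_primary are hypothetical names, not resmoke or pymongo APIs; connect_timeout_ms plays the role of timeout_millis, and the is_primary flag stands in for an isMaster check. It assumes a connection attempt cannot be interrupted once started, so the overall deadline is only checked between attempts.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SimNode:
    """Hypothetical stand-in for a replica set member (not a resmoke or pymongo type)."""
    name: str
    connect_ms: Optional[int]  # time to establish a connection; None = never connects (hung node)
    is_primary: bool = False


def find_primary(nodes: List[SimNode], overall_timeout_ms: int,
                 connect_timeout_ms: int,
                 retry_interval_ms: int = 100) -> Tuple[Optional[str], int]:
    """Simulate a get_primary()-style search loop on a virtual clock.

    Each connection attempt blocks for min(connect time, connect_timeout_ms).
    The overall deadline is only checked between attempts, mirroring a
    blocking client call that cannot be interrupted mid-connect.
    Returns (name of the primary or None, virtual milliseconds elapsed).
    """
    elapsed = 0
    while elapsed < overall_timeout_ms:
        for node in nodes:
            if node.connect_ms is None or node.connect_ms > connect_timeout_ms:
                # Blocked until the per-node connection timeout fired.
                elapsed += connect_timeout_ms
                connected = False
            else:
                elapsed += node.connect_ms
                connected = True
            # Stands in for running isMaster on a successfully connected node.
            if connected and node.is_primary:
                return node.name, elapsed
            # Overall deadline expired before the remaining nodes were checked.
            if elapsed >= overall_timeout_ms:
                return None, elapsed
        # Pause before the next round of polling.
        elapsed += retry_interval_ms
    return None, elapsed
```

With nodes = [SimNode("hung", None), SimNode("primary", 10, is_primary=True)], calling find_primary(nodes, 30000, 30000) returns (None, 30000): the hung node consumes the whole budget before the healthy primary is ever asked. Lowering the connection timeout to 5000 ms returns ("primary", 5010).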



 Comments   
Comment by Lingzhi Deng [ 20/Jun/19 ]

Talked to Max offline, and we decided to close this CR without merging it. multi_shard_multi_stmt_txn_{kill,stepdown}_primary_jscore_passthrough were fixed in SERVER-41806, and the connection slowness seen in jstestfuzz_concurrent_replication_continuous_stepdown_flow_control might have been caused by a more fundamental issue. We will need to make sure this doesn't happen again when we re-enable the suite in SERVER-41608. If slow connections are still an issue at that time, we will reconsider adding this.

Generated at Thu Feb 08 04:58:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.