[SERVER-78557] Allow targeting to wait until there is a significant topology change before using a retry Created: 29/Jun/23  Updated: 31/Oct/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Lamont Nelson Assignee: Wenqin Ye
Resolution: Unresolved Votes: 0
Labels: sharding-nyc-subteam3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-50342 Make version of Shard::runCommand tha... Open
Assigned Teams:
Sharding NYC
Sprint: Sharding NYC 2023-09-18, Sharding NYC 2023-10-02, Sharding NYC 2023-10-16, Sharding NYC 2023-10-30
Participants:
Linked BF Score: 113
Story Points: 3

 Description   

The current code has retry loops with no backoff, while replica set monitor (RSM) failure notification is asynchronous. As a result, a request can fail, the calling thread calls failedHost on the RSM, and the retry loop immediately issues another request. All of this happens within microseconds, so the next attempt may hit the same failure because too little time has passed for the RSM's view of the topology to change.

This ticket proposes improving this behavior by blocking the targeter when an error such as NotPrimary or InterruptedDueToReplStateChange occurs (the list is not exhaustive). The method that reports the failure to the RSM would return only once a getHostsOrRefresh request on the RSM would return a different result, or once a timeout occurs.
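
The proposed blocking behavior could be sketched roughly as follows. This is only an illustration under stated assumptions, not the server's implementation: SketchMonitor, onTopologyChange, currentPrimary and failedHostAndWait are hypothetical names, the "topology" is reduced to a single primary host string, and a generation counter stands in for "getHostsOrRefresh would return a different result".

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical stand-in for a replica set monitor; NOT the real RSM API.
class SketchMonitor {
public:
    explicit SketchMonitor(std::string primary) : _primary(std::move(primary)) {}

    // What a targeting call would return right now.
    std::string currentPrimary() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _primary;
    }

    // Called by the (asynchronous) monitoring side when it observes a topology.
    // Only a change of primary counts as significant in this sketch.
    void onTopologyChange(const std::string& newPrimary) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            if (newPrimary == _primary) {
                return;  // same targeting result, do not wake waiters
            }
            _primary = newPrimary;
            ++_generation;
        }
        _changed.notify_all();
    }

    // Reports a failed host and blocks until targeting would yield a different
    // result or the timeout expires. Returns true if the result changed.
    bool failedHostAndWait(const std::string& failedHost,
                           std::chrono::milliseconds timeout) {
        assert(timeout.count() >= 0);
        std::unique_lock<std::mutex> lk(_mutex);
        if (_primary != failedHost) {
            return true;  // the topology already moved past the failed host
        }
        const std::uint64_t startGeneration = _generation;
        return _changed.wait_for(
            lk, timeout, [&] { return _generation != startGeneration; });
    }

private:
    mutable std::mutex _mutex;
    std::condition_variable _changed;
    std::string _primary;
    std::uint64_t _generation = 0;
};
```

With something like this, a retry loop that calls failedHostAndWait before its next attempt would no longer spin within microseconds against the same host; it would retry either after the targeting result has changed or after the timeout, at which point the caller can decide whether to retry or give up.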

