[SERVER-78557] Allow targeting to wait until there is a significant topology change before using a retry Created: 29/Jun/23 Updated: 31/Oct/23 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Lamont Nelson | Assignee: | Wenqin Ye |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | sharding-nyc-subteam3 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Sharding NYC
|
||||||||
| Sprint: | Sharding NYC 2023-09-18, Sharding NYC 2023-10-02, Sharding NYC 2023-10-16, Sharding NYC 2023-10-30 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 113 | ||||||||
| Story Points: | 3 | ||||||||
| Description |
|
In the current version of the code we have retry loops with no backoff and asynchronous replica set monitor failure notification. This creates the scenario where a request can fail, the calling thread calls failedHost on the RSM, and the retry loop then immediately tries another request. This will happen within the span of microseconds, and the next attempt may result in the same failure due to not enough time passing. This ticket is to improve this behavior by blocking the targeter when an error occurs, such as NotPrimary or InterruptedDueToReplStateChange (list not exhaustive), such that we return from the method that reports the failure to the RSM once the getHostsOrRefresh request of the RSM will return a different result (or a timeout occurs). |