[SERVER-67716] Clarify policy for when responses to hedged requests should cancel outstanding requests Created: 30/Jun/22 Updated: 05/Dec/22 |
|
| Status: | Open |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | George Wangensteen | Assignee: | Backlog - Service Architecture |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Service Arch
|
| Participants: |
| Description |
|
When an operation is 'hedged', mongos sends copies of that operation to multiple mongod nodes. The purpose is to 'hedge' the operation: if one of the mongods cannot respond or is slow to respond, we may still get a response back from another. This avoids slow queries caused by a slow mongod, and avoids having to retry the operation after a timeout when one mongod is not responding. Today, when we receive a response from one mongod for a hedged operation, we will often cancel the outstanding operation on the other mongod, even if the response we received was an error. The only errors for which we won't cancel outstanding operations are maxTimeMS expired and stale sharding config errors. This may be the correct policy, but it may also cancel operations that would have succeeded and force lengthier retries, preventing us from getting any benefit from hedging. The policy is also opaque: it is hard-coded into the networking layer and is not configurable by consumers of the API.
We should clarify which responses to a hedged operation should cause us to cancel the outstanding hedged requests, and which should cause us to continue waiting. Once we have this policy, we should consider making it configurable on a per-request basis for consumers of the hedging API. |
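As a rough illustration only, the behaviour described above and a possible per-request override could be sketched as follows. This is not the server's implementation: `HedgeAction`, `Response`, `default_policy`, `wait_on_any_error`, `resolve_action`, and the two error labels are all hypothetical names invented for this sketch, and the default policy simply encodes the rule stated in this ticket (cancel on any response except maxTimeMS expired and stale sharding config errors).

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable, Optional


class HedgeAction(Enum):
    CANCEL_OUTSTANDING = auto()  # accept this response, cancel the other requests
    KEEP_WAITING = auto()        # disregard this response, wait for another one


@dataclass(frozen=True)
class Response:
    ok: bool
    error: Optional[str] = None  # hypothetical error label, not a real error code


# A policy maps one hedged response to an action (hypothetical type alias).
HedgePolicy = Callable[[Response], HedgeAction]

# Placeholder labels for the two error classes named in this ticket.
MAX_TIME_MS_EXPIRED = "MAX_TIME_MS_EXPIRED"
STALE_SHARDING_CONFIG = "STALE_SHARDING_CONFIG"


def default_policy(response: Response) -> HedgeAction:
    """Mirror the behaviour described in this ticket: cancel outstanding
    requests on any response, except for the two exempt error classes."""
    exempt = {MAX_TIME_MS_EXPIRED, STALE_SHARDING_CONFIG}
    if not response.ok and response.error in exempt:
        return HedgeAction.KEEP_WAITING
    return HedgeAction.CANCEL_OUTSTANDING


def wait_on_any_error(response: Response) -> HedgeAction:
    """One possible alternative: only a success cancels the other requests."""
    if response.ok:
        return HedgeAction.CANCEL_OUTSTANDING
    return HedgeAction.KEEP_WAITING


def resolve_action(response: Response,
                   policy: Optional[HedgePolicy] = None) -> HedgeAction:
    """Apply the caller-supplied policy, falling back to the default."""
    return (policy or default_policy)(response)
```

Under this sketch, a consumer that prefers to wait for a successful response would pass `policy=wait_on_any_error` on a given request, while callers that pass nothing keep the behaviour described above.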