[SERVER-67716] Clarify policy for when responses to hedged requests should cancel outstanding requests Created: 30/Jun/22  Updated: 05/Dec/22

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: George Wangensteen Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Service Arch
Participants:

 Description   

When an operation is 'hedged', mongos sends copies of that operation to multiple mongod nodes. The purpose is to 'hedge' the operation: if one of the mongods cannot respond or is slow to respond, we may still get a response from another. This avoids slow queries caused by a slow mongod, and avoids having to retry the operation after a timeout if one mongod is not responding.

Today, when we receive a response from one mongod for a hedged operation, we will often cancel the outstanding operations on the other mongods, even if the response we received was an error. The only errors for which we won't cancel outstanding operations are maxTimeMS expired and stale sharding config errors. This may be the correct policy, but it may also result in us cancelling operations that would have succeeded and forcing lengthier retries, which prevents us from getting any benefit from hedging. The policy is also opaque: it is hard-coded into the networking layer and is not configurable by consumers of the API.
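
As a rough illustration of the behaviour described above, the current decision can be modelled as a single predicate over the first response received. This is a minimal sketch in Python for brevity, not the server's code: the `Response` type, the `should_cancel_outstanding` function, and the error labels are all hypothetical names chosen for this example, and the labels are not claimed to match the server's actual error-code identifiers.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative labels only (an assumption for this sketch); they stand in for
# "maxTimeMS expired" and "stale sharding config" errors from the description.
NON_CANCELLING_ERRORS = frozenset({"MaxTimeMSExpired", "StaleConfig"})


@dataclass(frozen=True)
class Response:
    """A reply to one copy of a hedged operation (hypothetical type)."""
    target: str                  # which mongod answered
    error: Optional[str] = None  # None means the operation succeeded


def should_cancel_outstanding(response: Response) -> bool:
    """Return True if this response should cancel the other hedged copies."""
    if response.error is None:
        # A successful response always makes the other copies redundant.
        return True
    # Any error cancels the other copies, except the two exempted kinds.
    return response.error not in NON_CANCELLING_ERRORS
```

Under this model, an error such as an unreachable host on one mongod cancels a copy that might still have succeeded on another, which is the concern raised above.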

We should clarify which responses to a hedged operation should result in us cancelling the outstanding hedged requests, and which should cause us to continue waiting. Once we have this policy, we should consider making it configurable on a per-request basis for consumers of the hedging API.
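
One possible shape for a per-request policy is a callable supplied with each hedged request that decides whether a given response is final. The sketch below is in Python for brevity and is only an assumption about how such an API might look. `Reply`, `CancelPolicy`, `resolve_hedged`, and the two example policies are hypothetical names, not part of the existing hedging API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Reply:
    """A reply to one copy of a hedged operation (hypothetical type)."""
    target: str                  # which mongod answered
    error: Optional[str] = None  # None means the operation succeeded


# A policy returns True when the reply should cancel the outstanding copies.
CancelPolicy = Callable[[Reply], bool]


def cancel_on_any_reply(reply: Reply) -> bool:
    """Treat the first reply as final, whether it is a success or an error."""
    return True


def cancel_on_success_only(reply: Reply) -> bool:
    """Keep waiting after an error; only a success cancels the other copies."""
    return reply.error is None


def resolve_hedged(replies: Iterable[Reply], policy: CancelPolicy) -> Optional[Reply]:
    """Walk replies in arrival order and return the first one the policy accepts.

    If the policy accepts none, the last reply seen is returned so the caller
    still has an error to report. Returns None if there were no replies.
    """
    last: Optional[Reply] = None
    for reply in replies:
        last = reply
        if policy(reply):
            return reply
    return last
```

For example, given an error from one mongod followed by a success from another, `cancel_on_any_reply` would surface the error, while `cancel_on_success_only` would wait for and return the success.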

