Major - P3
A hedged operation that fails with NetworkInterfaceExceededTimeLimit can cause the original operation to fail as well. Consider the following example (reproducible on v4.4):
- Mongos attempts to hedge a read operation.
- The hedged operation, running on a shard server, needs to query the config server (e.g., as part of waitForReadConcern).
- The config server is temporarily unavailable (e.g., a step-down is in progress), thus it cannot accept new connections.
- Querying the config server times out for the hedged operation (i.e., it fails with NetworkInterfaceExceededTimeLimit).
- The hedged operation completes and returns the time-out error to the mongos server.
- Since the error is not MaxTimeMSExceeded, mongos kills the still-outstanding original operation and returns the non-okay status to the caller (see here).
- The operation fails, while it would have (eventually) succeeded without hedging.
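The sequence above can be illustrated with a toy model in Python. This is only a sketch of the behavior as described in this ticket, not the actual mongos code: the names `Response` and `resolve` are hypothetical, and the model assumes that the only error ignored on a hedged response is MaxTimeMSExceeded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    source: str           # "original" or "hedge"
    error: Optional[str]  # None means success; otherwise an error code name


def resolve(responses_in_arrival_order: List[Response]) -> Response:
    """Toy model (assumption): the first accepted response wins.

    A hedged response carrying MaxTimeMSExceeded is skipped; any other
    response, including any other hedge error, is returned to the caller
    and the remaining attempt is considered killed.
    """
    if not responses_in_arrival_order:
        raise ValueError("no responses")
    for resp in responses_in_arrival_order:
        if resp.source == "hedge" and resp.error == "MaxTimeMSExceeded":
            continue  # ignored; keep waiting for the other attempt
        return resp
    # Only ignorable hedge errors arrived; surface the last one.
    return responses_in_arrival_order[-1]


# The hedge times out against the config server before the original finishes.
arrival = [
    Response("hedge", "NetworkInterfaceExceededTimeLimit"),
    Response("original", None),  # would have succeeded, but is never consulted
]
result = resolve(arrival)
```

In this model `result` is the hedge's NetworkInterfaceExceededTimeLimit error, even though the original attempt would have succeeded.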
This ticket, or its sub-tasks, should:
- Check if this issue also applies to newer branches (post v4.4).
- Clarify the semantics for failing hedged operations (e.g., what errors may be ignored on hedged operations).
- Fix the implementation to honor the semantics.
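As a starting point for the semantics discussion, one possible shape is an explicit set of error codes that are ignored when they come from a hedged operation. The Python sketch below is purely illustrative: `IGNORABLE_ON_HEDGE`, `pick_result`, and the contents of the set are assumptions, and deciding which errors actually belong in the set is what this ticket needs to settle.

```python
from typing import FrozenSet, List, Optional, Tuple

# Hypothetical set of error codes that a hedged attempt may fail with
# without failing the overall operation.
IGNORABLE_ON_HEDGE: FrozenSet[str] = frozenset({
    "MaxTimeMSExceeded",
    "NetworkInterfaceExceededTimeLimit",
})

# (source, error): source is "original" or "hedge"; error None means success.
Attempt = Tuple[str, Optional[str]]


def pick_result(
    attempts_in_arrival_order: List[Attempt],
    ignorable: FrozenSet[str] = IGNORABLE_ON_HEDGE,
) -> Attempt:
    """Return the first response that is not an ignorable hedge error."""
    if not attempts_in_arrival_order:
        raise ValueError("no attempts")
    fallback: Optional[Attempt] = None
    for source, error in attempts_in_arrival_order:
        if source == "hedge" and error in ignorable:
            # Remember it, but keep waiting for the other attempt.
            if fallback is None:
                fallback = (source, error)
            continue
        return (source, error)
    # Nothing else arrived; surface the ignored hedge error.
    assert fallback is not None
    return fallback


chosen = pick_result([
    ("hedge", "NetworkInterfaceExceededTimeLimit"),
    ("original", None),
])
```

Under this policy `chosen` is the original attempt's success, while a hedge error outside the set (for example, a hypothetical non-ignorable code) would still be returned to the caller immediately.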