[SERVER-67465] Ensure timeouts do not fail hedged operations Created: 22/Jun/22  Updated: 29/Oct/23  Resolved: 24/Aug/22

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: 4.4.17, 6.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Amirsaman Memaripour Assignee: Amirsaman Memaripour
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to SERVER-69121 Update FCV version for `hedged_reads.js` Closed
related to SERVER-69402 Update FCV version for ttl_index_opti... Closed
is related to SERVER-68704 Clarify the semantics of failing hedg... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Steps To Reproduce:
  • Build mongo binaries from the v4.4 branch.
  • Run the following repeatedly until it fails (usually fails within the first 20 runs):

    ./buildscripts/resmoke.py run --suite=sharding_continuous_config_stepdown jstests/sharding/hedged_reads.js
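
    The "run repeatedly until it fails" step can be automated. The following is a minimal sketch, not part of the original report: `run_until_failure` and the `max_runs` cap of 50 are hypothetical, and it assumes the command is launched from the repository root and that a failing run exits with a non-zero code.

```python
import subprocess

# Command copied verbatim from the reproduction step above.
RESMOKE_CMD = [
    "./buildscripts/resmoke.py",
    "run",
    "--suite=sharding_continuous_config_stepdown",
    "jstests/sharding/hedged_reads.js",
]


def run_until_failure(cmd, max_runs=50):
    """Run `cmd` up to `max_runs` times.

    Returns the 1-based index of the first run that exits with a non-zero
    code, or None if every run passed. The cap is an arbitrary assumption.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return attempt
    return None
```

    Calling `run_until_failure(RESMOKE_CMD)` from the repository root returns the number of the first failing run, or `None` if no failure was observed within the cap.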
    

Sprint: Service Arch 2022-07-25, Service Arch 2022-08-08, Service Arch 2022-08-22, Service Arch 2022-09-05
Participants:
Linked BF Score: 0

 Description   

A hedged operation that fails with NetworkInterfaceExceededTimeLimit can cause the original operation to fail as well. Consider the following example (reproducible on v4.4):

  • Mongos attempts to hedge a read operation.
  • The hedged operation, running on a shard server, needs to query the config server (e.g., as part of waitForReadConcern).
  • The config server is temporarily unavailable (e.g., a step-down is in progress), thus it cannot accept new connections.
  • The hedged operation's query to the config server times out (i.e., NetworkInterfaceExceededTimeLimit).
  • The hedged operation completes and returns the timeout error to mongos.
  • Since the error is not MaxTimeMSExceeded, mongos kills the outstanding (original) operation and returns the non-OK status to the caller.
  • The operation fails, while it would have (eventually) succeeded without hedging.
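
The sequence above can be modelled with a small simulation. This is a hypothetical sketch in Python, not the server's C++ implementation: `Attempt`, `settle`, and the two error sets are illustrative names, and the rule that a hedged error is skipped only while another attempt is still outstanding is an assumption inferred from the steps in this description. The second set shows one possible outcome if the network timeout were also treated as ignorable on hedged attempts.

```python
from typing import FrozenSet, List, Optional, Tuple

# (is_hedge, error): error is None on success, otherwise an error-code name.
Attempt = Tuple[bool, Optional[str]]

# Behaviour described in the steps above: only MaxTimeMSExceeded is tolerated.
PRE_FIX_IGNORABLE = frozenset({"MaxTimeMSExceeded"})
# Assumption: a possible fix also tolerates the network timeout on hedges.
WITH_NETWORK_TIMEOUT = PRE_FIX_IGNORABLE | {"NetworkInterfaceExceededTimeLimit"}


def settle(attempts: List[Attempt], ignorable: FrozenSet[str]) -> Optional[str]:
    """Return the final error of the operation, or None on success.

    `attempts` lists responses in arrival order. A hedged attempt's error is
    skipped only if it is in `ignorable` and another attempt is still
    outstanding; otherwise the first response settles the operation.
    """
    if not attempts:
        raise ValueError("at least one attempt is required")
    for index, (is_hedge, error) in enumerate(attempts):
        others_outstanding = index < len(attempts) - 1
        if is_hedge and error in ignorable and others_outstanding:
            continue
        return error
    return None  # Unreachable: the last attempt always settles the operation.


# The hedged attempt times out first; the original would succeed later.
scenario = [(True, "NetworkInterfaceExceededTimeLimit"), (False, None)]
print(settle(scenario, PRE_FIX_IGNORABLE))     # NetworkInterfaceExceededTimeLimit
print(settle(scenario, WITH_NETWORK_TIMEOUT))  # None
```

With the first set the hedged timeout settles the whole operation, which matches the failure described here; with the second set the original attempt's success is returned instead.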

This ticket, or its sub-tasks, should:

  • Check if this issue also applies to newer branches (post v4.4).
  • Clarify the semantics for failing hedged operations (e.g., what errors may be ignored on hedged operations).
  • Fix the implementation to honor the semantics.


 Comments   
Comment by Githook User [ 08/Sep/22 ]

Author:

{'name': 'Amirsaman Memaripour', 'email': 'amirsaman.memaripour@mongodb.com', 'username': 'samanca'}

Message: SERVER-67465 Ensure network timeouts do not fail hedged operations

(cherry picked from commit 1744ab66eafba2dcc6dd96d7fa0d0d77eeae35d8)
Branch: v4.4
https://github.com/mongodb/mongo/commit/97ba88b52784d3c81a23a2994f50d16f3bf2dab0
