[SERVER-47167] Make hedged read tests more robust Created: 27/Mar/20  Updated: 29/Oct/23  Resolved: 30/Mar/20

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: 4.4.0-rc0, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Cheahuychou Mao
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Sharding 2020-04-06
Participants:
Linked BF Score: 0

 Description   

The test setup in hedged_reads.js and hedging_metrics_server_status.js currently has unintended effects on commands that run outside of the test cases. This has led to various failures on Evergreen.

  • The failpoint networkInterfaceSendRequestsToTargetHostsInAlphabeticalOrder forces NetworkInterfaceTL to wait to acquire a connection from each target host so that requests can be sent in a deterministic order. However, because a command's timer is not set until the request is about to be sent, after the connection has already been acquired, all commands effectively run without timeouts while this failpoint is enabled. That leads to various unexpected behaviors such as timeouts (see BF-16720 and BF-16680). To solve this, we should add a predicate to the failpoint so that it can only be entered when the operation should be hedged.
  • The tests use blockConnection to simulate delays on target nodes. When the failpoint is enabled, other commands (e.g. mirrored commands) can also block. This causes the affected command to run longer than it should (the block time is 5 minutes) and sometimes leads to failures. To solve this, we should use mode {times: 1} and pass the namespace in data so that only one command is blocked on each node.
  • Unfortunately, the _killOperations sent to the slower node by NetworkInterfaceTL when the command finishes can sometimes arrive and finish on the "slower" node before the command starts on that node, so the operation never gets killed. This is undesirable behavior. However, it is caused by the delay on the host itself, so there is no obvious way around it yet. For now, we should make the command block on the faster node for some time to allow the operation to start on the slower node.
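
As a rough illustration of the second and third points, the failpoint configuration could be built along these lines. This is only a sketch under assumptions, not the code from the actual commit: it assumes the tests drive blockConnection through the failCommand failpoint, and the helper name buildBlockConnectionFailPoint, the namespace "test.hedged_reads", the 1000 ms block on the faster node, and the use of failInternalCommands are all hypothetical choices made for the example.

```javascript
// Hypothetical helper (not from the ticket): builds a configureFailPoint
// command document that blocks a single matching command on one node.
// Assumes the failCommand failpoint is the one carrying blockConnection.
function buildBlockConnectionFailPoint(ns, commandName, blockTimeMS) {
    return {
        configureFailPoint: "failCommand",
        // {times: 1} turns the failpoint off after the first matching command,
        // so unrelated commands (e.g. mirrored commands) are not blocked later.
        mode: {times: 1},
        data: {
            failCommands: [commandName],
            // Restrict the failpoint to the test's own namespace.
            namespace: ns,
            blockConnection: true,
            blockTimeMS: blockTimeMS,
            // Assumption: requests forwarded by mongos count as internal.
            failInternalCommands: true,
        },
    };
}

// Assumed values: 5 minutes on the "slower" node (the block time named in the
// description) and a short, arbitrary 1000 ms block on the "faster" node so
// the operation has time to start on the slower node before it is killed.
const slowNodeCmd = buildBlockConnectionFailPoint("test.hedged_reads", "find", 5 * 60 * 1000);
const fastNodeCmd = buildBlockConnectionFailPoint("test.hedged_reads", "find", 1000);
```

In a jstest, each document would then be sent to the corresponding node with something like assert.commandWorked(node.adminCommand(slowNodeCmd)); the node handles are left out here because they depend on the test fixture.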


 Comments   
Comment by Githook User [ 30/Mar/20 ]

Author:

{'email': 'cheahuychou.mao@mongodb.com', 'name': 'Cheahuychou Mao', 'username': 'cheahuychou'}

Message: SERVER-47167 Make hedged read tests more robust

(cherry picked from commit ffff3f7fce95d45192c8b6cb65845907d6128922)
Branch: v4.4
https://github.com/mongodb/mongo/commit/73deb0a22f6b3fd1cfdf525499608100d4340dd2

Comment by Githook User [ 30/Mar/20 ]

Author:

{'email': 'cheahuychou.mao@mongodb.com', 'name': 'Cheahuychou Mao', 'username': 'cheahuychou'}

Message: SERVER-47167 Make hedged read tests more robust
Branch: master
https://github.com/mongodb/mongo/commit/ffff3f7fce95d45192c8b6cb65845907d6128922
