- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Internal Code
- None
- Fully Compatible
- ALL
- v4.4
- Sharding 2020-04-06
- 0
The test setup in hedged_reads.js and hedging_metrics_server_status.js currently has unintended effects on commands that are run outside of the test cases. This has led to various failures on evergreen.
- The failpoint networkInterfaceSendRequestsToTargetHostsInAlphabeticalOrder forces NetworkInterfaceTL to wait to acquire a connection from each target host so that requests can be sent in a deterministic order. However, a command's timer is not set until the request is about to be sent, which happens only after the connection has been acquired. As a result, all commands effectively run without timeouts while this failpoint is enabled. That leads to various unexpected behaviors such as timeouts (see BF-16720 and BF-16680). To solve this, we should add a predicate to the failpoint so that it can only be entered when the operation should be hedged.
- The tests use blockConnection to simulate delays on target nodes. When the failpoint is enabled, other commands can also block (e.g. mirrored commands). This causes those commands to run longer than they should (the block time is 5 minutes) and sometimes leads to failures. To solve this, we should use mode {times: 1} and pass the namespace into data so that only one command is blocked on each node.
- Unfortunately, the _killOperations that NetworkInterfaceTL sends to the slower node when the command finishes can sometimes arrive and complete on the "slower" node before the command has started on that node, so the operation never gets killed. This is undesirable behavior. However, it is caused by the delay on the host itself, so there is no obvious way around it yet. For now, we should make the command block on the faster node for some time to allow the operation to start on the slower node.
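
The narrower failpoint setup from the second item could look roughly like the sketch below. It is not the actual patch: the helper name buildBlockOnceCommand is hypothetical, and it assumes the tests configure blockConnection through the failCommand failpoint and that internal (mongos-to-shard) commands need failInternalCommands to be matched.

```javascript
// Hypothetical helper (not from the ticket): builds a configureFailPoint
// command document that blocks a single matching command on one node.
// Assumption: blockConnection is configured through the failCommand failpoint.
function buildBlockOnceCommand(ns, commandNames, blockTimeMS) {
    return {
        configureFailPoint: "failCommand",
        // {times: 1} turns the failpoint off after the first matching command,
        // so unrelated commands (e.g. mirrored commands) are not blocked later.
        mode: {times: 1},
        data: {
            failCommands: commandNames,
            // Only commands that target this namespace match the failpoint.
            namespace: ns,
            blockConnection: true,
            blockTimeMS: blockTimeMS,
            // Assumption: needed so commands sent by other servers also match.
            failInternalCommands: true,
        },
    };
}

// Example values are illustrative only.
const cmd = buildBlockOnceCommand("test.hedged_reads", ["find"], 5 * 60 * 1000);
console.log(JSON.stringify(cmd));
// In a mongo shell test this would be sent to the node to slow down, e.g.:
//   assert.commandWorked(node.adminCommand(cmd));
```

Scoping by both namespace and {times: 1} means a stray command on another collection never consumes the single allowed block, and a second command on the same collection is not delayed once the first has been blocked.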