[SERVER-63792] Improve coverage of blackholing network requests Created: 17/Feb/22  Updated: 06/Dec/22

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Judah Schvimer
Assignee: Backlog - Replication Team
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-63417 Oplog fetcher should not retry when a... Closed
is related to SERVER-63512 Use optimized (no isSelf calls) recon... Closed
Assigned Teams:
Replication
Participants:

 Description   

This is likely a gap in our test coverage that can lead to longer unavailability windows than we'd like.



 Comments   
Comment by Robert Guo (Inactive) [ 22/Feb/22 ]

Hm, interesting question. We have the datacenter delay simulation mechanism in DSI using tc that could trigger TCP retransmissions with a couple of small tweaks; the same invocation can be run in JS tests. But if we just want to test that this sequence of events does not cause additional delays, mongobridge's delayMessagesFrom and a timer might be sufficient? The latter may be more deterministic and quicker.

Another way to catch this type of issue systematically, if there's enough value/interest, is to add or extend a passthrough that delays commands at random with a fixed seed. If the resulting delay to running the next command is beyond some threshold, e.g. > 5x the delay injected into the original command, the test reports this increase in latency to the user.
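
A rough sketch of the check described above, written as a standalone Python simulation rather than against the real passthrough infrastructure. Every name here (run_with_random_delays, Amplification, the next_latency callback, and the two toy systems) is hypothetical and only illustrates the idea: a fixed seed picks which commands get delayed, and a command is reported when the latency of the next command exceeds a multiple of the injected delay.

```python
import random
from typing import Callable, List, NamedTuple


class Amplification(NamedTuple):
    command_index: int
    injected_delay_s: float
    next_latency_s: float


def run_with_random_delays(
    num_commands: int,
    next_latency: Callable[[int, float], float],
    seed: int,
    delay_s: float = 0.1,
    delay_probability: float = 0.25,
    threshold_factor: float = 5.0,
) -> List[Amplification]:
    """Delay commands at random and report disproportionate follow-on latency.

    next_latency(index, injected_delay) is a hypothetical hook that returns
    how long the *next* command took after command `index` was delayed.
    """
    rng = random.Random(seed)  # fixed seed: the same commands are delayed on every run
    reports = []
    for i in range(num_commands):
        injected = delay_s if rng.random() < delay_probability else 0.0
        observed = next_latency(i, injected)
        # Report only when the follow-on latency exceeds the threshold multiple.
        if injected > 0 and observed > threshold_factor * injected:
            reports.append(Amplification(i, injected, observed))
    return reports


# Toy stand-ins for a system under test (assumptions, not real server behavior).
def well_behaved(index: int, injected: float) -> float:
    return injected + 0.01  # the delay passes through with small overhead


def amplifying(index: int, injected: float) -> float:
    return injected * 10 if injected else 0.01  # e.g. a retry with a long timeout


if __name__ == "__main__":
    for report in run_with_random_delays(50, amplifying, seed=42):
        print(report)
```

Because the seed is fixed, a reported amplification can be reproduced by rerunning with the same seed; the 5x factor matches the example threshold mentioned above and would need tuning in practice.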

Comment by Judah Schvimer [ 18/Feb/22 ]

We should make sure to include testing of sharded clusters, with black holes between mongos and mongod and between shards as well.

Generated at Thu Feb 08 05:58:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.