- Type: Bug
- Resolution: Works as Designed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Service Arch
- ALL
- Workload Scheduling 2024-07-22
BF-33912's lone BFG at the time of writing appears to have been caused by a replication error in a TSAN variant (which is quite slow), which left a host down for longer than usual (see the comments here for details).
This caused the client threads to receive a mix of "Connection reset by peer," "Connection refused," and "HostUnreachable" errors, but only HostUnreachable is considered a retriable error that will not consume the retry limit.
In suites where we kill or terminate shard processes, we should expect to receive network errors more frequently, and we should expect them to be transient.
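To illustrate the retry accounting described above, here is a minimal Python sketch. It is not the test harness's actual code: `NetworkError`, `run_with_retries`, `flaky`, the two error sets, and the `max_free_retries` cap are all hypothetical names introduced for illustration. It assumes that "HostUnreachable" is retried without consuming the retry limit, while the two connection errors each consume one retry.

```python
import time

# Hypothetical classification, for illustration only.
FREE_RETRY_ERRORS = {"HostUnreachable"}  # retried without consuming the limit
COUNTED_RETRY_ERRORS = {"Connection reset by peer", "Connection refused"}


class NetworkError(Exception):
    """Hypothetical error type carrying the kind of network failure."""

    def __init__(self, kind):
        super().__init__(kind)
        self.kind = kind


def run_with_retries(op, retry_limit=3, max_free_retries=100, sleep_s=0.0):
    """Call op() until it succeeds or the retry budget is exhausted."""
    counted = 0  # retries that consume the limit
    free = 0     # retries that do not (capped so the loop always terminates)
    while True:
        try:
            return op()
        except NetworkError as err:
            if err.kind in FREE_RETRY_ERRORS:
                free += 1
                if free > max_free_retries:
                    raise
            elif err.kind in COUNTED_RETRY_ERRORS:
                counted += 1
                if counted > retry_limit:
                    raise
            else:
                raise  # unknown error kinds are not retried
            time.sleep(sleep_s)


def flaky(kinds):
    """Build an operation that fails once per entry in kinds, then succeeds."""
    pending = list(kinds)

    def op():
        if pending:
            raise NetworkError(pending.pop(0))
        return "ok"

    return op
```

Under these assumptions, a host that stays down long enough to produce several "Connection refused" errors exhausts the limit, whereas any number of "HostUnreachable" errors up to the cap does not. One option this sketch suggests for kill/terminate suites is to move the connection errors into the non-consuming set.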