While working on PYTHON-2486 I ran into an issue with the default localThresholdMS, which I described to the team in Slack:
I’m seeing some strange behavior while benchmarking the last piece of avoid connection storms and finally figured it out. The benchmark connects a client to a 3 member replica set and performs a number of find operations with secondary read preference. After the test is done, my app gets the Pool for each secondary (via topology.select_servers(secondary_server_selector)) to report how many total connections were created. The issue is that sometimes this would return only one secondary instead of two.
So I added SDAM loggers to see why the server was being marked Unknown, and… nothing. The secondaries were always in a known state.
Finally I remembered localThresholdMS, which defaults to 15ms, and looked at the RTTs:
Sure enough, one secondary’s RTT is way outside the 15ms latency window, so it is excluded from server selection. But the question now is: why are the RTTs so high?! I assume it’s because the benchmark runs a ton of threads, which delays the Monitor thread from running in a timely manner. So the Monitor thread thinks it took 100ms to get a response from the server when the real RTT is more like 0.5ms.
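To make the exclusion concrete, here’s a toy sketch of the latency window rule from the server selection spec: after filtering by read preference, only servers whose average RTT is within localThresholdMS of the fastest eligible server remain candidates. This is an illustration, not PyMongo’s actual implementation; the function and host names are made up.

```python
def in_latency_window(rtts_ms, local_threshold_ms=15):
    """Return the servers eligible for selection under the latency window.

    rtts_ms: mapping of host -> average RTT in milliseconds.
    """
    fastest = min(rtts_ms.values())
    return {
        host: rtt
        for host, rtt in rtts_ms.items()
        if rtt - fastest <= local_threshold_ms
    }

# Two secondaries: one with a realistic RTT, one whose measured RTT was
# inflated by monitor-thread scheduling delay under load.
rtts = {"secondary1:27017": 0.5, "secondary2:27017": 100.0}
print(in_latency_window(rtts))  # only secondary1 survives the 15ms window
```

With the default 15ms window the slow-looking secondary is filtered out entirely, which is exactly the one-secondary-instead-of-two symptom above.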
So I have a few takeaways here:
- I need to override localThresholdMS to prevent this from skewing the benchmark results
- This could explain a lot of the flaky tests we see in Evergreen.
- Is there any way to improve this situation? Maybe 15ms is too low for a default localThresholdMS, given that under load the RTT measurement can vary widely.
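To illustrate the last point, here’s a rough sketch of how a single delayed heartbeat can skew the smoothed RTT estimate. The SDAM spec computes average RTT as an exponentially weighted moving average (new = 0.2 * sample + 0.8 * old); the sample values below are made up, and this is not PyMongo’s internal code.

```python
def update_rtt(avg_rtt_ms, sample_ms, weight=0.2):
    """One EWMA update of the average RTT, as in the SDAM spec."""
    return weight * sample_ms + (1 - weight) * avg_rtt_ms

avg = 0.5  # steady-state RTT in ms
# The last heartbeat's measured RTT includes ~100ms of thread-scheduling
# delay, even though the true network RTT is still ~0.5ms.
for sample in (0.5, 0.5, 100.0):
    avg = update_rtt(avg, sample)
print(round(avg, 2))  # 20.4
```

One bad sample pushes the average to ~20ms, past the default 15ms window relative to a sub-millisecond peer, so the server drops out of selection until enough clean heartbeats pull the average back down.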
Some of the test failures this might explain are: