While working on PYTHON-2486 I ran into an issue with the default localThresholdMS, which I described to the team in Slack:
I’m seeing some strange behavior while benchmarking the last piece of avoid connection storms and finally figured it out. The benchmark connects a client to a 3 member replica set and performs a number of find operations with secondary read preference. After the test is done, my app gets the Pool for each secondary (via topology.select_servers(secondary_server_selector)) to report how many total connections were created. The issue is that sometimes this would return only one secondary instead of two.
So I added SDAM loggers to see why the server was being marked Unknown, and… nothing. The secondaries were always in a known state.
Finally I remembered localThresholdMS, which defaults to 15ms, and looked at the RTTs:
Sure enough, one secondary’s RTT is way outside the 15ms latency window, so it is excluded from server selection. But the question now is: why are the RTTs so high?! I assume it’s because the benchmark runs a ton of threads, which delays the Monitor thread from running in a timely manner. So the Monitor thread thinks it took 100ms to get a response from the server when the real RTT is more like 0.5ms.
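To make the exclusion concrete, here’s a toy sketch of the latency window rule from the server selection spec: after filtering by read preference, only servers whose average RTT is within localThresholdMS of the fastest eligible server remain candidates. This is an illustration, not PyMongo’s actual implementation; the function and host names are made up.

```python
def in_latency_window(rtts_ms, local_threshold_ms=15):
    """Return the servers eligible for selection under the latency window.

    rtts_ms: mapping of host -> average RTT in milliseconds.
    """
    fastest = min(rtts_ms.values())
    return {
        host: rtt
        for host, rtt in rtts_ms.items()
        if rtt - fastest <= local_threshold_ms
    }

# Two secondaries: one with a realistic RTT, one whose measured RTT was
# inflated by monitor-thread scheduling delay under load.
rtts = {"secondary1:27017": 0.5, "secondary2:27017": 100.0}
print(in_latency_window(rtts))  # only secondary1 survives the 15ms window
```

With the default 15ms window the slow-looking secondary is filtered out entirely, which is exactly the one-secondary-instead-of-two symptom above.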
So I have a few takeaways here:
- I need to override localThresholdMS to prevent this from skewing the benchmark results
- This could explain a lot of the flaky tests we see in Evergreen.
- Is there any way to improve this situation? Maybe 15ms is too low for a default localThresholdMS, given that under load the RTT measurement can vary widely.
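To illustrate the last point, here’s a rough sketch of how a single delayed heartbeat can skew the smoothed RTT estimate. The SDAM spec computes average RTT as an exponentially weighted moving average (new = 0.2 * sample + 0.8 * old); the sample values below are made up, and this is not PyMongo’s internal code.

```python
def update_rtt(avg_rtt_ms, sample_ms, weight=0.2):
    """One EWMA update of the average RTT, as in the SDAM spec."""
    return weight * sample_ms + (1 - weight) * avg_rtt_ms

avg = 0.5  # steady-state RTT in ms
# The last heartbeat's measured RTT includes ~100ms of thread-scheduling
# delay, even though the true network RTT is still ~0.5ms.
for sample in (0.5, 0.5, 100.0):
    avg = update_rtt(avg, sample)
print(round(avg, 2))  # 20.4
```

One bad sample pushes the average to ~20ms, past the default 15ms window relative to a sub-millisecond peer, so the server drops out of selection until enough clean heartbeats pull the average back down.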
Some of the test failures this might explain are: