[CDRIVER-3793] Client pool retryable reads performance regression on evergreen Created: 08/Sep/20 Updated: 27/Oct/23 Resolved: 10/Sep/20 |
|
| Status: | Closed |
| Project: | C Driver |
| Component/s: | None |
| Affects Version/s: | 1.17.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Patrick Freed | Assignee: | Kevin Albertson |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Description |
|
When encountering a retryable error on a client checked out from a pool, completing the retry can take much longer on 1.17 than on previous versions. Interestingly, this has only been observed on evergreen hosts, not on any of my local machines. Attached is a repro that times how long a single retry takes, pooled vs. non-pooled. Output on 1.17.0:
Output on 1.16.3:
|
| Comments |
| Comment by Kevin Albertson [ 10/Sep/20 ] | |
|
Requesting an immediate scan while an awaitable ismaster command is in progress is ignored.
The expectation is that if the server returns an error indicating a state change (e.g. shutting down, or transitioning from primary to secondary), it will increment the topologyVersion and necessarily respond to any in-progress awaitable ismaster commands with the new state. The error injected via the failpoint, however, does not simulate the server actually changing state. | |
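The freshness rule behind topologyVersion can be sketched as follows. This is a simplified reading of the SDAM specification's comparison rule, not libmongoc's actual implementation; all names here are illustrative. Per the spec, topologyVersions from different processIds are incomparable (the incoming one is applied), while versions from the same processId compare by counter.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of an SDAM topologyVersion, which pairs a
 * 12-byte processId (an ObjectId) with a monotonically increasing
 * counter that the server bumps on each state change. */
typedef struct {
   uint8_t process_id[12];
   int64_t counter;
} topology_version_t;

/* Returns true if "incoming" should be applied over "current".
 * Different processIds are incomparable, so the incoming version is
 * treated as fresher; equal processIds compare counters. */
static bool
topology_version_is_fresher (const topology_version_t *current,
                             const topology_version_t *incoming)
{
   if (memcmp (current->process_id, incoming->process_id, 12) != 0) {
      return true; /* incomparable: apply the incoming version */
   }
   return incoming->counter >= current->counter;
}
```

This is why a genuine state change unsticks the awaitable ismaster: the server responds with an incremented counter, which compares as fresher and updates the topology immediately.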
| Comment by Patrick Freed [ 10/Sep/20 ] | |
|
I realize my personal machines were all running 4.2.x, which is why I wasn't able to reproduce this locally but could on evergreen. I can confirm the slowdown occurs on my local machine when running against 4.4. I can also confirm that setting heartbeatFrequencyMS to 500 does speed up the pooled test. Shouldn't that be unnecessary, though, since once server selection fails an immediate topology check is requested per the pseudocode in https://github.com/mongodb/specifications/blob/master/source/server-selection/server-selection.rst#single-threaded-server-selection-implementation? I'm not really familiar with streamable ismaster, so an "immediate topology check" may mean something different in the streamable version. | |
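The server-selection loop referenced above can be sketched roughly as below. This is a hedged simulation of the spec's single-threaded pseudocode, not libmongoc code: `scan_fn_t`, `select_server_single_threaded`, and the simulated elapsed-time accounting are all invented for illustration. The key point is that after a failed attempt the loop re-scans after only minHeartbeatFrequencyMS (500 ms), rather than waiting a full heartbeatFrequencyMS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a topology scan: returns true once a
 * suitable server has been discovered. */
typedef bool (*scan_fn_t) (void *ctx);

#define MIN_HEARTBEAT_FREQUENCY_MS 500

/* Simulated single-threaded selection loop: on failure, request an
 * immediate scan, but respect minHeartbeatFrequencyMS between checks
 * of the same server. Elapsed time is simulated, not measured. */
static bool
select_server_single_threaded (scan_fn_t scan,
                               void *ctx,
                               int64_t timeout_ms,
                               int64_t *elapsed_ms_out)
{
   int64_t elapsed_ms = 0;

   while (elapsed_ms <= timeout_ms) {
      if (scan (ctx)) {
         *elapsed_ms_out = elapsed_ms;
         return true;
      }
      /* Failed attempt: wait only the minimum interval, then retry. */
      elapsed_ms += MIN_HEARTBEAT_FREQUENCY_MS;
   }
   *elapsed_ms_out = elapsed_ms;
   return false;
}
```

Under this model a retry should complete within roughly minHeartbeatFrequencyMS of the server being marked Unknown, which is why waiting a full 10-second heartbeatFrequencyMS looks surprising at first.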
| Comment by Kevin Albertson [ 10/Sep/20 ] | |
|
Thanks for the repro! This only occurs on 4.4+ when the error code is a state change error (i.e. one that causes the server to be marked Unknown). It is a side effect of streamable ismaster and is actually expected. patrick.freed, can you confirm that setting heartbeatFrequencyMS=500 speeds up the pooled test for you? It did for me, and other drivers set the same value for retryable tests. Details: the test sets a failpoint so that a find returns an InterruptedAtShutdown error, and expects the operation to succeed after retrying. The test takes a full heartbeatFrequencyMS (10 seconds) to rediscover the server after it is marked Unknown, which is expected behavior. After initial discovery, the server monitor starts an awaitable ismaster command. The failpoint causes the find command to return:
That marks the server Unknown (per the handleStateChangeError logic in SDAM), but it does not cancel the in-progress awaitable ismaster command (because it is not a network error). The server does not increment the topology version (though that is being considered in |
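The handleStateChangeError behavior discussed above can be sketched as follows. This is an illustrative model of the SDAM rule, not libmongoc internals; the type and function names are invented for the sketch. The asymmetry it shows is the whole bug report: a command error like InterruptedAtShutdown marks the server Unknown but leaves the awaitable ismaster stream running, so rediscovery waits up to heartbeatFrequencyMS unless the server itself reports a new topologyVersion.

```c
#include <stdbool.h>

/* Minimal model of a monitored server's state (illustrative only). */
typedef enum {
   SERVER_TYPE_RS_PRIMARY,
   SERVER_TYPE_UNKNOWN
} server_type_t;

typedef struct {
   server_type_t type;
   bool awaitable_hello_in_progress;
} server_monitor_t;

/* Sketch of handleStateChangeError: the server is always marked
 * Unknown, but only a network error tears down the in-progress
 * awaitable ismaster/hello stream. A command error (e.g.
 * InterruptedAtShutdown from a failpoint) leaves it running. */
static void
handle_state_change_error (server_monitor_t *monitor, bool is_network_error)
{
   monitor->type = SERVER_TYPE_UNKNOWN;

   if (is_network_error) {
      monitor->awaitable_hello_in_progress = false;
   }
}
```

Because the failpoint's error is a command error, the stream survives, and since the failpoint does not actually change server state, no fresher topologyVersion arrives to end the wait early.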