[GODRIVER-1966] Strange delays with failCommand in v1.4.0+ Created: 14/Apr/21 Updated: 27/Oct/23 Resolved: 16/Apr/21 |
|
| Status: | Closed |
| Project: | Go Driver |
| Component/s: | API, Core API |
| Affects Version/s: | 1.4.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ethan Lie | Assignee: | Benji Rewis (Inactive) |
| Resolution: | Gone away | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Local testing on: |
||
| Attachments: |
|
| Description |
|
We see delays when we upgrade our mongo driver from `v1.3.2` to `v1.4.0`. Attached is the `main.go` used to generate the following output. It will connect a client, prepare the test space, then perform 3 read (`find`) queries and 3 write (`insert`) queries with and without the failCommand enabled. Below is a snippet of the `find` with failCommand enabled.
Output with `v1.3.2`: ``` == Testing Find queries failpoint ENABLED ==
``` First: ~500m Second/third: ~1s
Output with `v1.4.0`: ``` == Testing Find queries failpoint ENABLED ==
``` First: ~10s Second/third: ~20s
The query latency jumps x10. The time to clear the failpoint does as well.
The same pattern appears for the write query. Impact: As a result we cannot depend on our unit tests as all queries with failCommand have this delay. |
| Comments |
| Comment by Benji Rewis (Inactive) [ 16/Apr/21 ] |
|
No problem! Feel free to comment on this ticket with any further questions or concerns. |
| Comment by Ethan Lie [ 16/Apr/21 ] |
|
That all makes sense to me. Thanks for the timely responses and advice. I don't think we have any further questions or issues. |
| Comment by Benji Rewis (Inactive) [ 16/Apr/21 ] |
|
Happy to help eliesrs@gmail.com ! According to our specifications on server discovery and monitoring, “not master” and “node is recovering” errors mark the server as unknown. You can see a list of codes and names here. As for the doubling latency, one hypothesis is that the error code you’re using, 10107, is retryable for reads like Find. The Find query is sent to the server, 10107 is returned (although with no state change), the server is marked unknown, a heartbeat after 10 seconds discovers the Find has failed. But, the first Find query retries, causing a delay of another 10 seconds in the timing of the second Find query. In total, the second find query takes 20 seconds. The same thing happens with the third Find query due to a retry of the second Find query. You can see that latency does not double when using writes, like Insert, as you have in your code already. This is because retryable writes are not enabled on standalone configurations. You can also see that latency does not double when using a non-retryable error code, such as 202. So, if you still wish to avoid the doubling in latency in your test, using a non-retryable error code is probably a good idea. |
| Comment by Ethan Lie [ 15/Apr/21 ] |
|
Hey Benji, thanks for the quick response. I think I understand. I also I see that an error code that presumably doesn't mark the server "unknown" (I tested with 202, NetworkInterfaceExceededTimeLimit) doesn't trigger the delays. Between this and the heartbeat interval we should be unblocked, thanks. Is there some more info that we can provide to help track down the root cause of the doubling of the latency? EDIT: Also, where can we find a list of errors that mark the server this way? |
| Comment by Benji Rewis (Inactive) [ 15/Apr/21 ] |
|
Hello eliesrs@gmail.com ! Thanks for your report. We’ve reproduced the delay on 1.4.0+. In v1.4.0, the Go driver switched from monitoring servers by polling periodically to streaming responses from servers when a state change occurs (with It is less clear why the query delay doubles after the first find or write, but for now you should be able to use SetHeartbeatInterval on the client to lower the heartbeat interval to something like 500ms, which we do for testing. That should fix the delay on 1.4.0+. |
| Comment by Ethan Lie [ 14/Apr/21 ] |
|
These same delays are seen in v1.5.1 as well (latest). |