[GODRIVER-1966] Strange delays with failCommand in v1.4.0+ Created: 14/Apr/21  Updated: 27/Oct/23  Resolved: 16/Apr/21

Status: Closed
Project: Go Driver
Component/s: API, Core API
Affects Version/s: 1.4.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ethan Lie Assignee: Benji Rewis (Inactive)
Resolution: Gone away Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Local testing on:
macOS 10.15.7
go1.16.3 darwin/amd64
tested against mongo docker image https://github.com/docker-library/mongo/blob/9db9e3d4704f5d963e424a3894fa740b8ce4ea70/4.4/Dockerfile


Attachments: File main.go    

 Description   

We see delays after upgrading our MongoDB Go driver from `v1.3.2` to `v1.4.0`.

Attached is the `main.go` used to generate the following output.

It connects a client, prepares the test space, then performs 3 read (`find`) queries and 3 write (`insert`) queries, each with and without the failCommand failpoint enabled. Below are snippets of the `find` output with the failpoint enabled.
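The attached `main.go` is not inlined here, but the failpoint it toggles is presumably MongoDB's documented `failCommand` fail point, configured roughly like this (a sketch run against the test server, e.g. in the mongo shell; the exact `mode`/`failCommands` values in `main.go` may differ):

```
// Enable the failCommand fail point so the next 3 matching commands
// fail with error code 10107 (NotWritablePrimary).
db.adminCommand({
  configureFailPoint: "failCommand",
  mode: { times: 3 },
  data: {
    errorCode: 10107,               // NotWritablePrimary
    failCommands: ["find", "insert"]
  }
});

// Clearing the fail point afterwards:
db.adminCommand({ configureFailPoint: "failCommand", mode: "off" });
```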

 

Output with `v1.3.2`:

```

== Testing Find queries failpoint ENABLED ==

  • Time elapsed to do: connect client: 38.833µs
  • Time elapsed to do: clear failpoint: 7.497043ms
  • Time elapsed to do: populate table: 995.285µs
  • Time elapsed to do: set failpoint: 890.543µs
  • Time elapsed to do: find query 0: 502.632805ms
    Encountered expected error: (mongo.CommandError): (NotWritablePrimary) Failing command due to 'failCommand' failpoint
  • Time elapsed to do: find query 1: 998.847787ms
    Encountered expected error: (mongo.CommandError): (NotWritablePrimary) Failing command due to 'failCommand' failpoint
  • Time elapsed to do: find query 2: 999.679731ms
    Encountered expected error: (mongo.CommandError): (NotWritablePrimary) Failing command due to 'failCommand' failpoint
  • Time elapsed to do: clear failpoint: 500.759202ms
  • Time elapsed to do: clear failpoint: 1.258652ms
  • Time elapsed to do: clean table post test: 1.381183ms
  • Time elapsed to do: disconnect client: 1.103236ms

```

First: ~500ms

Second/third: ~1s

 

Output with `v1.4.0`:

```

== Testing Find queries failpoint ENABLED ==

  • Time elapsed to do: connect client: 41.041µs
  • Time elapsed to do: clear failpoint: 5.948902ms
  • Time elapsed to do: populate table: 946.695µs
  • Time elapsed to do: set failpoint: 849.93µs
  • Time elapsed to do: find query 0: 9.996635035s
    Encountered expected error: (mongo.CommandError): (NotWritablePrimary) Failing command due to 'failCommand' failpoint
  • Time elapsed to do: find query 1: 20.015427043s
    Encountered expected error: (mongo.CommandError): (NotWritablePrimary) Failing command due to 'failCommand' failpoint
  • Time elapsed to do: find query 2: 19.987305077s
    Encountered expected error: (mongo.CommandError): (NotWritablePrimary) Failing command due to 'failCommand' failpoint
  • Time elapsed to do: clear failpoint: 10.022557671s
  • Time elapsed to do: clear failpoint: 1.460051ms
  • Time elapsed to do: clean table post test: 1.456988ms
  • Time elapsed to do: disconnect client: 1.322098ms

```

First: ~10s

Second/third: ~20s

 

The query latency jumps by roughly 20x (~500ms/1s to ~10s/20s). The time for the first clear of the failpoint jumps by a similar factor.

  1. Why does this happen?
  2. Why does it appear to double for every query after the first?

The same pattern appears for the write query.

Impact: We cannot rely on our unit tests, since every query issued while a failCommand failpoint is set incurs this delay.



 Comments   
Comment by Benji Rewis (Inactive) [ 16/Apr/21 ]

No problem! Feel free to comment on this ticket with any further questions or concerns.

Comment by Ethan Lie [ 16/Apr/21 ]

That all makes sense to me.

Thanks for the timely responses and advice. I don't think we have any further questions or issues.

Comment by Benji Rewis (Inactive) [ 16/Apr/21 ]

Happy to help, eliesrs@gmail.com!

According to our specifications on server discovery and monitoring, “not master” and “node is recovering” errors mark the server as unknown. You can see a list of codes and names here.

As for the doubling latency, one hypothesis is that the error code you’re using, 10107, is retryable for reads like Find. The first Find is sent to the server, 10107 is returned (with no actual state change), and the server is marked Unknown; a heartbeat roughly 10 seconds later rediscovers the server, at which point the failed Find is retried. That retry adds another ~10 seconds before the second Find query’s timing even begins, so the second Find takes about 20 seconds in total. The same thing happens to the third Find query due to a retry of the second.

You can see that latency does not double when using writes, like Insert, as you have in your code already. This is because retryable writes are not enabled on standalone configurations. You can also see that latency does not double when using a non-retryable error code, such as 202.

So, if you still wish to avoid the doubling in latency in your test, using a non-retryable error code is probably a good idea.
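Concretely, the only change needed in the failpoint document is the error code; for example (a sketch, assuming the same `failCommand` setup as described above):

```
// Same failCommand fail point, but with a non-retryable error code
// (202, NetworkInterfaceExceededTimeLimit) instead of 10107, so the
// driver does not retry the failed read and latency does not double:
db.adminCommand({
  configureFailPoint: "failCommand",
  mode: { times: 3 },
  data: {
    errorCode: 202,                 // NetworkInterfaceExceededTimeLimit
    failCommands: ["find"]
  }
});
```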

Comment by Ethan Lie [ 15/Apr/21 ]

Hey Benji, thanks for the quick response.

I think I understand. I also see that an error code that presumably doesn't mark the server "unknown" (I tested with 202, NetworkInterfaceExceededTimeLimit) doesn't trigger the delays. Between this and the heartbeat interval we should be unblocked, thanks.

Is there some more info that we can provide to help track down the root cause of the doubling of the latency?

EDIT: Also, where can we find a list of errors that mark the server this way?

Comment by Benji Rewis (Inactive) [ 15/Apr/21 ]

Hello eliesrs@gmail.com! Thanks for your report. We’ve reproduced the delay on 1.4.0+.

In v1.4.0, the Go driver switched from monitoring servers by polling periodically to streaming responses from servers when a state change occurs (with GODRIVER-1489). Failpoints do not cause an actual server state change; they only simulate one and raise an error. In this case, that error (NotWritablePrimary) marks the server as “unknown”, and as of 1.4.0, monitoring won’t get an updated response from the server until a heartbeat is sent after the default 10-second HeartbeatInterval. This is why your initial queries take 10 seconds to fail.

It is less clear why the query delay doubles after the first find or write, but for now you should be able to use SetHeartbeatInterval on the client to lower the heartbeat interval to something like 500ms, which we do for testing. That should fix the delay on 1.4.0+.
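The suggested workaround can be sketched as follows (client configuration only; the URI is a placeholder and the 500ms value is the example given above):

```go
package main

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Lower the heartbeat interval from the 10s default to 500ms so
	// that a server marked Unknown by a failpoint error is
	// re-discovered quickly. Suitable for test code, as noted above.
	opts := options.Client().
		ApplyURI("mongodb://localhost:27017"). // placeholder URI
		SetHeartbeatInterval(500 * time.Millisecond)

	client, err := mongo.Connect(context.Background(), opts)
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(context.Background())
}
```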

Comment by Ethan Lie [ 14/Apr/21 ]

These same delays are seen in v1.5.1 as well (latest).

Generated at Thu Feb 08 08:37:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.