[GODRIVER-2787] Arbiter connection permanently broken after initiating the replica set Created: 30/Mar/23 Updated: 28/Oct/23 Resolved: 13/Jun/23

| Status: | Closed |
| Project: | Go Driver |
| Component/s: | Connections, Server Selection |
| Affects Version/s: | None |
| Fix Version/s: | 1.11.7 |
| Type: | Bug | Priority: | Unknown |
| Reporter: | Aleksander Polak | Assignee: | Matt Dale |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Documentation Changes: | Not Needed |
| Description |
Summary

In my scenario, I have an arbiter from which I collect metrics using the Percona MongoDB exporter (https://github.com/percona/mongodb_exporter). The exporter uses the Go driver to establish a direct connection to the arbiter and execute commands like getDiagnosticData and replSetGetStatus every 10 seconds. The exporter connects to the arbiter as soon as it starts and reuses the same connection throughout its entire life. When the mongod instance and the MongoDB exporter start at the same time (which is typically the case, as they reside in the same Kubernetes pod), the connection becomes permanently broken, and the Go driver returns a server selection error on every attempt to execute any command.
Restarting only the MongoDB exporter or disabling the global connection pool (via the --no-mongodb.global-conn-pool flag) fixes the problem. This is particularly easy to reproduce when the exporter connects to the soon-to-be arbiter instance before the replica set is initiated. In that case, the exporter is initially able to execute commands but starts failing once the replica set is initiated.

MongoDB driver version: v1.10.3 (the issue doesn't reproduce with v1.3.2, used by older versions of the MongoDB exporter)
MongoDB server version: v4.4 or later
Topology: replica set or sharded cluster + arbiter

How to Reproduce

Reproducibility: 100%

1. Create the following prometheus.yml file:
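A minimal prometheus.yml consistent with the details in this ticket (a 10-second scrape interval and the 9-second scrape timeout mentioned in the comments below) might look like this; the job name and the exporter port are assumptions:

```yaml
# Plausible sketch: scrape the arbiter's exporter every 10 seconds
# with the 9-second timeout discussed later in this ticket.
global:
  scrape_interval: 10s
  scrape_timeout: 9s

scrape_configs:
  - job_name: mongodb-arbiter          # job name is an assumption
    static_configs:
      - targets:
          - exporter-arbiter:9216      # 9216 is mongodb_exporter's default port
```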
2. Create the following docker-compose.yml file:
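A compose file consistent with the rest of the ticket (a mongod that later becomes an arbiter, an exporter-arbiter container with the global connection pool enabled, and VictoriaMetrics as the scraper) might look like the sketch below; image tags, service names other than exporter-arbiter, collector flags, and the replica set name are assumptions:

```yaml
# Plausible sketch of the compose file described above.
services:
  mongo-primary:
    image: mongo:4.4                              # server version from the ticket
    command: mongod --replSet rs0 --bind_ip_all   # replica set name is an assumption

  mongo-arbiter:
    image: mongo:4.4
    command: mongod --replSet rs0 --bind_ip_all

  exporter-arbiter:
    image: percona/mongodb_exporter:0.37.0        # exporter tag is an assumption
    command:
      - --mongodb.uri=mongodb://mongo-arbiter:27017/admin
      - --mongodb.global-conn-pool                # reuse one connection, as described above
      - --collector.diagnosticdata
      - --collector.replicasetstatus
    depends_on:
      - mongo-arbiter

  victoriametrics:
    image: victoriametrics/victoria-metrics:latest
    command:
      - -promscrape.config=/etc/prometheus/prometheus.yml
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    depends_on:
      - exporter-arbiter
```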
3. Start the containers in the background:
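Assuming the Docker Compose v2 CLI:

```sh
docker compose up -d
```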
4. Monitor the exporter-arbiter container logs for 30-40 seconds:
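Presumably along these lines (the same command applies to the later monitoring steps 6 and 8):

```sh
docker compose logs -f exporter-arbiter
```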
Note that the getDiagnosticData command result is emitted every 10 seconds. 5. Initiate the replica set:
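Assuming the service names from the compose sketch above and the mongo shell bundled with the 4.4 image:

```sh
docker compose exec mongo-primary mongo --eval 'rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo-primary:27017" },
    { _id: 1, host: "mongo-arbiter:27017", arbiterOnly: true }
  ]
})'
```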
6. Monitor the exporter-arbiter container logs for another 30-40 seconds.
Note that "server selection error" entries (ending in "context canceled" or "context deadline exceeded") are now emitted instead of the command results.
7. Restart the exporter-arbiter container:
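Again assuming Compose v2:

```sh
docker compose restart exporter-arbiter
```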
8. Monitor the exporter-arbiter container logs for another 30-40 seconds, using the same command as in step 4.
Note that the command results are emitted again. 9. Stop and remove the containers:
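And finally, assuming the same CLI:

```sh
docker compose down
```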
Additional Background

Use the --no-mongodb.global-conn-pool flag instead of the --mongodb.global-conn-pool flag in the example above to disable the global connection pool. The issue doesn't reproduce when a new connection is used each time.

I forked the v0.11.2 branch of the MongoDB exporter, which uses Go driver v1.3.2 and doesn't have this issue. After upgrading to Go driver v1.11.3, the issue started to appear.
| Comments |
| Comment by Aleksander Polak [ 13/Jun/23 ] |

matt.dale@mongodb.com Thanks! I can confirm the issue doesn't reproduce with Go driver v1.11.7. I checked it against an older v2.35.0 branch of the MongoDB exporter, which doesn't include some of the latest changes that I believe were introduced as a workaround for this issue.
| Comment by Matt Dale [ 12/Jun/23 ] |

aleksander.polak@tietoevry.com thanks for the additional info! You're correct that the topology change is related to the problem (specifically the topology version change, which is considered during error handling); the proposed fix addresses that error-handling path. My local testing seems to show that mongodb_exporter built from the latest main revision (9fb40e2) resolves the previous issue. Are you able to build the latest version of mongodb_exporter? If so, can you confirm whether it resolves your issue?
| Comment by Aleksander Polak [ 09/May/23 ] |

Note that the problem only seems to occur if you connect to the mongod instance before the replica set is initiated. If you connect after the instance assumes the arbiter role, operations like client.ListDatabaseNames() still fail with the "(NotPrimaryOrSecondary) node is not in primary or recovering state" error, but this doesn't cause any observable delays. I suspect the problem is linked to a change in the topology reported by the mongod instance (from standalone to replica set) during the lifetime of the connection.
| Comment by Aleksander Polak [ 08/May/23 ] |

Hi matt.dale@mongodb.com, thank you for the update! Based on the information you provided, I can give you some more details.

The MongoDB exporter seems to have a bug: it tries to parse the X-Prometheus-Scrape-Timeout-Seconds header value as an integer, but VictoriaMetrics passes a float instead. Therefore, the timeout in this scenario is always 10s, regardless of the scrape timeout, which was set to 9s. This explains why we see "context canceled": VictoriaMetrics disconnects from the MongoDB exporter before the timeout expires, causing the server to cancel the request. If the scrape timeout is extended to 20s, the error changes to "context deadline exceeded". Either way, the "server selection error" is directly caused by the request context being canceled or its deadline being exceeded, not by the server selection timeout managed by the driver, which is 30s by default.

That said, there is certainly a problem with the driver. To isolate it, I wrote a program that makes the same sequence of driver calls as the MongoDB exporter.
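A minimal sketch of such a program, assuming the arbiter listens on localhost:27017 and a 2-second pause between iterations (both assumptions), could look like this; it makes a direct connection and then loops over the same calls the exporter issues (ListDatabaseNames, getDiagnosticData, replSetGetStatus), with log's default timestamp prefix making per-iteration slowdowns visible:

```go
package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Direct connection to the (soon-to-be) arbiter, mirroring the
	// exporter's setup. The URI is an assumption; adjust as needed.
	client, err := mongo.Connect(context.Background(),
		options.Client().ApplyURI("mongodb://localhost:27017").SetDirect(true))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(context.Background())

	admin := client.Database("admin")
	for {
		start := time.Now()

		// The same sequence of calls the exporter makes on every scrape.
		if _, err := client.ListDatabaseNames(context.Background(), bson.D{}); err != nil {
			log.Printf("ListDatabaseNames: %v", err)
		}
		var result bson.M
		if err := admin.RunCommand(context.Background(),
			bson.D{{Key: "getDiagnosticData", Value: 1}}).Decode(&result); err != nil {
			log.Printf("getDiagnosticData: %v", err)
		}
		if err := admin.RunCommand(context.Background(),
			bson.D{{Key: "replSetGetStatus", Value: 1}}).Decode(&result); err != nil {
			log.Printf("replSetGetStatus: %v", err)
		}

		log.Printf("iteration took %s", time.Since(start))
		time.Sleep(2 * time.Second)
	}
}
```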
When I use Go driver v1.3.7, after initiating the replica set the operations that cannot succeed on an arbiter fail with the "(NotPrimaryOrSecondary) node is not in primary or recovering state" error, but a single iteration still completes in about 2 seconds. When I use Go driver v1.4.0 or later (all the way up to v1.11.6), the duration of a single iteration goes up from 2 seconds to almost a minute. In particular, the operations that trigger the "(NotPrimaryOrSecondary)" error become significantly slower (6-20s), and the operations immediately following them also take longer (10s). Re-connecting fixes the problem.
| Comment by Matt Dale [ 03/May/23 ] |

Hey aleksander.polak@tietoevry.com, I've been able to reproduce this and have a theory about the root cause of the issue. One of the major changes to the Go Driver since v1.3.x is what it does when it encounters specific errors. In this case, mongodb_exporter tries to list all non-system collections, which is not a valid operation on an arbiter, so the server always returns an error like "(NotPrimaryOrSecondary) node is not in primary or recovering state".

The Go Driver should just pass that error back to the caller, but due to what seems to be a bug in the new error handling logic, the Go Driver interprets that error as an indication that the database node should not be queried right now. The Go Driver should recover from that, but another bug prevents an immediate database status check, delaying it for 10 seconds and causing a persistent timeout.

TL;DR: The problem is a combination of mongodb_exporter running operations that will never succeed on an arbiter and bugs in the Go Driver error handling logic that cause it to delay operations to the arbiter and then not recover as quickly as it should. I will continue investigating and see if I can validate my bug theories.
| Comment by Matt Dale [ 10/Apr/23 ] |

Hey aleksander.polak@tietoevry.com, thanks for the ticket and all the details! We're looking into it and will get back to you as soon as we have some information or questions.