- Type: Task
- Resolution: Gone away
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Service Arch
- 5
We have recently seen a couple of HELP tickets (HELP-20454 and HELP-20891) where the mongos logs indicate very slow DNS resolutions:
2020-12-02T14:03:44.344+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-00-rvhot.mongodb.net:27017 took 1621ms
2020-12-02T14:03:57.261+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-01-rvhot.mongodb.net:27017 took 2593ms
2020-12-02T14:03:57.261+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-00-rvhot.mongodb.net:27017 took 2593ms
2020-12-02T14:04:23.098+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-02-rvhot.mongodb.net:27017 took 5454ms
2020-12-02T14:04:23.179+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-04-02-rvhot.mongodb.net:27017 took 5511ms
2020-12-02T14:04:23.179+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-04-00-rvhot.mongodb.net:27017 took 5511ms
2020-12-02T14:04:23.188+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-05-01-rvhot.mongodb.net:27017 took 5520ms
Followed by:
2020-12-02T18:25:06.443+0000 I NETWORK [TaskExecutorPool-0] Marking host baas-shard-03-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:06.443+0000 I NETWORK [conn2551822] Marking host baas-shard-05-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I NETWORK [conn2554900] Marking host baas-shard-04-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I NETWORK [conn2554900] Marking host baas-shard-06-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I NETWORK [conn2554900] Marking host baas-shard-02-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I NETWORK [conn2554457] Marking host baas-shard-01-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.844+0000 I NETWORK [conn2550821] Marking host baas-shard-03-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.852+0000 I NETWORK [TaskExecutorPool-0] Marking host baas-shard-05-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
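To line up the two kinds of messages quoted above across a full mongos log, a small parser may help. The following is only a sketch: the function names `parse_mongos_log` and `slowest_per_host` are hypothetical, and the regular expressions assume the plain-text log format exactly as quoted in this ticket (they have not been checked against other log formats or versions).

```python
import re

# Timestamp shape as it appears in the quoted log lines, e.g. 2020-12-02T14:03:44.344+0000
_TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}"

# Assumption: message wording matches the excerpts quoted in this ticket.
DNS_RE = re.compile(
    r"(?P<ts>" + _TS + r")\s+W\s+NETWORK\s+\[(?P<ctx>[^\]]+)\]\s+"
    r"DNS resolution while connecting to (?P<host>\S+) took (?P<ms>\d+)ms"
)
FAILED_RE = re.compile(
    r"(?P<ts>" + _TS + r")\s+I\s+NETWORK\s+\[(?P<ctx>[^\]]+)\]\s+"
    r"Marking host (?P<host>\S+) as failed :: caused by :: (?P<reason>\w+)"
)


def parse_mongos_log(text):
    """Return (dns_events, failed_events) found anywhere in the given log text.

    dns_events:    list of (timestamp, host, milliseconds)
    failed_events: list of (timestamp, host, reason)
    """
    dns = [(m["ts"], m["host"], int(m["ms"])) for m in DNS_RE.finditer(text)]
    failed = [(m["ts"], m["host"], m["reason"]) for m in FAILED_RE.finditer(text)]
    return dns, failed


def slowest_per_host(dns_events):
    """Map each host to the longest DNS resolution time (ms) reported for it."""
    worst = {}
    for _ts, host, ms in dns_events:
        worst[host] = max(ms, worst.get(host, 0))
    return worst
```

Because the patterns are applied with `finditer`, the parser also works on text where several log entries have been pasted onto one line.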
In at least one of the tickets, these errors correlate exactly with a performance degradation in that mongos (https://jira.mongodb.org/browse/HELP-20454?focusedCommentId=3532624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-3532624) where active sessions get backed up and accumulate.
In both tickets, we believe that there was no real DNS infrastructure issue: the sys admins reported none, and no other node in the cluster showed the DNS warnings. In HELP-20891 they even added static entries so that the names should be resolved locally, but the warnings continue.
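One way to cross-check this from the affected host is to time name resolution outside of mongos and compare it with the durations in the warnings. The sketch below is an assumption-laden illustration, not a description of how mongos resolves names: `time_resolution` is a hypothetical helper, it measures the OS resolver as seen from a Python process, and the `resolver` parameter exists only so the timing logic can be exercised without network access.

```python
import socket
import time


def time_resolution(host, port=27017, resolver=socket.getaddrinfo):
    """Resolve host:port once and return (succeeded, elapsed_milliseconds)."""
    start = time.monotonic()
    try:
        resolver(host, port)
        ok = True
    except OSError:
        # socket.gaierror (resolution failure) is a subclass of OSError.
        ok = False
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return ok, elapsed_ms


if __name__ == "__main__":
    # Host name taken from the log excerpts in this ticket.
    ok, ms = time_resolution("baas-shard-00-00-rvhot.mongodb.net")
    print(f"resolved={ok} took={ms:.1f}ms")
```

If resolution measured this way stays fast while the mongos keeps logging multi-second DNS warnings, that would be consistent with the delay originating inside the mongos process rather than in DNS, though it would not prove it.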
We suspect that these "slow DNS resolutions" may actually be a symptom of something getting slowed down within the mongos itself.
Both HELP tickets involve clusters running 4.2.
Acceptance Criteria:
Produce a consistent repro or hypothesis for this issue and file a SERVER ticket to fix it, if needed.