Details
Task
Resolution: Gone away
Major - P3
Service Arch
5
Description
We have recently seen a couple of HELP tickets (HELP-20454 and HELP-20891) where the mongos logs indicate very slow DNS resolutions:
2020-12-02T14:03:44.344+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-00-rvhot.mongodb.net:27017 took 1621ms
2020-12-02T14:03:57.261+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-01-rvhot.mongodb.net:27017 took 2593ms
2020-12-02T14:03:57.261+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-00-rvhot.mongodb.net:27017 took 2593ms
2020-12-02T14:04:23.098+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-02-rvhot.mongodb.net:27017 took 5454ms
2020-12-02T14:04:23.179+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-04-02-rvhot.mongodb.net:27017 took 5511ms
2020-12-02T14:04:23.179+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-04-00-rvhot.mongodb.net:27017 took 5511ms
2020-12-02T14:04:23.188+0000 W NETWORK [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-05-01-rvhot.mongodb.net:27017 took 5520ms
Followed by:
2020-12-02T18:25:06.443+0000 I NETWORK [TaskExecutorPool-0] Marking host baas-shard-03-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:06.443+0000 I NETWORK [conn2551822] Marking host baas-shard-05-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I NETWORK [conn2554900] Marking host baas-shard-04-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I NETWORK [conn2554900] Marking host baas-shard-06-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I NETWORK [conn2554900] Marking host baas-shard-02-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I NETWORK [conn2554457] Marking host baas-shard-01-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.844+0000 I NETWORK [conn2550821] Marking host baas-shard-03-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.852+0000 I NETWORK [TaskExecutorPool-0] Marking host baas-shard-05-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
In at least one of the tickets, these errors coincide exactly with a performance degradation in that mongos, where active sessions back up and accumulate (https://jira.mongodb.org/browse/HELP-20454?focusedCommentId=3532624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-3532624).
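To line the two kinds of log messages up against the degradation window, it may help to bucket them by minute. The following is only a sketch: the regular expressions are written against the log lines quoted in this ticket, and the function name `summarize` and the per-minute bucketing are assumptions made for illustration, not an existing tool.

```python
import re
from collections import defaultdict

# Patterns modelled on the mongos 4.2 log lines quoted in this ticket.
# \s+ is used between fields because spacing may differ in the raw log.
DNS_RE = re.compile(
    r"^(?P<ts>\S+)\s+W\s+NETWORK\s+\[(?P<ctx>[^\]]+)\]\s+"
    r"DNS resolution while connecting to (?P<host>\S+) took (?P<ms>\d+)ms"
)
FAIL_RE = re.compile(
    r"^(?P<ts>\S+)\s+I\s+NETWORK\s+\[(?P<ctx>[^\]]+)\]\s+"
    r"Marking host (?P<host>\S+) as failed"
)


def summarize(lines):
    """Count slow-DNS warnings and host failures per minute.

    Returns {minute: {"dns_count", "dns_max_ms", "failed"}}, where minute
    is the first 16 characters of the timestamp, e.g. "2020-12-02T14:03".
    """
    buckets = defaultdict(lambda: {"dns_count": 0, "dns_max_ms": 0, "failed": 0})
    for line in lines:
        m = DNS_RE.match(line)
        if m:
            bucket = buckets[m.group("ts")[:16]]
            bucket["dns_count"] += 1
            bucket["dns_max_ms"] = max(bucket["dns_max_ms"], int(m.group("ms")))
            continue
        m = FAIL_RE.match(line)
        if m:
            buckets[m.group("ts")[:16]]["failed"] += 1
    return dict(buckets)
```

Passing an open log file (for example `summarize(open(path))`, with `path` pointing at a mongos log) yields one entry per minute that can be compared with the session backlog seen in the diagnostic data.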
In both tickets, we believe there was no real DNS infrastructure issue: the sys admins reported none, and no other node in the cluster showed the DNS warnings. In HELP-20891 they even added static entries so that the names should resolve locally, but the warnings continue.
We suspect that these "slow DNS resolutions" may actually be a symptom of something slowing down within the mongos itself.
Both HELP tickets involve clusters running 4.2.
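One way to test the hypothesis is to time name resolution from a separate process on the same host while the warnings are occurring. The sketch below assumes Python is available on the mongos host; the helper name `time_resolution` is hypothetical, and it measures only the operating system resolver via `socket.getaddrinfo`, which may not match how mongos performs or times its own lookups.

```python
import socket
import time


def time_resolution(host, port=27017, attempts=5):
    """Time repeated getaddrinfo calls for one host.

    Returns a list of (elapsed_ms, error) tuples; error is None when the
    lookup succeeded and the resolver's message when it failed.
    """
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        error = None
        try:
            socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        except socket.gaierror as exc:
            error = str(exc)
        elapsed_ms = (time.monotonic() - start) * 1000.0
        results.append((elapsed_ms, error))
    return results
```

If lookups for the shard hostnames stay fast here while mongos keeps logging multi-second resolutions, that would be consistent with a delay inside mongos rather than in the resolver, although it would not prove it.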
Acceptance Criteria:
Produce a consistent repro or a hypothesis for this issue and, if needed, file a SERVER ticket to fix it.