Core Server / SERVER-53662

Investigate if slow DNS resolution warnings are a symptom of an internal slowdown

    • Type: Task
    • Resolution: Gone away
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Assigned Teams: Service Arch
    • Story Points: 5

      We have recently seen a couple of HELP tickets (HELP-20454 and HELP-20891) where the mongos logs indicate very slow DNS resolutions:

      2020-12-02T14:03:44.344+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-00-rvhot.mongodb.net:27017 took 1621ms
      2020-12-02T14:03:57.261+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-01-rvhot.mongodb.net:27017 took 2593ms
      2020-12-02T14:03:57.261+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-00-rvhot.mongodb.net:27017 took 2593ms
      2020-12-02T14:04:23.098+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-02-rvhot.mongodb.net:27017 took 5454ms
      2020-12-02T14:04:23.179+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-04-02-rvhot.mongodb.net:27017 took 5511ms
      2020-12-02T14:04:23.179+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-04-00-rvhot.mongodb.net:27017 took 5511ms
      2020-12-02T14:04:23.188+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-05-01-rvhot.mongodb.net:27017 took 5520ms

      Followed by:

      2020-12-02T18:25:06.443+0000 I  NETWORK  [TaskExecutorPool-0] Marking host baas-shard-03-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:06.443+0000 I  NETWORK  [conn2551822] Marking host baas-shard-05-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554900] Marking host baas-shard-04-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554900] Marking host baas-shard-06-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554900] Marking host baas-shard-02-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554457] Marking host baas-shard-01-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.844+0000 I  NETWORK  [conn2550821] Marking host baas-shard-03-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.852+0000 I  NETWORK  [TaskExecutorPool-0] Marking host baas-shard-05-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit

      In at least one of the tickets, these errors correlate exactly with a performance degradation in that mongos (https://jira.mongodb.org/browse/HELP-20454?focusedCommentId=3532624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-3532624), during which active sessions back up and accumulate.

      In both tickets we believe there was no real DNS infrastructure issue: the sysadmins reported none, and no other node in the cluster showed the DNS warnings. In HELP-20891 they even added static entries so the hostnames should be resolved locally, but the warnings continued.
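
      One way to double-check that assumption from the affected host, outside mongos, is to time a plain getaddrinfo() call against one of the hostnames from the warnings. The sketch below is only illustrative and is not taken from the tickets; the hostname and port come from the log excerpts above. If this consistently resolves in a few milliseconds while mongos keeps reporting multi-second resolutions, the delay is unlikely to be in the DNS infrastructure itself.

      // Illustrative only: time a direct getaddrinfo() lookup from the same host,
      // outside mongos, against a hostname taken from the warning messages above.
      #include <chrono>
      #include <cstdio>
      #include <netdb.h>
      #include <sys/socket.h>
      #include <sys/types.h>

      int main() {
          const char* host = "baas-shard-00-00-rvhot.mongodb.net";  // from the HELP ticket logs

          addrinfo hints{};
          hints.ai_family = AF_UNSPEC;
          hints.ai_socktype = SOCK_STREAM;

          addrinfo* result = nullptr;
          auto start = std::chrono::steady_clock::now();
          int rc = getaddrinfo(host, "27017", &hints, &result);
          auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::steady_clock::now() - start)
                        .count();

          if (rc != 0) {
              std::fprintf(stderr, "getaddrinfo failed after %lldms: %s\n",
                           static_cast<long long>(ms), gai_strerror(rc));
              return 1;
          }
          std::printf("resolved %s:27017 in %lldms\n", host, static_cast<long long>(ms));
          freeaddrinfo(result);
          return 0;
      }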

      We suspect that these "slow DNS resolutions" may actually be a symptom of something slowing down within the mongos itself.
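
      The sketch below is a minimal illustration of that hypothesis (it does not reproduce the actual mongos code): if the elapsed time in the warning is measured from when the connect attempt begins rather than from when the lookup itself runs, then any internal stall, such as resolve work waiting behind a backed-up executor, gets attributed to "DNS resolution" even when the lookup is fast.

      // Minimal, hypothetical illustration: a fast lookup measured behind an
      // internal stall is reported as a multi-second "DNS resolution".
      #include <chrono>
      #include <iostream>
      #include <thread>

      using namespace std::chrono;
      using namespace std::chrono_literals;

      int main() {
          // Timer starts when the connect attempt begins.
          auto start = steady_clock::now();

          // Hypothesized internal slowdown: the resolve work sits behind a busy
          // executor/thread for ~2 seconds before it actually runs.
          std::this_thread::sleep_for(2s);

          // The DNS lookup itself is fast (~10ms in this simulation).
          std::this_thread::sleep_for(10ms);

          auto elapsed = duration_cast<milliseconds>(steady_clock::now() - start);
          // Prints roughly 2010ms, mirroring the shape of the warnings above even
          // though the "lookup" only took ~10ms.
          std::cout << "DNS resolution took " << elapsed.count() << "ms\n";
          return 0;
      }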

      Both HELP tickets involve clusters running MongoDB 4.2.

      Acceptance Criteria: 

      Produce a consistent repro or a hypothesis for this issue and, if needed, file a SERVER ticket to fix it.

            Assignee:
            backlog-server-servicearch [DO NOT USE] Backlog - Service Architecture
            Reporter:
            jordi.serra-torrens@mongodb.com Jordi Serra Torrens
            Votes:
            0
            Watchers:
            18

              Created:
              Updated:
              Resolved: