Core Server / SERVER-53662

Investigate if slow DNS resolution warnings are a symptom of an internal slowdown

    • Type: Task
    • Resolution: Gone away
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Assigned Teams: Service Arch
    • Story Points: 5

      We have recently seen a couple of HELP tickets (HELP-20454 and HELP-20891) where the mongos logs indicate very slow DNS resolutions:

      2020-12-02T14:03:44.344+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-00-rvhot.mongodb.net:27017 took 1621ms
      2020-12-02T14:03:57.261+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-01-rvhot.mongodb.net:27017 took 2593ms
      2020-12-02T14:03:57.261+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-00-rvhot.mongodb.net:27017 took 2593ms
      2020-12-02T14:04:23.098+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-02-rvhot.mongodb.net:27017 took 5454ms
      2020-12-02T14:04:23.179+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-04-02-rvhot.mongodb.net:27017 took 5511ms
      2020-12-02T14:04:23.179+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-04-00-rvhot.mongodb.net:27017 took 5511ms
      2020-12-02T14:04:23.188+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-05-01-rvhot.mongodb.net:27017 took 5520ms

      Followed by:

      2020-12-02T18:25:06.443+0000 I  NETWORK  [TaskExecutorPool-0] Marking host baas-shard-03-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:06.443+0000 I  NETWORK  [conn2551822] Marking host baas-shard-05-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554900] Marking host baas-shard-04-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554900] Marking host baas-shard-06-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554900] Marking host baas-shard-02-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554457] Marking host baas-shard-01-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.844+0000 I  NETWORK  [conn2550821] Marking host baas-shard-03-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
      2020-12-02T18:25:42.852+0000 I  NETWORK  [TaskExecutorPool-0] Marking host baas-shard-05-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit

      In at least one of the tickets, these errors correlate exactly with a performance degradation in that mongos (https://jira.mongodb.org/browse/HELP-20454?focusedCommentId=3532624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-3532624), during which active sessions back up and accumulate.

      In both tickets we believe there was no real DNS infrastructure issue: the sysadmins reported none, and no other node in the cluster showed the DNS warnings. In HELP-20891 they even added static entries so the hostnames should be resolved locally, but the warnings continued.
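
      One way to double-check that assumption from the affected host, outside mongos, is to time a plain getaddrinfo() call against one of the hostnames from the warnings. The sketch below is only illustrative and is not taken from the tickets; the hostname and port come from the log excerpts above. If this consistently resolves in a few milliseconds while mongos keeps reporting multi-second resolutions, the delay is unlikely to be in the DNS infrastructure itself.

      // Illustrative only: time a direct getaddrinfo() lookup from the same host,
      // outside mongos, against a hostname taken from the warning messages above.
      #include <chrono>
      #include <cstdio>
      #include <netdb.h>
      #include <sys/socket.h>
      #include <sys/types.h>

      int main() {
          const char* host = "baas-shard-00-00-rvhot.mongodb.net";  // from the HELP ticket logs

          addrinfo hints{};
          hints.ai_family = AF_UNSPEC;
          hints.ai_socktype = SOCK_STREAM;

          addrinfo* result = nullptr;
          auto start = std::chrono::steady_clock::now();
          int rc = getaddrinfo(host, "27017", &hints, &result);
          auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        std::chrono::steady_clock::now() - start)
                        .count();

          if (rc != 0) {
              std::fprintf(stderr, "getaddrinfo failed after %lldms: %s\n",
                           static_cast<long long>(ms), gai_strerror(rc));
              return 1;
          }
          std::printf("resolved %s:27017 in %lldms\n", host, static_cast<long long>(ms));
          freeaddrinfo(result);
          return 0;
      }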

      We suspect that these "slow DNS resolutions" may actually be a symptom of something slowing down within the mongos itself.
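
      The sketch below is a minimal illustration of that hypothesis (it does not reproduce the actual mongos code): if the elapsed time in the warning is measured from when the connect attempt begins rather than from when the lookup itself runs, then any internal stall, such as resolve work waiting behind a backed-up executor, gets attributed to "DNS resolution" even when the lookup is fast.

      // Minimal, hypothetical illustration: a fast lookup measured behind an
      // internal stall is reported as a multi-second "DNS resolution".
      #include <chrono>
      #include <iostream>
      #include <thread>

      using namespace std::chrono;
      using namespace std::chrono_literals;

      int main() {
          // Timer starts when the connect attempt begins.
          auto start = steady_clock::now();

          // Hypothesized internal slowdown: the resolve work sits behind a busy
          // executor/thread for ~2 seconds before it actually runs.
          std::this_thread::sleep_for(2s);

          // The DNS lookup itself is fast (~10ms in this simulation).
          std::this_thread::sleep_for(10ms);

          auto elapsed = duration_cast<milliseconds>(steady_clock::now() - start);
          // Prints roughly 2010ms, mirroring the shape of the warnings above even
          // though the "lookup" only took ~10ms.
          std::cout << "DNS resolution took " << elapsed.count() << "ms\n";
          return 0;
      }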

      Both HELP tickets involve clusters running MongoDB 4.2.

      Acceptance Criteria: 

      Produce a consistent repro or a hypothesis for this issue and, if needed, file a SERVER ticket to fix it.

            Assignee:
            backlog-server-servicearch [DO NOT USE] Backlog - Service Architecture
            Reporter:
            jordi.serra-torrens@mongodb.com Jordi Serra Torrens
            Votes:
            0
            Watchers:
            18

              Created:
              Updated:
              Resolved: