[SERVER-53662] Investigate if slow DNS resolution warnings are a symptom of an internal slowdown
Created: 08/Jan/21  Updated: 27/Oct/23  Resolved: 19/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Jordi Serra Torrens
Assignee: Backlog - Service Architecture
Resolution: Gone away
Votes: 0
Labels: servicearch-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive baas-shard-00-01-rvhot.mongodb.net_2020-12-02T14_00_00_2020-12-02T15_30_00_mongos.zip    
Issue Links:
Related
Assigned Teams:
Service Arch
Participants:
Story Points: 5

 Description   

We have recently seen two HELP tickets (HELP-20454 and HELP-20891) where the mongos logs report DNS resolutions taking several seconds:

2020-12-02T14:03:44.344+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-00-rvhot.mongodb.net:27017 took 1621ms
2020-12-02T14:03:57.261+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-01-rvhot.mongodb.net:27017 took 2593ms
2020-12-02T14:03:57.261+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-00-rvhot.mongodb.net:27017 took 2593ms
2020-12-02T14:04:23.098+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-02-rvhot.mongodb.net:27017 took 5454ms
2020-12-02T14:04:23.179+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-04-02-rvhot.mongodb.net:27017 took 5511ms
2020-12-02T14:04:23.179+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-04-00-rvhot.mongodb.net:27017 took 5511ms
2020-12-02T14:04:23.188+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-05-01-rvhot.mongodb.net:27017 took 5520ms

Followed, about four hours later in the same log, by connection timeouts:

2020-12-02T18:25:06.443+0000 I  NETWORK  [TaskExecutorPool-0] Marking host baas-shard-03-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:06.443+0000 I  NETWORK  [conn2551822] Marking host baas-shard-05-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554900] Marking host baas-shard-04-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554900] Marking host baas-shard-06-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554900] Marking host baas-shard-02-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.843+0000 I  NETWORK  [conn2554457] Marking host baas-shard-01-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.844+0000 I  NETWORK  [conn2550821] Marking host baas-shard-03-01-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2020-12-02T18:25:42.852+0000 I  NETWORK  [TaskExecutorPool-0] Marking host baas-shard-05-02-rvhot.mongodb.net:27017 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit

In at least one of the tickets, these errors coincide exactly with a performance degradation in that mongos, during which active sessions back up and accumulate (https://jira.mongodb.org/browse/HELP-20454?focusedCommentId=3532624&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-3532624).

In both tickets, we believe there was no real DNS infrastructure issue: the sys admins reported none, and no other node in the cluster showed the DNS warnings. In HELP-20891 they even added static entries so that the names would be resolved locally, but the warnings continued.
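
The claim that DNS itself was healthy could be cross-checked by timing name resolution on the affected host, outside the mongos process. The following is a minimal sketch under that assumption; `time_resolution` and its `resolver` parameter are hypothetical names introduced here for illustration and are not part of any MongoDB tooling. It goes through the operating system's resolver via Python's `socket.getaddrinfo`, which is not necessarily the same code path mongos uses.

```python
import socket
import time


def time_resolution(host, port=27017, resolver=socket.getaddrinfo):
    """Resolve `host` once and return (succeeded, elapsed_milliseconds).

    `resolver` defaults to the system resolver; it is injectable so the
    function can be exercised without network access.
    """
    start = time.monotonic()
    try:
        resolver(host, port, proto=socket.IPPROTO_TCP)
        succeeded = True
    except socket.gaierror:
        # Resolution failed (unknown host, resolver unreachable, ...).
        succeeded = False
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return succeeded, elapsed_ms
```

Calling, for example, `time_resolution("baas-shard-00-00-rvhot.mongodb.net")` repeatedly from the mongos host while the warnings are being logged would show whether resolution is slow for every process on that machine or only inside mongos.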

We suspect that these "slow DNS resolutions" are actually a symptom of something slowing down within the mongos itself.
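
One way to look for evidence in the logs alone is to check whether the warnings arrive in batches. In the excerpt above, several warnings for different hosts share a timestamp and a duration (for example, two at 14:03:57.261 with 2593ms). That pattern might be consistent with the resolutions waiting on something shared rather than on independent DNS lookups, though it does not prove it. The sketch below parses the 4.2-style plain-text warning and groups entries by second; the function names are hypothetical, and the regular expression is only an assumption based on the lines quoted in this ticket.

```python
import re
from collections import defaultdict

# Matches the warning format quoted in this ticket (4.2 plain-text logs).
DNS_WARNING = re.compile(
    r"^(?P<ts>\S+)\s+W\s+NETWORK\s+\[(?P<ctx>[^\]]+)\]\s+"
    r"DNS resolution while connecting to (?P<host>\S+) took (?P<ms>\d+)ms$"
)

# Three lines copied from the log excerpt in the description.
SAMPLE_LINES = [
    "2020-12-02T14:03:44.344+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-00-00-rvhot.mongodb.net:27017 took 1621ms",
    "2020-12-02T14:03:57.261+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-01-rvhot.mongodb.net:27017 took 2593ms",
    "2020-12-02T14:03:57.261+0000 W  NETWORK  [TaskExecutorPool-0] DNS resolution while connecting to baas-shard-01-00-rvhot.mongodb.net:27017 took 2593ms",
]


def parse_dns_warnings(lines):
    """Return (timestamp, thread, host, millis) for each matching line."""
    parsed = []
    for line in lines:
        match = DNS_WARNING.match(line.strip())
        if match:
            parsed.append(
                (match["ts"], match["ctx"], match["host"], int(match["ms"]))
            )
    return parsed


def group_by_second(warnings):
    """Group (host, millis) pairs by timestamp truncated to the second."""
    groups = defaultdict(list)
    for ts, _ctx, host, millis in warnings:
        # "2020-12-02T14:03:57.261+0000"[:19] == "2020-12-02T14:03:57"
        groups[ts[:19]].append((host, millis))
    return dict(groups)
```

Running `group_by_second(parse_dns_warnings(SAMPLE_LINES))` places the two 2593ms warnings in the same bucket; applying the same grouping to a full mongos log would show how often warnings complete together.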

Both HELP tickets involve clusters running 4.2.

Acceptance Criteria: 

Produce a consistent repro or hypothesis for this issue and, if needed, file a SERVER ticket to fix it.



 Comments   
Comment by Amirsaman Memaripour [ 19/Nov/21 ]

This issue has already been investigated and fixed as part of SERVER-59876.
