[CSHARP-3430] DnsClient.NET failures in Kubernetes and WSL2 Created: 18/Feb/21  Updated: 28/Oct/23  Resolved: 18/Feb/21

Status: Closed
Project: C# Driver
Component/s: Connectivity
Affects Version/s: None
Fix Version/s: 2.12.0

Type: Bug Priority: Major - P3
Reporter: James Kovacs Assignee: James Kovacs
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to CSHARP-4001 SRV and TXT DNS Failures - Duplicated... Closed
Case:

 Description   

DnsClient.NET 1.3.1 treats certain non-fatal DNS errors as fatal. These errors occur most frequently on Azure Kubernetes Service (AKS) and Windows Subsystem for Linux (WSL/WSL2). The symptom is a DnsResponseException with the message "Header id mismatch", which causes DNS resolution to fail. Details are in DnsClient.NET issue #79. The issue is resolved in DnsClient.NET 1.4.0.
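To illustrate the difference between treating a header ID mismatch as fatal and treating it as non-fatal, here is a minimal Python sketch. It is not DnsClient.NET's code and does not claim to reproduce the 1.4.0 fix. The names `make_message`, `read_id`, `receive`, and `HeaderIdMismatch` are hypothetical, and the in-memory `inbox` stands in for a UDP socket. The only DNS fact it relies on is that a message header is 12 bytes and begins with a 16-bit transaction ID.

```python
import struct
from collections import deque


class HeaderIdMismatch(Exception):
    """Hypothetical error: a response carried an unexpected transaction ID."""


def make_message(txid: int) -> bytes:
    # Build a bare 12-byte DNS header: ID, flags, and four zeroed counts.
    return struct.pack("!HHHHHH", txid, 0, 0, 0, 0, 0)


def read_id(message: bytes) -> int:
    # The transaction ID is the first two bytes, in network byte order.
    return struct.unpack("!H", message[:2])[0]


def receive(expected_id: int, inbox: deque, strict: bool) -> bytes:
    """Return the first response in inbox whose ID equals expected_id.

    strict=True:  the first mismatched ID raises HeaderIdMismatch.
    strict=False: mismatched responses are discarded as stale.
    """
    while inbox:
        response = inbox.popleft()
        if read_id(response) == expected_id:
            return response
        if strict:
            raise HeaderIdMismatch(
                f"expected {expected_id}, got {read_id(response)}"
            )
    raise TimeoutError("no matching response arrived")
```

For example, with `inbox = deque([make_message(53739), make_message(53740)])`, a leftover response for an earlier query sits ahead of the real answer. Calling `receive(53740, inbox, strict=True)` raises `HeaderIdMismatch`, while the same call with `strict=False` skips the stale message and returns the response for 53740.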

This was originally reported in PR#459.



 Comments   
Comment by Githook User [ 18/Feb/21 ]

Author:

{'name': 'James Kovacs', 'email': 'jkovacs@post.harvard.edu', 'username': 'JamesKovacs'}

Message: CSHARP-3430: Updated System.Buffers from 4.4.0 to 4.5.1 as required by DnsClient.NET 1.4.0.
Branch: master
https://github.com/mongodb/mongo-csharp-driver/commit/254e91405ba6e9c7016c7d3f4fd239090faf0fe0

Comment by James Kovacs [ 18/Feb/21 ]

From Jorik on PR#459:

Hello @JamesKovacs

This is the exception we're getting:

System.TimeoutException: A timeout occured after 30000ms selecting a server using CompositeServerSelector{ Selectors = MongoDB.Driver.MongoClient+AreSessionsSupportedServerSelector, LatencyLimitingServerSelector{ AllowedLatencyRange = 00:00:00.0150000 } }. Client view of cluster state is { ClusterId : "1", ConnectionMode : "ReplicaSet", Type : "ReplicaSet", State : "Disconnected", Servers : [], DnsMonitorException : "DnsClient.DnsResponseException: Header id mismatch.
   at DnsClient.DnsUdpMessageHandler.Query(IPEndPoint server, DnsRequestMessage request, TimeSpan timeout)
   at DnsClient.LookupClient.ResolveQuery(IReadOnlyList`1 servers, DnsQuerySettings settings, DnsMessageHandler handler, DnsRequestMessage request, LookupClientAudit audit)
   at DnsClient.LookupClient.QueryInternal(DnsQuestion question, DnsQuerySettings queryOptions, IReadOnlyCollection`1 servers)
   at MongoDB.Driver.Core.Misc.DnsClientWrapper.ResolveSrvRecords(String service, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Clusters.DnsMonitor.Monitor()" }.
   at MongoDB.Driver.Core.Clusters.Cluster.ThrowTimeoutException(IServerSelector selector, ClusterDescription description)
   at MongoDB.Driver.Core.Clusters.Cluster.WaitForDescriptionChangedHelper.HandleCompletedTask(Task completedTask)
   at MongoDB.Driver.Core.Clusters.Cluster.WaitForDescriptionChangedAsync(IServerSelector selector, ClusterDescription description, Task descriptionChangedTask, TimeSpan timeout, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Clusters.Cluster.SelectServerAsync(IServerSelector selector, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoClient.AreSessionsSupportedAfterSeverSelctionAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.MongoClient.AreSessionsSupportedAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.MongoClient.StartImplicitSessionAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.UsingImplicitSessionAsync[TResult](Func`2 funcAsync, CancellationToken cancellationToken)
   at MongoDB.Driver.IAsyncCursorSourceExtensions.SingleOrDefaultAsync[TDocument](IAsyncCursorSource`1 source, CancellationToken cancellationToken)

This is caused by AKS sending 2 DNS packets for a query. This is the support ticket we sent to AKS, and they confirmed the issue:

1. Install a custom configmap for coredns using `log.override: log` so that we can see all traffic hitting the coredns pods
2. Run a debian pod and install dnsutils and tcpdump packages
3. Start tcpdump and issue a dig, e.g. `dig microsoft.com`
4. Check that the UDP packet is sent correctly and once to kube-dns service, take note of the DNS request ID:
`21:38:02.987751 IP dotnet-playground-debian.47859 - kube-dns.kube-system.svc.cluster.local.53: 53739+ [1au] A? microsoft.com. (54)`
5. You would expect one coredns pod to log this packet once, but most of the time it is received twice (within the same coredns pod and with the same request ID):
`[INFO] 10.244.0.232:42274 - 53739 'A IN microsoft.com. udp 54 false 4096' NOERROR qr,aa,rd,ra 176 0.000154298s`
`[INFO] 10.244.0.232:42274 - 53739 'A IN microsoft.com. udp 54 false 4096' NOERROR qr,aa,rd,ra 176 0.000225796s`
6. Both responses are sent back to our debian pod in reply to the dig command:
`21:38:02.989997 IP kube-dns.kube-system.svc.cluster.local.53 - dotnet-playground-debian.47859: 53739 5/0/1 A 13.77.161.179, A 40.76.4.15, A 40.113.200.201, A 40.112.72.205, A 104.215.148.63 (187)`
`21:38:03.102732 IP kube-dns.kube-system.svc.cluster.local.53 - dotnet-playground-debian.47859: 53739 5/0/1 A 40.112.72.205, A 40.113.200.201, A 13.77.161.179, A 40.76.4.15, A 104.215.148.63 (187)`
7. About 75% of DNS requests appear to behave like this. We tried some tweaks to the pods' dnsConfig, such as `single-request-reopen`, and changing dnsPolicy, to no avail.
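A rough way to spot this pattern in captured coredns logs is to count how often the same client endpoint and request ID appear. The Python sketch below assumes log lines shaped like the two `[INFO]` samples quoted above. The regular expression and the function name `duplicated_queries` are illustrative, not part of coredns or any tooling mentioned in this ticket.

```python
import re
from collections import Counter

# Matches "[INFO] <ip>:<port> - <request id> '<query...>'" as in the samples above.
# Accepts either quote style around the query text.
LINE = re.compile(r"\[INFO\]\s+(?P<client>\S+:\d+)\s+-\s+(?P<qid>\d+)\s+['\"]")


def duplicated_queries(lines):
    """Return {(client, request_id): count} for entries logged more than once."""
    counts = Counter()
    for line in lines:
        match = LINE.search(line)
        if match:
            counts[(match["client"], int(match["qid"]))] += 1
    return {key: n for key, n in counts.items() if n > 1}
```

Fed the two sample log lines, this returns a single entry keyed by `("10.244.0.232:42274", 53739)` with a count of 2. Lines that do not match the assumed shape are ignored.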

Comment by James Kovacs [ 18/Feb/21 ]

DnsClient.NET 1.4.0 upgrades its System.Buffers dependency from 4.4.0 to 4.5.1, so the driver will have to upgrade System.Buffers as well.

Generated at Wed Feb 07 21:45:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.