Type: Spec Change
Resolution: Unresolved
Priority: Unknown
Component/s: Initial DNS Seedlist Discovery, SRV Polling
Summary
Provide way to prefer TCP for SRV lookup
Background & Motivation
DNS resolution is expected to first try with UDP, then retry with TCP if the UDP response indicates truncation.
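The expected fallback can be sketched as follows. This is a minimal illustration, not any driver's actual resolver: `send_udp` and `send_tcp` are hypothetical callables that take a raw DNS query and return the raw response. The TC flag is bit 0x0200 of the 16-bit flags field in the DNS header (RFC 1035).

```python
import struct

TC_FLAG = 0x0200  # truncation bit in the DNS header flags field (RFC 1035)


def is_truncated(response: bytes) -> bool:
    """Return True if the TC bit is set in a raw DNS response."""
    if len(response) < 12:
        raise ValueError("DNS message shorter than the 12-byte header")
    _msg_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


def resolve(query: bytes, send_udp, send_tcp) -> bytes:
    """Try UDP first; retry over TCP only if the server signals truncation.

    send_udp and send_tcp are hypothetical transport callables.
    """
    response = send_udp(query)
    if is_truncated(response):
        return send_tcp(query)
    return response
```

Note that a response with the TC bit clear is returned as-is, even if the server dropped records from it.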
HELP-59749 notes a case where a customer observed a subset of SRV records returned in the UDP response, but the truncation flag (TC bit) was not set:
TCP fallback does not work on their DNS records because the DNS server does not support TC bit in the response header.
As a result, a different subset of SRV records was applied each time the SRV records were polled. I expect this causes connections to be repeatedly closed and opened as servers are removed and added.
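The expected churn can be illustrated with a small sketch. The host names are hypothetical, and the figures (nine mongos servers, six returned per truncated response) are taken from the case described in this ticket; which six are returned on each poll is an assumption.

```python
def seedlist_diff(previous: set[str], current: set[str]) -> tuple[set[str], set[str]]:
    """Return (added, removed) hosts between two SRV polling results."""
    return current - previous, previous - current


# Hypothetical host names: nine mongos servers, of which a truncated
# response returns a different six on each poll.
all_hosts = [f"mongos{i}.example.com:27017" for i in range(9)]
poll_1 = set(all_hosts[0:6])
poll_2 = set(all_hosts[3:9])

added, removed = seedlist_diff(poll_1, poll_2)
print(sorted(added))    # hosts the driver would start monitoring
print(sorted(removed))  # hosts whose connections the driver would close
```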
DNS and Truncation in UDP suggests this may not be isolated to the customer:
some 72,000 cases (91% of all such cases) where the resolver appears to be using truncated DNS response data occur for users located in just three networks, all located in China.
Proposal: add a way to opt in to using TCP first (rather than only on retry) when resolving SRV records. Consider adding a URI option: srvPreferTCP.
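A rough sketch of how a driver might read such an option is shown below. srvPreferTCP is only the name proposed in this ticket, not an option any driver is known to accept; the helper names and the "last occurrence wins" handling of repeated keys are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def srv_prefer_tcp(uri: str) -> bool:
    """Return the value of the proposed srvPreferTCP option (default False)."""
    query = urlsplit(uri).query
    # URI option names are treated as case-insensitive, so normalize the keys.
    options = {key.lower(): values for key, values in parse_qs(query).items()}
    values = options.get("srvprefertcp")
    if not values:
        return False
    value = values[-1]  # assumption: the last occurrence wins
    if value not in ("true", "false"):
        raise ValueError(f"srvPreferTCP must be 'true' or 'false', got {value!r}")
    return value == "true"


def transport_order(uri: str) -> list[str]:
    """Transports to try for the SRV lookup, in order.

    With the default, TCP is only used as a retry after a truncated UDP response.
    """
    return ["tcp"] if srv_prefer_tcp(uri) else ["udp", "tcp"]
```

For example, `transport_order("mongodb+srv://test1.kevinalbs.com/?srvPreferTCP=true")` would skip UDP entirely, so the result would not depend on the server setting the TC bit.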
Alternatives
Using TCP initially by default is another option. RFC-7766 notes:
TCP ought to be considered a valid alternative transport to UDP, not purely a fallback option.
However, it also describes possible disadvantages in Appendix A.
Testing
To observe TCP-retry behavior, use Wireshark to capture DNS. In my case, I disabled CloudFlare WARP to disable DNS-over-HTTPS and ran the following Python:
from pymongo import MongoClient
client = MongoClient("mongodb+srv://test1.kevinalbs.com")
There were 30 SRV records for _mongodb._tcp.test1.kevinalbs.com, which resulted in the UDP response being truncated. In my case, the TC bit was set (as expected) and the TCP retry occurred.
I have not reliably reproduced the issue in HELP-59749 (UDP response is truncated, but TC bit not set).
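One way to look for that condition directly, without a packet capture, is to send the same SRV query over UDP and over TCP and compare the header fields. The following is a minimal standard-library sketch, not a tested reproduction: the function names are hypothetical, the query carries no EDNS option (so UDP responses are limited to 512 bytes), and the resolver address in the usage comment is a placeholder.

```python
import socket
import struct

SRV_TYPE = 33   # SRV record type (RFC 2782)
IN_CLASS = 1
TC_FLAG = 0x0200


def build_srv_query(name: str, msg_id: int = 0x1234) -> bytes:
    """Build a raw DNS query for the SRV records of `name`."""
    # Flags 0x0100 sets only RD (recursion desired); one question, no other sections.
    header = struct.pack("!HHHHHH", msg_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", SRV_TYPE, IN_CLASS)


def summarize(response: bytes) -> tuple[bool, int]:
    """Return (TC bit set, ANCOUNT) from a raw DNS response header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return bool(flags & TC_FLAG), ancount


def query_udp(server: str, query: bytes, timeout: float = 5.0) -> bytes:
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(query, (server, 53))
        response, _addr = sock.recvfrom(4096)
        return response


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before the full DNS message arrived")
        buf += chunk
    return buf


def query_tcp(server: str, query: bytes, timeout: float = 5.0) -> bytes:
    with socket.create_connection((server, 53), timeout=timeout) as sock:
        # DNS over TCP prefixes each message with a two-byte length.
        sock.sendall(struct.pack("!H", len(query)) + query)
        (length,) = struct.unpack("!H", _recv_exact(sock, 2))
        return _recv_exact(sock, length)


# Usage (placeholder resolver address; replace with the resolver under test):
# server = "192.0.2.53"
# q = build_srv_query("_mongodb._tcp.test1.kevinalbs.com")
# print("udp:", summarize(query_udp(server, q)))
# print("tcp:", summarize(query_tcp(server, q)))
```

A UDP result with the TC bit clear and a smaller ANCOUNT than the TCP result would be consistent with the behavior reported in HELP-59749.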
How does this affect the end user?
In the case of HELP-59749, a different subset of SRV records was applied each time the SRV records were polled. I expect this causes connections to be repeatedly closed and opened as servers are removed and added.
How likely is it that this problem or use case will occur?
This occurred in HELP-59749. I expect this impacts multiple drivers (PyMongo, Go, Rust, and C all query with UDP first).
DNS and Truncation in UDP suggests this may not be isolated to the customer. However, the article suggests this impacts a small percentage of DNS environments.
If the problem does occur, what are the consequences and how severe are they?
In the case of HELP-59749, a different subset of SRV records was applied each time the SRV records were polled. The SRV records had a TTL of one minute. I expect this causes connections to be repeatedly closed and opened as servers are removed and added.
The truncated records result in fewer mongos servers being available for the driver to use. In the case of HELP-59749, 9 mongos servers were expected, but only 6 were applied due to truncation.
Is this issue urgent?
No? HELP-59749 is urgent, but a C-driver-specific solution was made in CDRIVER-5589.
Acceptance Criteria
When implemented (and enabled), SRV records will be queried with TCP.
Related to: CDRIVER-5589 Add option to prefer TCP for SRV lookup (Closed)