DRIVERS-2919

Provide way to prefer TCP for SRV lookup

    • Needed

      Summary

      Provide way to prefer TCP for SRV lookup

      Background & Motivation

      DNS resolution is expected to first try with UDP, then retry with TCP if the UDP response indicates truncation.
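The retry decision described above depends on a single bit in the DNS response header. The following is a minimal sketch, using only the Python standard library, of how a resolver might test that bit. The function name `is_truncated` is hypothetical and not taken from any driver. If a server omits the TC bit, this check returns False and no TCP retry is triggered.

```python
import struct

# TC (truncation) bit within the 16-bit flags field of the DNS header (RFC 1035, section 4.1.1).
TC_FLAG = 0x0200


def is_truncated(response: bytes) -> bool:
    """Return True if the DNS response header has the TC bit set.

    Hypothetical helper for illustration; not part of any driver API.
    """
    if len(response) < 12:
        raise ValueError("DNS message is shorter than the 12-byte header")
    # The header starts with a 16-bit message ID followed by the 16-bit flags.
    _msg_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_FLAG)


# Two example headers: a response with QR, TC, RD, RA set, and one without TC.
truncated_header = struct.pack("!HHHHHH", 0x1234, 0x8380, 1, 0, 0, 0)
complete_header = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 0, 0, 0)
```

A resolver following the usual behavior would retry over TCP only when `is_truncated` returns True for the UDP response.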

      HELP-59749 notes a case where a customer observed a subset of SRV records returned in the UDP response, but the truncation flag (TC bit) was not set:

      TCP fallback does not work on their DNS records because the DNS server does not support TC bit in the response header.

      As a result, a changing subset of SRV records was applied each time SRV records were polled. I expect this results in repeated closing and opening of connections as servers are removed and added.

      DNS and Truncation in UDP suggests this may not be isolated to the customer:

      some 72,000 cases (91% of all such cases) where the resolver appears to be using truncated DNS response data occur for users located in just three networks, all located in China.

      Proposal: add a way to opt in to using TCP first when resolving SRV records (rather than only on retry). Consider adding a URI option: srvPreferTCP.
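As a rough illustration of the proposal, the sketch below reads the suggested option from a connection string. `srvPreferTCP` is only proposed in this ticket and is not an existing driver option. The helper name `srv_prefer_tcp`, the default of false, and the case-insensitive option matching are assumptions made for this example.

```python
from urllib.parse import parse_qs, urlsplit


def srv_prefer_tcp(uri: str) -> bool:
    """Return the value of the proposed srvPreferTCP option (assumed default: False).

    Hypothetical helper; the option is only a proposal in this ticket.
    """
    query = urlsplit(uri).query
    # Assumption: the option name is matched case-insensitively.
    options = {key.lower(): values for key, values in parse_qs(query).items()}
    values = options.get("srvprefertcp")
    if not values:
        return False
    value = values[-1]
    if value not in ("true", "false"):
        raise ValueError(f"srvPreferTCP must be 'true' or 'false', got {value!r}")
    return value == "true"
```

For example, `srv_prefer_tcp("mongodb+srv://test1.kevinalbs.com/?srvPreferTCP=true")` opts in, while a URI without the option keeps the current UDP-first behavior.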

      Alternatives

      Using TCP initially by default is another option. RFC 7766 notes:

      TCP ought to be considered a valid alternative transport to UDP, not purely a fallback option.

      However, it also describes possible disadvantages in Appendix A.

      Testing

      To observe the TCP-retry behavior, use Wireshark to capture DNS traffic. In my case, I disabled Cloudflare WARP (to turn off DNS-over-HTTPS) and ran the following Python:

      from pymongo import MongoClient
      client = MongoClient("mongodb+srv://test1.kevinalbs.com")
      

      There were 30 SRV records for _mongodb._tcp.test1.kevinalbs.com. This resulted in the UDP response being truncated. In my case, the TC bit is (expectedly) set and the TCP retry occurs.
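A back-of-the-envelope estimate shows why 30 SRV records do not fit in a classic UDP response. The sketch below computes a lower bound on the response size, assuming the 512-byte limit from RFC 1035 (a resolver advertising a larger EDNS(0) buffer would have a higher limit), compressed owner names, and the shortest possible target in every record. Real targets are longer, so the actual response is larger. The function names are hypothetical.

```python
CLASSIC_UDP_LIMIT = 512  # RFC 1035 payload limit without EDNS(0)
HEADER_BYTES = 12


def encoded_name_length(name: str) -> int:
    """Wire length of an uncompressed domain name: one length byte per label, plus the root byte."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    return sum(1 + len(label) for label in labels) + 1


def min_srv_response_size(qname: str, record_count: int) -> int:
    """Lower bound on the size of a response carrying record_count SRV answers."""
    question = encoded_name_length(qname) + 4  # QTYPE + QCLASS
    # Per answer: 2-byte compression pointer for the owner name, then TYPE (2),
    # CLASS (2), TTL (4), RDLENGTH (2), priority (2), weight (2), port (2),
    # and at least 1 byte for the target name.
    per_answer = 2 + 2 + 2 + 4 + 2 + 2 + 2 + 2 + 1
    return HEADER_BYTES + question + record_count * per_answer


lower_bound = min_srv_response_size("_mongodb._tcp.test1.kevinalbs.com", 30)
exceeds_classic_limit = lower_bound > CLASSIC_UDP_LIMIT
```

Under these assumptions the lower bound is 621 bytes, already above 512, which is consistent with the truncation observed in the capture.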

      I have not reliably reproduced the issue in HELP-59749 (UDP response is truncated, but TC bit not set).

      How does this affect the end user?

      In the case of HELP-59749, a changing subset of SRV records was applied each time SRV records were polled. I expect this results in repeated closing and opening of connections as servers are removed and added.

      How likely is it that this problem or use case will occur?

      This occurred in HELP-59749. I expect this impacts multiple drivers (PyMongo, Go, Rust, and C all query with UDP first).

      DNS and Truncation in UDP suggests this may not be isolated to the customer. However, the article suggests this impacts a small percentage of DNS environments.

      If the problem does occur, what are the consequences and how severe are they?

      In the case of HELP-59749, a changing subset of SRV records was applied each time SRV records were polled. SRV records had a TTL of one minute. I expect this results in repeated closing and opening of connections as servers are removed and added.

      The truncated records result in fewer mongos servers being available for the driver to use. In the case of HELP-59749, 9 mongos servers were expected, but only 6 were applied due to truncation.

      Is this issue urgent?

      No? HELP-59749 is urgent, but a C-driver-specific solution was made in CDRIVER-5589.

      Acceptance Criteria

      When implemented (and enabled), SRV records will be queried with TCP.
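To make the acceptance criterion concrete: a DNS query sent over TCP is the same message as over UDP, prefixed with a two-byte length field (RFC 1035, section 4.2.2). The sketch below builds an SRV query and frames it for TCP using only the standard library. The function names are hypothetical, and a real driver would normally delegate this to its DNS library rather than build packets by hand.

```python
import struct

SRV_TYPE = 33  # SRV record type (RFC 2782)
IN_CLASS = 1   # Internet class
RD_FLAG = 0x0100  # recursion desired


def build_srv_query(qname: str, msg_id: int = 0) -> bytes:
    """Build a minimal DNS query message asking for SRV records of qname."""
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
    header = struct.pack("!HHHHHH", msg_id, RD_FLAG, 1, 0, 0, 0)
    labels = [label for label in qname.rstrip(".").split(".") if label]
    wire_name = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels) + b"\x00"
    return header + wire_name + struct.pack("!HH", SRV_TYPE, IN_CLASS)


def frame_for_tcp(message: bytes) -> bytes:
    """Prefix the message with its length as a 16-bit big-endian integer."""
    return struct.pack("!H", len(message)) + message


query = build_srv_query("_mongodb._tcp.test1.kevinalbs.com", msg_id=0x1234)
framed = frame_for_tcp(query)
```

The framed bytes could then be written to a TCP connection on port 53 of the resolver. The response arrives with the same two-byte length prefix.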

            Assignee: Unassigned
            Reporter: Kevin Albertson (kevin.albertson@mongodb.com)