  1. Core Server
  2. SERVER-97572

More gracefully recover from DNS infrastructure instability

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Networking & Observability

      The AsioTransportLayer currently uses asio's async_resolve function to perform DNS resolution. This function is actually implemented by running getaddrinfo synchronously on a single background thread per io_context (AsioReactor). This presents two problems:

      • DNS resolution happens one at a time, so a slow DNS response can block all connection establishment for a given NetworkInterface.
      • Once issued, DNS resolution requests cannot be cancelled, whether they are still queued or already in progress.
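
      The first problem can be illustrated with a small model. This is only a sketch, not MongoDB or asio code: the names serialCompletionTimes and countTimedOut are hypothetical, and the model assumes one worker that runs blocking lookups strictly in FIFO order, so each request finishes only after every request ahead of it.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Millis = std::chrono::milliseconds;

// Hypothetical model (not MongoDB code): with a single worker running blocking
// lookups in FIFO order, a request completes only after all requests ahead of it.
// Returns the elapsed time at which each request's result becomes available.
std::vector<Millis> serialCompletionTimes(const std::vector<Millis>& lookupLatencies) {
    std::vector<Millis> done;
    done.reserve(lookupLatencies.size());
    Millis elapsed{0};
    for (const Millis& latency : lookupLatencies) {
        elapsed += latency;  // the worker is busy for the whole lookup
        done.push_back(elapsed);
    }
    return done;
}

// Counts requests whose result arrives after the caller's connect timeout
// (an assumed, caller-supplied value) has already expired.
std::size_t countTimedOut(const std::vector<Millis>& completionTimes, Millis connectTimeout) {
    std::size_t timedOut = 0;
    for (const Millis& t : completionTimes) {
        if (t > connectTimeout) {
            ++timedOut;
        }
    }
    return timedOut;
}
```

      With illustrative latencies of 5000 ms, 10 ms and 10 ms and a 1000 ms connect timeout, all three requests miss the timeout in this model, even though two of the lookups were fast on their own.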

      In the event that the DNS server is responding slowly for an extended period of time, the queue of DNS resolution requests can grow to be very large, leading to connection establishment timeouts and further DNS requests for reestablishment. Since DNS requests cannot be cancelled, this queue can keep growing and is only cleared as the background thread finishes processing the requests or if the process is restarted. This can make MongoDB processes take longer to recover from periods of slow/unavailable DNS servers, which is a risk to overall availability.

      We should improve the server's DNS resolution to be able to cancel in-progress and queued requests so that it can quickly recover from periods of DNS instability. One option would be to adopt the c-ares async DNS resolution library, which is actually already in use by the gRPC transport layer.
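
      As a rough illustration of cancelling queued requests, the sketch below gives each request a cancellation flag and lets the worker discard cancelled entries without performing a lookup. All names here (ResolveRequest, CancellableResolveQueue, nextLive) are hypothetical and are not part of MongoDB, asio, or c-ares. It covers only the queued case; interrupting a lookup that is already in progress would still need an asynchronous resolver such as the c-ares option mentioned above.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical request record: the host to resolve plus a cancellation flag.
struct ResolveRequest {
    std::string host;
    std::atomic<bool> cancelled{false};
};

// Hypothetical FIFO of pending lookups that supports cancelling queued entries.
class CancellableResolveQueue {
public:
    using Handle = std::shared_ptr<ResolveRequest>;

    // Enqueue a lookup and return a handle the caller can later cancel.
    Handle enqueue(std::string host) {
        auto req = std::make_shared<ResolveRequest>();
        req->host = std::move(host);
        std::lock_guard<std::mutex> lk(_mutex);
        _queue.push_back(req);
        return req;
    }

    // Mark a request as no longer wanted, e.g. after its connection attempt timed out.
    static void cancel(const Handle& req) {
        req->cancelled.store(true);
    }

    // Pop the next request that is still wanted, dropping cancelled ones
    // without resolving them. Returns nullptr when no live request remains.
    Handle nextLive() {
        std::lock_guard<std::mutex> lk(_mutex);
        while (!_queue.empty()) {
            Handle req = std::move(_queue.front());
            _queue.pop_front();
            if (!req->cancelled.load()) {
                return req;
            }
        }
        return nullptr;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _queue.size();
    }

private:
    mutable std::mutex _mutex;
    std::deque<Handle> _queue;
};
```

      In this design a backlog of abandoned requests costs the worker only a flag check per entry, so the queue can drain quickly once callers have given up on them.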

      Note that fixing this isn't a replacement for ensuring that the deployment has highly available and performant DNS infrastructure (e.g. by leveraging a DNS cache), as that is essential for ensuring MongoDB itself is available and performant in general, regardless of the issue identified in this ticket. As it pertains to this issue in particular, if the DNS infrastructure begins servicing requests quickly after a period of slowness/unavailability, the queue should be able to be cleared relatively quickly, even if it had already grown quite large.

            Assignee:
            Unassigned
            Reporter:
            Patrick Freed (patrick.freed@mongodb.com)
            Votes:
            1
            Watchers:
            10