Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Networking & Observability
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The AsioTransportLayer currently uses asio's async_resolve function to perform DNS resolution. This function is actually implemented by running getaddrinfo synchronously on a single background thread per io_context (AsioReactor). This presents two problems:

1. DNS resolution happens one request at a time, so a slow DNS response can block all connection establishment for a given NetworkInterface.
2. Once issued, DNS resolution requests cannot be cancelled.
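
To make the first problem concrete, here is a minimal model of a single worker draining a FIFO queue of blocking lookups. It is only a sketch: the function name `fifoCompletionTimesMs` and the durations are hypothetical, it assumes every request is enqueued at time zero, and it does not call asio or getaddrinfo.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model, not MongoDB or asio code. Each entry is the time, in
// milliseconds, that one blocking lookup would take. All requests are assumed
// to be enqueued at time 0 and served by one worker in FIFO order.
// Returns the time at which each request completes.
std::vector<long> fifoCompletionTimesMs(const std::vector<long>& lookupDurationsMs) {
    std::vector<long> completions;
    completions.reserve(lookupDurationsMs.size());
    long now = 0;
    for (long duration : lookupDurationsMs) {
        assert(duration >= 0);
        now += duration;  // the worker is busy until this lookup returns
        completions.push_back(now);
    }
    return completions;
}
```

With durations of 5000, 10, and 10 ms, the two fast lookups finish at 5010 ms and 5020 ms in this model, because each must wait for the slow lookup ahead of it.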

In the event that the DNS server is responding slowly for an extended period of time, the queue of DNS resolution requests can grow to be very large, leading to connection establishment timeouts and further DNS requests for reestablishment. Since DNS requests cannot be cancelled, this queue can keep growing and is only cleared as the background thread finishes processing the requests or if the process is restarted. This can make MongoDB processes take longer to recover from periods of slow/unavailable DNS servers, which is a risk to overall availability.

We should improve the server's DNS resolution to be able to cancel in-progress and queued requests so that it can quickly recover from periods of DNS instability. One option would be to adopt the c-ares async DNS resolution library, which is actually already in use by the gRPC transport layer.
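
As a rough illustration of what cancelling queued requests could look like, the sketch below models a resolution queue whose entries can be removed before they are processed. Every name in it (`CancellableResolveQueue`, `enqueue`, `cancel`, `runOne`) is hypothetical and not taken from MongoDB, asio, or c-ares. It is single-threaded and covers only queued requests; cancelling a lookup that is already in progress would need an asynchronous resolver underneath.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch of a resolution queue with cancellation of queued work.
class CancellableResolveQueue {
public:
    using RequestId = std::uint64_t;
    enum class Outcome { kResolved, kCancelled };
    using Callback = std::function<void(Outcome, const std::string& address)>;
    using ResolveFn = std::function<std::string(const std::string& host)>;

    // Queues a lookup and returns a handle that can be passed to cancel().
    RequestId enqueue(std::string host, Callback cb) {
        assert(cb != nullptr);
        RequestId id = _nextId++;
        _pending.push_back(Request{id, std::move(host), std::move(cb)});
        return id;
    }

    // Removes a still-queued request and completes it as cancelled.
    // Returns false if the request is unknown or was already processed.
    bool cancel(RequestId id) {
        auto it = std::find_if(_pending.begin(), _pending.end(),
                               [id](const Request& r) { return r.id == id; });
        if (it == _pending.end()) {
            return false;
        }
        Request req = std::move(*it);
        _pending.erase(it);
        req.cb(Outcome::kCancelled, "");
        return true;
    }

    // Processes the request at the head of the queue with the supplied
    // resolver. Returns false if the queue is empty.
    bool runOne(const ResolveFn& resolve) {
        if (_pending.empty()) {
            return false;
        }
        Request req = std::move(_pending.front());
        _pending.pop_front();
        req.cb(Outcome::kResolved, resolve(req.host));
        return true;
    }

    std::size_t size() const { return _pending.size(); }

private:
    struct Request {
        RequestId id;
        std::string host;
        Callback cb;
    };
    std::deque<Request> _pending;
    RequestId _nextId = 1;
};
```

In this model, a connection attempt that times out would call `cancel` with its request handle, so the stale lookup never reaches the resolver and the queue shrinks immediately instead of waiting for the worker to reach it.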

Note that fixing this isn't a replacement for ensuring that the deployment has highly available and performant DNS infrastructure (e.g. by leveraging a DNS cache), which is essential for MongoDB itself to be available and performant in general, regardless of the issue identified in this ticket. As for this issue in particular, if the DNS infrastructure begins servicing requests quickly after a period of slowness/unavailability, the queue should clear relatively quickly, even if it had already grown quite large.

See:

related to SERVER-64876 Cancel any ASIO asynchronous resolution as part of a TransportLayerASIO shutdown

Backlog

Assignee: Unassigned
Reporter: Patrick Freed
Participants: Patrick Freed
Votes: 1
Watchers: 11

Created: Nov 22 2024 10:27:55 PM UTC
Updated: Dec 26 2024 04:57:47 PM UTC
