  Drivers / DRIVERS-1977

Add a default idle timeout for pooled connections to CMAP

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Unknown
    • Component/s: CMAP, FaaS
    • Labels: None

      Summary

      By default, drivers currently do not limit connection idle time for pooled connections. Drivers using the default configuration will never close idle connections, even if they remain idle for long periods of time. For applications with variable demand for connections, this means driver connection pools may grow as demand increases but never shrink when demand decreases. As a result, client-side and server-side TCP connection resources are never released, even when they are effectively unused.

      We should amend the CMAP specification to set a default maxIdleTimeMS of 1800000 (30 minutes). To prevent connection churn when maintaining minPoolSize, we should also amend the definition of Idle to the following:

      Idle: The Connection is currently "available" (as defined below), has been "available" for longer than maxIdleTimeMS, and the number of total connections in the pool is greater than minPoolSize.
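
      For illustration, here is a minimal Go sketch of the amended idle check. The types, fields, and method below are hypothetical and only illustrate the rule; they are not CMAP spec pseudocode or actual driver code:

          package main

          import (
              "fmt"
              "time"
          )

          // pooledConnection and pool are hypothetical illustrations of pool state.
          type pooledConnection struct {
              availableSince time.Time // when the connection was last checked back into the pool
          }

          type pool struct {
              maxIdleTime time.Duration // corresponds to maxIdleTimeMS; proposed default: 30 minutes
              minPoolSize int
              totalCount  int // total connections currently in the pool
          }

          // isIdle reports whether an "available" connection is Idle under the amended
          // definition: it has been available for longer than maxIdleTime, and the pool
          // holds more than minPoolSize total connections.
          func (p *pool) isIdle(c *pooledConnection, now time.Time) bool {
              return now.Sub(c.availableSince) > p.maxIdleTime &&
                  p.totalCount > p.minPoolSize
          }

          func main() {
              p := &pool{maxIdleTime: 30 * time.Minute, minPoolSize: 5, totalCount: 8}
              c := &pooledConnection{availableSince: time.Now().Add(-45 * time.Minute)}
              fmt.Println(p.isIdle(c, time.Now())) // true: available for 45m > 30m, and 8 > 5
          }

      Because the check also requires totalCount > minPoolSize, connections kept open only to satisfy minPoolSize are never considered idle, which avoids the connection churn mentioned above.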

      References for picking a reasonable default idle timeout:

      • The Go HTTP client "net/http".DefaultTransport uses a default idle connection timeout of 90 seconds.
      • The Go SQL client "database/sql".DB has no default idle connection timeout, but uses a default limit of 2 idle connections.
      • The OpenJDK 11 HttpClient ConnectionPool uses a default idle connection timeout of 20 minutes.
      • MongoDB cursors have a default server-side timeout of 10 minutes.
      • MongoDB sessions have a default server-side timeout of 30 minutes.

      Note that creating a new MongoDB connection is typically more expensive than creating a new HTTP connection, so the risk of closing connections too eagerly is higher. We choose a default idle timeout of 30 minutes because it is equal to or greater than all of the reference network client timeouts and server-side state timeouts.
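
      For reference, an application can already opt into this behavior explicitly today. A minimal sketch using the Go driver, which exposes maxIdleTimeMS via SetMaxConnIdleTime (the URI and error handling here are placeholder assumptions):

          package main

          import (
              "context"
              "time"

              "go.mongodb.org/mongo-driver/mongo"
              "go.mongodb.org/mongo-driver/mongo/options"
          )

          func main() {
              // Explicitly configure the proposed default of 30 minutes
              // (equivalent to maxIdleTimeMS=1800000 in the connection string).
              // "mongodb://localhost:27017" is a placeholder URI.
              opts := options.Client().
                  ApplyURI("mongodb://localhost:27017").
                  SetMaxConnIdleTime(30 * time.Minute)

              client, err := mongo.Connect(context.Background(), opts)
              if err != nil {
                  panic(err)
              }
              defer client.Disconnect(context.Background())
          }

      The proposal would make this 30-minute value the default, so applications get the same behavior without setting maxIdleTimeMS at all.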

      Motivation

      Who is the affected end user?

      Anyone who uses a driver with the default maxIdleTimeMS setting.

      How does this affect the end user?

      The driver may hold effectively unused TCP connection resources indefinitely.

      How likely is it that this problem or use case will occur?

      The problem will occur any time a customer uses the default maxIdleTimeMS in an application that has variable demand for connections.

      For example, if a customer runs a web service that experiences spikes in request throughput, the driver may create new connections to handle the corresponding increase in operation throughput. When the request throughput returns to normal, some driver connections may remain idle for long periods of time (depending on the frequency of request throughput spikes). If the driver is configured with the default maxIdleTimeMS, it will never close those now-idle connections.

      If the problem does occur, what are the consequences and how severe are they?

      There are a few consequences:

      • Increased memory and network resource usage (client-side and server-side TCP handles, TCP keepalive pings for idle connections, maybe additional server-side resources?).
      • Reduced buffer between the number of open connections and the maximum client-side and server-side connection limits.
        • In practice, this is probably more significant for server-side connection limits, especially for low-tier Atlas clusters which tend to have a low per-node connection limit.
          For example, consider a customer who runs 10 instances of a web application connected to an Atlas M10 3-node replica set (max 1,500 connections per node). Under normal request throughput, each web application instance opens 15 connections to the primary (150 connections total), leaving a buffer of 1,350 connections between the current and maximum connection limit. A large spike in request throughput causes the driver on each web application instance to create the default maximum of 100 connections to the primary (1,000 connections total). The buffer between the current and maximum connection limit is now only 500 connections and, with the default maxIdleTimeMS, will not recover until the web application instances are restarted (see the arithmetic sketch below).
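
          A small sketch of the arithmetic from the example above, using only the numbers already stated there:

              package main

              import "fmt"

              func main() {
                  const (
                      maxPerNode  = 1500 // Atlas M10 per-node connection limit
                      instances   = 10   // web application instances
                      normalConns = 15   // connections per instance to the primary under normal load
                      spikeConns  = 100  // default maxPoolSize reached during a throughput spike
                  )

                  fmt.Println("normal buffer:", maxPerNode-instances*normalConns)    // 1350
                  fmt.Println("post-spike buffer:", maxPerNode-instances*spikeConns) // 500
              }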

      Is this issue urgent?

      No.

      Is this ticket required by a downstream team?

      No.

      Is this ticket only for tests?

      No.

            Assignee: Unassigned
            Reporter: matt.dale@mongodb.com (Matt Dale)
            Votes: 0
            Watchers: 5
