[SERVER-31251] Increase backlog specified in listen() Created: 25/Sep/17  Updated: 08/Jan/24

Status: Open
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: features we're not sure of

Type: Improvement Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 4
Labels: features-not-sure-of
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File backlog.png     PNG File spike.png    
Issue Links:
Depends
Related
related to SERVER-31400 Record Linux netstat metrics in ftdc Closed
is related to SERVER-2554 Make listen backlog configurable Closed
Assigned Teams:
Service Arch
Sprint: Service Arch 2018-11-05, Service Arch 2018-11-19, Service Arch 2018-12-03
Participants:
Case:

Description

A small value for the backlog parameter to listen() can result in very high client-observed latencies when an increase in application load or a decrease in database performance requires the client to create a large number of connections nearly simultaneously (a "connection spike").

When a connection spike occurs, mongod may not be able to accept() the connections as quickly as they are created. Once the number of connections that have been created but not yet accepted by mongod exceeds the backlog, the SYN packets that would establish further connections are ignored. The affected clients retransmit their SYNs 1 second later, creating another, smaller spike. If that spike also exceeds the backlog, the result is more ignored SYN packets and another spike of retries 2 seconds later, and so on, with the clients doubling the backoff time on each attempt. Since the backoff is exponential, some unlucky connections can wait a very long time - tens of seconds - resulting in extreme operation latencies as seen by the client.
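For a sense of scale, the worst-case wait follows directly from that doubling: with an initial SYN retransmission timeout of roughly 1 second and a default of 6 retries (net.ipv4.tcp_max_syn_backlog aside, this is net.ipv4.tcp_syn_retries on Linux), a connection whose SYNs are dropped until the final retry waits about 1 + 2 + 4 + 8 + 16 + 32 = 63 seconds, which lines up with the ~64 s latencies described below. A minimal sketch of that arithmetic, assuming default Linux settings:

    #include <stdio.h>

    /* Worst-case connect delay when every SYN up to the last retry is dropped.
     * Assumes Linux defaults: net.ipv4.tcp_syn_retries = 6 and an initial
     * retransmission timeout of ~1 second, doubling on each retry. */
    int main(void) {
        int syn_retries = 6;   /* net.ipv4.tcp_syn_retries (assumed default) */
        double rto = 1.0;      /* initial retransmission timeout, in seconds */
        double total = 0.0;
        for (int i = 0; i < syn_retries; i++) {
            total += rto;      /* wait out this timeout before the next SYN */
            rto *= 2.0;        /* exponential backoff */
        }
        printf("worst-case connect delay: ~%.0f s\n", total);  /* ~63 s */
        return 0;
    }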

This can be seen in the following two runs of the same client, which starts 5000 threads to create a spike of 5000 connections. The first run uses a backlog setting of 10000, while the second uses the default backlog of 128.

At the default backlog setting (on the right) the network stats report a large number of "listen drops", which produces the series of connection spikes of decreasing size and the extreme client-observed latencies of ~64 s ("slowest query durations") seen here.

With the larger backlog setting (on the left) there is no exponential backoff - connections are queued by the kernel and accepted by mongod at whatever rate it can manage, eliminating the extreme latency outliers.

MongoDB 3.6 introduces a "--listenBacklog" parameter to allow setting the backlog. However, there are two problems with this:

  • the default is SOMAXCONN (which is 128)
  • increasing --listenBacklog alone isn't sufficient, because the value is silently truncated to the kernel parameter net.core.somaxconn, which also defaults to 128, so the user must raise that as well.

Rather than requiring the user to change two parameters, it would seem more straightforward for mongod to specify a large value in listen() by default, allowing the customer to control the backlog just by changing net.core.somaxconn.
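As a rough illustration of that proposal (a sketch only, not mongod's actual networking code), the server could pass a very large backlog to listen() and let the kernel silently clamp it to net.core.somaxconn, leaving that sysctl as the single knob an operator needs to raise:

    #include <limits.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(27017);          /* mongod's default port */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }

        /* Request an effectively unbounded backlog; Linux silently truncates
         * it to net.core.somaxconn (128 by default on the kernels discussed
         * here), so raising that sysctl is the only change required. */
        if (listen(fd, INT_MAX) < 0) {
            perror("listen");
            return 1;
        }

        /* accept() loop would go here */
        close(fd);
        return 0;
    }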



Comments
Comment by Bruce Lucas (Inactive) [ 28/Nov/18 ]

The observed behavior for the status quo was different from that: the kernel simply drops the SYN packets when the backlog fills up, causing the client (at the network driver level) to go into exponential backoff. The result is that some of the connections take a very long time to establish, but clients do not see ECONNREFUSED. An application may time out the connection attempt and retry before the connection is eventually established, but I think that just puts us at step 4 in your second scenario, so I'm not sure there's a difference between the scenarios if connection establishment is slow enough that apps time out. But in the case where mongod is able to accept the connections before the app begins to time out its connection attempts, a large backlog allows that to happen without the extreme latencies caused by the exponential backoff.

Comment by Mira Carey [ 28/Nov/18 ]

luke.prochazka - there are more issues around connection establishment than just the listen backlog. Changing that without addressing those other areas could easily make connection storms worse, rather than better.

Consider two scenarios, one with a small listen backlog, one with a large one:

Status Quo

  1. Many client applications open connections to a host at once
  2. mongod can't accept them fast enough
  3. Listen backlog fills
  4. Clients begin receiving ECONNREFUSED
  5. Clients attempt new connections; those connections also fail; clients eventually assume the host is down and potentially route traffic somewhere else

It's not great, and for primary-required workloads the "route somewhere else" case may not be possible without a new election (which may trigger the whole problem again).

With a deep listen backlog

  1. Many client applications open connections to a host at once
  2. mongod can't accept them fast enough
  3. Listen backlog doesn't fill
  4. Clients begin timing out sockets (connection timeout, socket timeout, overall time to connect/TLS/auth, etc.)
  5. Server spins up threads to handle these sockets and does some work before the client gives up (perhaps far too much work)
  6. Clients attempt new connections; those connections also fail; clients eventually assume the host is down and potentially route traffic somewhere else

The end result isn't that much different, but (for, say, a 64k backlog) we now have the potential to do 64k connections' worth of wasted work instead of 128.

Things I'd want to do before increasing the listen backlog:

  • Change the server to treat socket disconnection the same as killOp for a running op (we currently finish any work we start for an operation, only noticing the client has gone away on send()) - this is in flight
  • Change drivers to treat connection failure as a rate limiting operation (this would be part of request backpressure)
  • Potentially re-write the ingress client accept path to make it multi-threaded (as it is, if we're hitting the current listen backlog, our single-threaded acceptor is going to take multiple seconds to get around to your socket once the backlog is deeper) (not planned) - see the sketch after this list
  • Offer some kind of QoS on which clients we prioritize actually running operations for (a world where we allow more simultaneous connects will need hard queuing for running ops) (not planned)
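For the multi-threaded accept idea above, a minimal illustration of the general technique (purely a sketch, not MongoDB's planned design) is several threads blocking in accept() on the same listening socket; the kernel hands each established connection to exactly one waiting thread, so the accept queue can be drained in parallel:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define ACCEPT_THREADS 4

    static int listen_fd;          /* a socket already bound and listening */

    /* Each thread blocks in accept() on the shared listening socket; the
     * kernel wakes exactly one waiting thread per established connection. */
    static void *acceptor(void *unused) {
        (void)unused;
        for (;;) {
            int conn = accept(listen_fd, NULL, NULL);
            if (conn < 0) {
                perror("accept");
                continue;
            }
            /* hand the connection off to a worker pool here */
            close(conn);
        }
        return NULL;
    }

    void start_acceptors(void) {
        pthread_t tids[ACCEPT_THREADS];
        for (int i = 0; i < ACCEPT_THREADS; i++)
            pthread_create(&tids[i], NULL, acceptor, NULL);
    }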
Comment by Mira Carey [ 16/Nov/18 ]

I remain concerned that extending the listen backlog is a poor band-aid compared to what we actually need (an ability to prevent clients from opening too many connections, and from interpreting connection failure as a signal to open many new connections).

As such, I'm lodging this under the request backpressure scope, hoping that we can validate the use case it would solve via another channel.

Comment by Andrew Morrow (Inactive) [ 31/Oct/18 ]

mira.carey@mongodb.com - I'm handing this ticket over to you, as it seems connected to the information that ben.caimano was discussing about listener starvation.

Comment by Andrew Morrow (Inactive) [ 08/Mar/18 ]

It is not safe to raise the default above SOMAXCONN. Instead, we plan in SERVER-31400 to add improved network diagnostic data capture so that scenarios that would safely benefit from a listen backlog higher than SOMAXCONN can be more easily identified and addressed.

Comment by Bruce Lucas (Inactive) [ 26/Sep/17 ]

I tried various settings of net.ipv4.tcp_max_syn_backlog (500, the default 2048, 10000) in conjunction with a net.core.somaxconn of 10000, and did not find any difference in the behavior of the test application that creates a spike of 5000 connections. Since this is just the queue of connections awaiting completion of the 3-way handshake, I suspect the default setting of 2048 is ok.
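To make the distinction concrete: net.ipv4.tcp_max_syn_backlog bounds the half-open (SYN_RECV) queue, while the listen() backlog, capped by net.core.somaxconn, bounds the queue of fully established connections waiting for accept(). A small sketch that just reads both limits from /proc (paths as on a typical Linux system):

    #include <stdio.h>

    /* Read a single integer sysctl value from /proc; returns -1 on error. */
    static long read_sysctl(const char *path) {
        long v = -1;
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &v) != 1)
                v = -1;
            fclose(f);
        }
        return v;
    }

    int main(void) {
        /* caps the fully-established (accept) queue per listening socket */
        printf("net.core.somaxconn           = %ld\n",
               read_sysctl("/proc/sys/net/core/somaxconn"));
        /* bounds the half-open (SYN_RECV) queue */
        printf("net.ipv4.tcp_max_syn_backlog = %ld\n",
               read_sysctl("/proc/sys/net/ipv4/tcp_max_syn_backlog"));
        return 0;
    }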

Comment by Bruce Lucas (Inactive) [ 26/Sep/17 ]

I tested setting tcp_abort_on_overflow instead of increasing the backlog, but that results in "connection reset" errors being surfaced to the application, so I don't think that's a suitable substitute.
