[SERVER-31251] Increase backlog specified in listen() Created: 25/Sep/17 Updated: 08/Jan/24 |
|
| Status: | Open |
| Project: | Core Server |
| Component/s: | Networking |
| Affects Version/s: | None |
| Fix Version/s: | features we're not sure of |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Backlog - Service Architecture |
| Resolution: | Unresolved | Votes: | 4 |
| Labels: | features-not-sure-of | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Service Arch |
| Sprint: | Service Arch 2018-11-05, Service Arch 2018-11-19, Service Arch 2018-12-03 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
A small value for the backlog parameter to listen() can result in very high latencies as seen by the client when an increase in application load or a decrease in db performance requires the client to create a large number of connections nearly simultaneously (a "connections spike"). When a connection spike occurs, mongod may not be able to accept() the connections as quickly as they are created. When the number of connections that have been created but not yet accepted by mongod exceeds the backlog, the SYN packets sent to establish further connections are ignored. The affected clients retry their SYN packets 1 second later, creating another, smaller spike. If this spike also exceeds the backlog, more SYN packets are ignored and another spike of retries follows 2 seconds later, and so on, with the clients doubling the backoff on each attempt. Because the backoff is exponential, some unlucky connections can wait a very long time - tens of seconds - resulting in extreme operation latencies as seen by the client. This can be seen in the following two runs of the same test client, which starts 5000 threads to create a spike of 5000 connections. The first run uses a backlog setting of 10000; the second uses the default backlog of 128.
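The doubling described above compounds quickly. A small sketch of the arithmetic (assuming Linux's default SYN retransmission schedule: an initial ~1 s timeout that doubles on each retry, with net.ipv4.tcp_syn_retries = 6):

```python
# Cumulative time a client waits when every SYN (re)transmission is dropped,
# assuming an initial 1 s retransmission timeout that doubles on each retry
# (the Linux default of net.ipv4.tcp_syn_retries = 6 gives 6 retries).

def syn_wait_times(retries=6, initial_timeout=1):
    """Return the cumulative wait (in seconds) before each retry fires."""
    waits, total, timeout = [], 0, initial_timeout
    for _ in range(retries):
        total += timeout
        waits.append(total)
        timeout *= 2
    return waits

print(syn_wait_times())  # [1, 3, 7, 15, 31, 63]
```

A connection that only gets through on the last retry has waited ~63 s, which lines up with the ~64 s "slowest query durations" observed in the test runs.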
At the default backlog setting (the second run) the network stats report a large number of "listen drops", and the result is the series of connection spikes of decreasing size and the extreme client-side latencies of ~64 s ("slowest query durations") seen here. With the larger backlog setting (the first run) there is no exponential backoff: connections are queued by the kernel for mongod to accept at whatever rate it can, eliminating the extreme latency outliers.

3.6 introduces a parameter --listenBacklog to allow setting the backlog. However there are two problems with this: the user must know to set the parameter, and the value passed to listen() is silently capped by the kernel setting net.core.somaxconn (which defaults to 128), so that kernel parameter must be raised as well.
Rather than requiring the user to change two parameters, it would seem more straightforward for mongod to specify a large value in listen() by default, allowing the customer to control the backlog just by changing net.core.somaxconn. |
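For illustration, a minimal sketch of the behavior being discussed: a server socket that requests a large accept backlog, which on Linux is silently capped at net.core.somaxconn regardless of the value passed (port and addresses here are arbitrary):

```python
import socket

# Minimal sketch: request a large accept backlog. On Linux the kernel
# silently caps the effective backlog at net.core.somaxconn, so passing
# a large value here only helps if somaxconn has also been raised
# (e.g. via `sysctl -w net.core.somaxconn=10000`).
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 0))   # bind to an ephemeral port
srv.listen(10000)            # requested backlog; the kernel may cap it
port = srv.getsockname()[1]
print("listening on port", port)
srv.close()
```

This is why the suggestion above is for mongod to pass a large value unconditionally: the effective backlog then becomes min(large value, somaxconn), and the operator controls it with a single kernel setting.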
| Comments |
| Comment by Bruce Lucas (Inactive) [ 28/Nov/18 ] |
|
The observed behavior for the status quo was different from that: the kernel simply drops the SYN packets when the backlog fills up, causing the client (in its kernel TCP stack) to go into an exponential backoff. The result is that some of the connections take a very long time to establish, but clients do not see ECONNREFUSED. An application may time out the connection attempt and retry before the connection is eventually established, but I think that just puts us at step 4 in your second scenario, so I'm not sure there's a difference between the scenarios if establishing the connections is slow enough that apps time out on establishing connections. But in the case where mongod is able to accept the connections before the app begins to time out the connection attempts, having a large backlog allows that to happen without the extreme latencies due to the exponential backoff. |
| Comment by Mira Carey [ 28/Nov/18 ] |
|
luke.prochazka - there are more issues around connection establishment than just the listen backlog. Changing that without addressing those other areas could easily make connection storms worse, rather than better. Consider two scenarios, one with a small listen backlog, one with a large one.

Status quo:

It's not great, and for primary-required workloads the "route somewhere else" case may not be possible without a new election (which may trigger the whole problem again).

With a deep listen backlog:

The end result isn't that much different, but (for, say, a 64k backlog) we now have the potential to do 64k worth of wasted work, instead of 128. Things I'd want to do before increasing the listen backlog:
|
| Comment by Mira Carey [ 16/Nov/18 ] |
|
I remain concerned that extending the size of the listen backlog is a poor band-aid for what we actually need: the ability to prevent clients from opening too many connections, and from interpreting connection failure as a signal to open many new ones. As such, I'm lodging this under the request-backpressure scope, hoping that we can validate the use case it would solve via another channel |
| Comment by Andrew Morrow (Inactive) [ 31/Oct/18 ] |
|
mira.carey@mongodb.com - I'm handing this ticket over to you, as it seems connected to the information that ben.caimano was discussing about listener starvation. |
| Comment by Andrew Morrow (Inactive) [ 08/Mar/18 ] |
|
It is not safe to raise the default above SOMAXCONN. Instead, we plan in |
| Comment by Bruce Lucas (Inactive) [ 26/Sep/17 ] |
|
I tried various settings of net.ipv4.tcp_max_syn_backlog (500, the default 2048, 10000) in conjunction with a net.core.somaxconn of 10000, and did not find any difference in the behavior of the test application that creates a spike of 5000 connections. Since this is just the queue of connections awaiting completion of the 3-way handshake, I suspect the default setting of 2048 is ok. |
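The two kernel queues being compared here can be inspected from /proc (a sketch; the paths are the standard Linux sysctl files, and the function returns None for any that don't exist on the host):

```python
from pathlib import Path

# The two Linux sysctls discussed above:
#  - net.core.somaxconn caps the accept queue (connections that have
#    completed the 3-way handshake and are awaiting accept()),
#  - net.ipv4.tcp_max_syn_backlog bounds the half-open queue (SYN
#    received, handshake not yet complete).
SYSCTLS = {
    "net.core.somaxconn": "/proc/sys/net/core/somaxconn",
    "net.ipv4.tcp_max_syn_backlog": "/proc/sys/net/ipv4/tcp_max_syn_backlog",
}

def read_backlog_sysctls():
    """Return {sysctl name: int value, or None if the file is absent}."""
    out = {}
    for name, path in SYSCTLS.items():
        p = Path(path)
        out[name] = int(p.read_text()) if p.exists() else None
    return out

print(read_backlog_sysctls())
```

This matches the experiment above: the connection spike is limited by the accept queue (somaxconn), not the half-open queue, so raising tcp_max_syn_backlog alone changes nothing.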
| Comment by Bruce Lucas (Inactive) [ 26/Sep/17 ] |
|
I tested setting tcp_abort_on_overflow instead of increasing the backlog, but that results in "connection reset" errors being surfaced to the application, so I don't think that's a suitable substitute. |