[CSHARP-3885] [Unconfigurable] connection rate limiter in 2.13.x breaks existing applications Created: 30/Sep/21  Updated: 28/Oct/23  Resolved: 11/Jan/22

Status: Closed
Project: C# Driver
Component/s: Connectivity
Affects Version/s: None
Fix Version/s: 2.14.0

Type: Bug Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Boris Dogadov
Resolution: Fixed Votes: 3
Labels: connections, pooling
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2021-11-05-15-40-01-037.png    
Issue Links:
Depends
depends on DRIVERS-1943 Make maxConnecting configurable Implementing
Problem/Incident
is caused by CSHARP-3305 Rate limit new connection creations (... Closed

 Description   

Hello!

The resolution of CSHARP-3305 introduced a limit on simultaneous connection attempts. Unfortunately, this breaks some existing applications when upgrading to 2.13.x if they operate under considerable load with high concurrency, especially when using a co-hosted mongos instance.
I certainly understand that you would like to prevent overloading the server with a large number of connections; we even implemented a similar measure ourselves by introducing a delay in an IEventSubscriber handler for ConnectionPoolAddingConnectionEvent. Unfortunately, the apparent inability to disable the limit or configure MongoInternalDefaults.ConnectionPool.MaxConnecting blocks the upgrade to 2.13.x for us and quite possibly affects other users as well.
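For illustration, a minimal sketch of that kind of throttling via the ClusterConfigurator and event subscription API (not our exact code; the connection string and the 50 ms delay are placeholders):

    using System.Threading;
    using MongoDB.Driver;
    using MongoDB.Driver.Core.Events;

    var settings = MongoClientSettings.FromConnectionString("mongodb://localhost:27017");
    settings.ClusterConfigurator = cb =>
        // Pace connection creation by delaying whenever the pool is about to add a connection.
        cb.Subscribe<ConnectionPoolAddingConnectionEvent>(e => Thread.Sleep(50));
    var client = new MongoClient(settings);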

I would like to ask you to expose the MaxConnecting parameter so that driver users can tune it to their needs or disable the limit entirely.

Thank you,
Aristarkh Zagorodnikov



 Comments   
Comment by Boris Dogadov [ 11/Jan/22 ]

Fixed in CSHARP-3952
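For reference, a sketch of how the limit might be raised once this ships in 2.14.0, assuming CSHARP-3952 exposes a MaxConnecting property on MongoClientSettings (the property name and the value 8 are assumptions based on DRIVERS-1943, not confirmed API):

    using MongoDB.Driver;

    var settings = MongoClientSettings.FromConnectionString("mongodb://localhost:27017");
    settings.MaxConnecting = 8; // assumed setting from CSHARP-3952; the previous hard-coded default was 2
    var client = new MongoClient(settings);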

Comment by Boris Dogadov [ 11/Jan/22 ]

Thanks onyxmaster, closing.

Comment by Aristarkh Zagorodnikov [ 10/Jan/22 ]

Since CSHARP-3952 was implemented, maybe this one should be closed?

Comment by Gian Maria Ricci [ 08/Nov/21 ]

Thanks a lot, we reverted the driver and can confirm that with the latest 2.12.x version (2.12.4) the application comes back to full speed.
I'll wait for the resolution of this issue before upgrading to a newer version.

 

Thanks.

Comment by Boris Dogadov [ 08/Nov/21 ]

Hi alkampfer, thank you for your report.
We are working on a fix, please follow this ticket for further updates.

Comment by Gian Maria Ricci [ 05/Nov/21 ]

A large application I'm working on was severely impacted by upgrading from version 2.12 to 2.13.2 of the driver. We have critical paths where many threads update a checkpoint collection. At random, all of those threads became blocked, and our logs showed that a single document update took 10 seconds or more.
Debugging the driver, we found that all threads are blocked in the acquire method (see the picture above, where this situation lasts for almost 10 seconds). Changing the MongoInternalDefaults.ConnectionPool.MaxConnecting value from 2 to a value like 100 seems to solve the problem.

I think such an impactful setting should be left public so it can be changed by users of the driver; the risk of breaking existing code is enormous.
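A hedged diagnostic sketch of how the blocked checkouts can be observed from the application side, assuming the ConnectionPoolCheckingOutConnectionEvent, ConnectionPoolCheckedOutConnectionEvent, and ConnectionPoolCheckingOutConnectionFailedEvent types in MongoDB.Driver.Core.Events (event names and availability may vary between driver versions):

    using System;
    using System.Threading;
    using MongoDB.Driver;
    using MongoDB.Driver.Core.Events;

    var waiting = 0;
    var settings = MongoClientSettings.FromConnectionString("mongodb://localhost:27017");
    settings.ClusterConfigurator = cb =>
    {
        // Count operations currently waiting to check a connection out of the pool ("acquire").
        cb.Subscribe<ConnectionPoolCheckingOutConnectionEvent>(e =>
            Console.WriteLine($"waiting for connection: {Interlocked.Increment(ref waiting)}"));
        cb.Subscribe<ConnectionPoolCheckedOutConnectionEvent>(e =>
            Interlocked.Decrement(ref waiting));
        cb.Subscribe<ConnectionPoolCheckingOutConnectionFailedEvent>(e =>
            Interlocked.Decrement(ref waiting));
    };
    var client = new MongoClient(settings);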

Comment by Boris Dogadov [ 01/Oct/21 ]

Thanks onyxmaster
We are looking into this issue, please follow this ticket for further updates.

Comment by Aristarkh Zagorodnikov [ 01/Oct/21 ]

Sure.
The service is a GridFS HTTP proxy that acts as the upstream for a co-hosted nginx, and it connects to a local mongos (4.4.9), which in turn connects to the sharded cluster on the same network. The mongos instance has the taskExecutorPoolSize parameter set to 0 to prevent reconnection issues. The load is about 300 requests per second, but since they serve objects from GridFS, the request execution time is often considerable (hundreds of milliseconds, sometimes seconds). When we restart the service in question, there is, of course, a burst of reconnections.
The service and mongos run on Ubuntu 20.04 with a 5.4 kernel on physical hardware, which has 2xE5-2630v2 CPUs and 128GB of RAM.
The connection settings are all the default ones, except the minimum pool size set to 100 and the maximum set to 1000.
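For reference, a sketch of that pool configuration via the connection string (the host is a placeholder for the local mongos):

    using MongoDB.Driver;

    var client = new MongoClient("mongodb://localhost:27017/?minPoolSize=100&maxPoolSize=1000");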

Before 2.13 (in fact, I manually bisected the 2.12->2.13 history and arrived at commit 59a1268d8c1fa905820d7789dd2f86b350dc7648, which is how I found out about CSHARP-3305), reconnection worked reliably (we used 2.12.4). After upgrading to 2.13.1, restarting the service leads to a bunch of timeouts. Since the pressure doesn't lift, because requests keep arriving, the service gets throttled by the Kestrel (ASP.NET Core HTTP server) connection limiter. It often can't recover from the stall, perhaps due to the many scheduled tasks and the high GC load from requests piling up. In some cases it also leads to OOM exceptions, since we have COMPlus_GCHeapHardLimit=0x40000000 set. Raising that limit is not a problem, but before 2.13 we didn't have issues like this, so I'm reluctant to "fix" the problem by just giving the service more resources.

Comment by Boris Dogadov [ 30/Sep/21 ]

Thanks onyxmaster for your question.
Could you please provide more details about the scenario that is affected by the MaxConnecting setting, as well as your other pool settings and high-concurrency characteristics, so we can better understand the need?

Thanks.

Comment by Aristarkh Zagorodnikov [ 30/Sep/21 ]

I do not think this should be a connection-string-level parameter, since that would practically invite people to set it incorrectly, but perhaps it could be added to MongoDefaults as a settable property?
