[CSHARP-4375] Connection pruner is too aggressive Created: 21/Oct/22 Updated: 27/Oct/23 Resolved: 16/Nov/22 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Hans Olav Loftum | Assignee: | Boris Dogadov |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Case: | (copied to CRM) |
| Description |
|
We recently updated from MongoDB server 4.2 / MongoDB driver 2.12.5 to MongoDB server 5.0 / MongoDB driver 2.17.0, after which we have observed a rise in timeouts from our applications and a sawtooth shape in our open-connections graph. (We use an EventSubscriber to keep track of open connections.) After some investigation, we see a significant change in how connections are maintained in the ExclusiveConnectionPool: v2.12.5 would prune one connection at a time (note the break statement in the loop), whereas newer versions will quite aggressively remove all expired connections at once.
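The difference can be sketched roughly like this. This is an illustrative simplification, not the actual driver source; `PooledConnection` and `IsExpired` stand in for the driver's internal types:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the driver's internal pooled connection type.
record PooledConnection(bool IsExpired);

static class PruningSketch
{
    // v2.12.5 behavior: remove at most ONE expired connection per
    // maintenance pass (the "break in the loop" mentioned above).
    public static void PruneOneAtATime(List<PooledConnection> pool)
    {
        var expired = pool.FirstOrDefault(c => c.IsExpired);
        if (expired != null)
            pool.Remove(expired); // one per pass, then stop
    }

    // v2.17.0 behavior: remove EVERY expired connection in a single pass,
    // which can empty the pool during a quiet period.
    public static void PruneAllAtOnce(List<PooledConnection> pool)
    {
        pool.RemoveAll(c => c.IsExpired);
    }
}
```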
This is a problem for our applications, which require many open connections (100) at all times. (Our applications handle high spikes of traffic, with quiet periods in between.) During quiet periods, MaintenanceHelper (v2.17.0) will periodically remove all existing connections and then spend some time creating new ones. If a traffic spike hits our application at that moment, the application cannot serve incoming requests and we get lots of timeouts.
v2.12.5 of the driver would remove one connection, open one connection, remove one, open one, and so on, which meant the application always had open connections available. v2.17.0 removes all connections at once, which renders our application useless for the next 15-30 seconds.
We rely on MaintenanceHelper maintaining connections "the old way"; otherwise we cannot use the driver. |
| Comments |
| Comment by PM Bot [ 16/Nov/22 ] |
|
There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information. |
| Comment by Boris Dogadov [ 31/Oct/22 ] |
|
Yes, I would suggest experimenting with values of 10, 20, and 100 (or even higher if needed). If that does not work, it would be helpful to get a repro simulating the load and validating the expected response time. |
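The comment does not say which setting these values apply to; assuming they refer to the minimum pool size (consistent with the reporter's need for ~100 warm connections), a minimal sketch of raising it would be:

```csharp
using MongoDB.Driver;

// Assumption: the suggested values (10, 20, 100) refer to the minimum
// connection pool size. Host and value here are placeholders.
var settings = MongoClientSettings.FromConnectionString(
    "mongodb://localhost:27017/?minPoolSize=100");

// Equivalently, set it directly on the settings object:
settings.MinConnectionPoolSize = 100; // keep at least 100 connections warm

var client = new MongoClient(settings);
```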
| Comment by Boris Dogadov [ 24/Oct/22 ] |
|
Hello andreas.knudsen@nrk.no and hans.olav.loftum@nrk.no, thank you for identifying the specific driver version range. If this does not help, it would be great to have a repro demonstrating the performance degradation in a specific scenario, if possible. As you have mentioned, the provided test application outputs the connections, but it does not replicate the traffic load issue. Some notes: |
| Comment by Andreas Knudsen [ 24/Oct/22 ] |
|
I have just run a few load tests with a modified ExclusiveConnectionPool.Helpers.cs that would start pruning connections while they still had half their lifetime/idle time left, and would never prune more than 25% at a time. I'm sad to say that it didn't have any noticeable effect on the system under heavy load. I ran the tests multiple times, both with and without event subscription, but I could not notice any difference between those configurations. I did, however, notice that the CPU spiked quite a bit on the DB server and the connection count went down (observed with the Real Time metrics in Atlas). Can a saturated server be the answer to why the connection count is plummeting? I might try to downgrade the DB to MongoDB 4.2 tomorrow and run a new barrage of tests against it to see if we get the same CPU spikes on 4.2
|
| Comment by Hans Olav Loftum [ 24/Oct/22 ] |
|
@Dmitry Lukyanov One of the criteria Andreas is referring to is this: a BinaryConnection is "just as" expired whether it has been idle for too long or has exceeded its max lifetime. Imagine that an IConnection could become "old" after idling for some time - say MaxIdleTime * 0.75. Then it would be a candidate for pruning, even though it is not yet expired. That way, pruning/replacing it would start before the connection becomes useless. |
| Comment by Andreas Knudsen [ 24/Oct/22 ] |
|
We have reproduced the problem in 2.13.3 and 2.15, so it was introduced between 2.12.5 and 2.13.3.
| Comment by Andreas Knudsen [ 24/Oct/22 ] |
|
I agree that IF a connection has become unusable THEN it should be removed as fast as possible. My point is to introduce a new, even stricter criterion used only for pruning. With the current setup there is no way to preemptively stock up on new connections until AFTER the existing ones have already become unusable. If this were a refrigerator, would you not start to buy new milk BEFORE the expiry date? Otherwise there would, guaranteed, be a morning with no milk for the cereal. -A
|
| Comment by Dmitry Lukyanov (Inactive) [ 24/Oct/22 ] |
|
Thanks hans.olav.loftum@nrk.no, we will look at your repro.
Yes, a timeout exception is well known, but the exception type by itself tells us little; we need the servers' statuses, including heartbeat exceptions. So we need the full timeout exception descriptions (i.e. ones triggered not because of the CancellationToken).
This looks unexpected; we will try to reproduce this behavior from your repro. Hey andreas.knudsen@nrk.no,
if a connection is expired (i.e. perished), no one can use it, so at this point it looks unlikely that removing them all at once makes any difference.
A connection can become expired in different ways (for example, it becomes expired if the current server is unhealthy), so I don't think that introducing a new criterion can help with your issue. For now, we will use your repro to try to reproduce your issue, and we'll get back to you. Some initial suspicions that you can check to speed up the investigation: 1. It might be related to how you use EventsLogger. It would be good if you could validate this issue without the event subscriber configured. |
| Comment by Andreas Knudsen [ 24/Oct/22 ] |
|
Hi there. I believe I know what is happening here: there are two ways expired connections (both max idle time and max lifetime) are removed from the connection pool:
Unfortunately, both of these mechanisms use the same criterion to determine whether a connection should be discarded. There are two problems with this setup:
I propose the following 2 changes: 1) Do not use the same criterion for pruning as for discarding connections during Acquire
2) Do not prune every connection that can be pruned in one go (to avoid ending up with very few connections available for a sudden burst of traffic)
I will submit a PR with these two changes (both done the "simple" way, as I'm not sure the extra flexibility adds much value). -Andreas |
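The two proposed changes could be sketched roughly like this. Everything here is hypothetical, not the driver's API: `PooledConnection`, `IsAboutToExpire`, the 0.75 threshold (taken from the MaxIdleTime * 0.75 idea discussed in this thread), and the 25% cap (from the load-test experiment described above) are all illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a pooled connection tracking its idle time.
record PooledConnection(TimeSpan IdleTime);

static class ProposedPruning
{
    // Change 1: a STRICTER criterion for pruning than the one used during
    // Acquire, so replacement starts before a connection becomes unusable.
    static bool IsAboutToExpire(PooledConnection c, TimeSpan maxIdleTime) =>
        c.IdleTime >= TimeSpan.FromTicks((long)(maxIdleTime.Ticks * 0.75));

    // Change 2: never prune more than a fraction of the pool per pass,
    // so a sudden traffic burst still finds open connections.
    public static void Prune(List<PooledConnection> pool, TimeSpan maxIdleTime)
    {
        var candidates = pool
            .Where(c => IsAboutToExpire(c, maxIdleTime))
            .ToList();

        int limit = Math.Max(1, pool.Count / 4); // cap at ~25% per pass
        foreach (var c in candidates.Take(limit))
            pool.Remove(c);
    }
}
```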
| Comment by Hans Olav Loftum [ 24/Oct/22 ] |
|
And yes: Our application uses MongoClient as a single instance. |
| Comment by Hans Olav Loftum [ 24/Oct/22 ] |
|
Here are a couple of the stack traces we got. Our application uses an internal timeout, which will cancel after a certain amount of time. If we don't do that, all incoming requests will spend 15 seconds trying to connect to mongo with the well-known Therefore, all exceptions have the message "The operation was canceled". (Note: our application has run like this with MongoDB 4.2 and mongo driver 2.12.5 for a long time, without big connection issues.) 1. The operation was canceled. 2. |
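An "internal timeout" like the one described is typically implemented by passing a short-lived cancellation token into the query, so the call is canceled before the driver's 15-second server-selection timeout is reached. A minimal sketch, assuming a 2-second budget (the actual value used by the reporter's application is not stated):

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

static class TimeoutSketch
{
    // Cancel the query after 2 seconds instead of waiting out the driver's
    // server-selection timeout. When the token fires, the caller observes
    // an exception with the message "The operation was canceled."
    public static async Task<List<BsonDocument>> FindWithInternalTimeoutAsync(
        IMongoCollection<BsonDocument> collection)
    {
        using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
        return await collection
            .Find(FilterDefinition<BsonDocument>.Empty)
            .ToListAsync(cts.Token);
    }
}
```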
| Comment by Hans Olav Loftum [ 24/Oct/22 ] |
|
Test application here: It is a console app that just idles and prints the number of connections to the console. |
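The linked application is not reproduced in this ticket; a minimal sketch of such a connection-counting console app, using the driver's cluster event subscription (connection string and print interval are placeholders):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using MongoDB.Driver;
using MongoDB.Driver.Core.Events;

// Count open connections by subscribing to pool events, then idle and
// print the running total periodically.
int openConnections = 0;

var settings = MongoClientSettings.FromConnectionString("mongodb://localhost:27017");
settings.ClusterConfigurator = builder =>
{
    builder.Subscribe<ConnectionOpenedEvent>(
        _ => Interlocked.Increment(ref openConnections));
    builder.Subscribe<ConnectionClosedEvent>(
        _ => Interlocked.Decrement(ref openConnections));
};
var client = new MongoClient(settings);

while (true)
{
    Console.WriteLine($"Open connections: {Volatile.Read(ref openConnections)}");
    await Task.Delay(TimeSpan.FromSeconds(5));
}
```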
| Comment by Dmitry Lukyanov (Inactive) [ 21/Oct/22 ] |
|
Hey hans.olav.loftum@nrk.no, thank you for your report. We're investigating your case. Meanwhile, can you provide/try the following:
Thanks in advance. |