[CSHARP-4375] Connection pruner is too aggressive Created: 21/Oct/22  Updated: 27/Oct/23  Resolved: 16/Nov/22

Status: Closed
Project: C# Driver
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Hans Olav Loftum Assignee: Boris Dogadov
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

 Description   

We recently updated from MongoDB server 4.2 / MongoDB driver 2.12.5 to MongoDB server 5.0 / MongoDB driver 2.17.0, after which we have observed a rise in timeouts from our applications and a sawtooth shape in our open-connections graph. (We use an EventSubscriber to keep track of open connections.)

After some investigation, we see that there is a significant change in how connections are maintained in the ExclusiveConnectionPool.

v2.12.5 would prune 1 connection at a time (note the break statement in the loop):
https://github.com/mongodb/mongo-csharp-driver/blob/v2.12.5/src/MongoDB.Driver.Core/Core/ConnectionPools/ExclusiveConnectionPool.cs#L818

whilst newer versions will quite aggressively remove all expired connections:

https://github.com/mongodb/mongo-csharp-driver/blob/v2.18.0/src/MongoDB.Driver.Core/Core/ConnectionPools/ExclusiveConnectionPool.Helpers.cs#L707
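
For clarity, a simplified side-by-side sketch of the two behaviors (illustrative only; the linked driver sources above are authoritative, and PooledConnection here is a stand-in for the pool's internal connection type):

using System.Collections.Generic;
using System.Linq;

// Stand-in for the pool's internal connection type.
record PooledConnection(bool IsExpired);

static class PruneSketch
{
    // v2.12.5: remove at most one expired connection per maintenance pass.
    public static void PruneOneAtATime(List<PooledConnection> connections)
    {
        foreach (var connection in connections.Where(c => c.IsExpired).ToList())
        {
            connections.Remove(connection);
            break; // at most one per pass; the pool refills before the next pass
        }
    }

    // v2.17.0+: remove every expired connection in a single pass.
    public static void PruneAllAtOnce(List<PooledConnection> connections)
    {
        connections.RemoveAll(c => c.IsExpired);
    }
}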

 

This is a problem for our applications, which require many open connections (100) at all times. (Our applications handle high spikes of traffic, with quiet periods in between.)

During quiet periods, MaintenanceHelper (v2.17.0) will periodically remove all existing connections and then spend some time creating new ones. If a traffic spike hits our application during that window, it cannot serve the incoming requests, and we get lots of timeouts.

 

v2.12.5 of the driver would remove one connection, open one connection, remove one, open one etc, which results in the application always having available open connections.

v2.17.0 removes all connections at once, which renders our application useless for the next 15-30 seconds.

 

We rely on MaintenanceHelper maintaining connections "the old way", or we cannot use the driver.



 Comments   
Comment by PM Bot [ 16/Nov/22 ]

There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information.

Comment by Boris Dogadov [ 31/Oct/22 ]

Hi jean.coulon@mongodb.com 

Yes, I would suggest experimenting with values of 10, 20, 100 (or even higher if needed).
This parameter limits the number of simultaneously created connections per client, to limit the load on the server.
Therefore the optimal value depends on the total number of clients and the number of connections the server can establish simultaneously.

If that does not work, it would be helpful to get a repro simulating the load and validating the expected response time.
For example, a repro that initiates 1000 simultaneous requests that take 1 minute on the older version and time out on the new version.
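
For reference, a minimal sketch of setting this parameter, assuming a driver version that exposes MaxConnecting on MongoClientSettings (it can also be set via the maxConnecting connection string option; the connection string below is illustrative):

using MongoDB.Driver;

// Assumes a driver version that exposes MongoClientSettings.MaxConnecting.
// The value is just one of the suggested starting points to experiment with.
var settings = MongoClientSettings.FromConnectionString("mongodb://localhost:27017");
settings.MaxConnecting = 20; // default is 2; try 10, 20, 100, or higher if needed
var client = new MongoClient(settings);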

Comment by Boris Dogadov [ 24/Oct/22 ]

Hello andreas.knudsen@nrk.no and hans.olav.loftum@nrk.no 

Thank you for identifying the specific driver version range.
One reason for such a behavior difference might be the introduction of the maxConnecting functionality. Its default value of 2 resulted in a behavior change under load for some applications, as here.
I would recommend fine-tuning this parameter for your application's needs.

If this does not help, it would be great to have a repro demonstrating the performance degradation in a specific scenario, if possible. As you have mentioned, the provided test application outputs the connection counts, but it does not replicate the traffic load issue.

Some notes:
The logic of connection pool populating and pruning is shared among all the drivers and is described here.
We can evaluate optimizations in this area, but any improvement must be weighed against multiple scenarios across the various drivers.

Comment by Andreas Knudsen [ 24/Oct/22 ]

I have just run a few load tests with a modified ExclusiveConnectionPool.Helpers.cs that would start pruning connections while they still had half their lifetime/idle time left and would never prune more than 25% at a time.

I'm sad to say that it didn't have any noticeable effect on the system under heavy load.

I ran the tests multiple times, both with and without event subscription, but I could not notice any difference between those configurations.

I did, however, notice that the CPU spiked quite a bit on the DB server and the connection count went down (observed with the Real Time metrics in Atlas).

Can a saturated server be the answer to why the connection count is plummeting? 

I might try to downgrade the DB to MongoDB 4.2 tomorrow and run a new barrage of tests against it, to see if we get the same CPU spikes on 4.2.

 

Comment by Hans Olav Loftum [ 24/Oct/22 ]

@Dmitry Lukyanov

One of the criteria Andreas is referring to is this:
https://github.com/nrkno/mongo-csharp-driver/blob/master/src/MongoDB.Driver.Core/Core/Connections/BinaryConnection.cs#L113

A BinaryConnection is "just as" expired whether it has been idle for too long or has exceeded its max lifetime.
But there is no way to tell that "this connection is getting old and should be pensioned off".

Imagine that an IConnection could become "old" after idling for some time - say MaxIdleTime * 0.75. Then it would be a candidate for pruning, even though it is not yet expired. That way, pruning/replacing it would start before the connection becomes useless. (A sketch follows below.)
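
A minimal sketch of what such an early "getting old" check could look like (hypothetical: IsOld, the parameter names, and the 0.75 factor are illustrative assumptions, not driver API):

using System;

static class ConnectionAging
{
    const double AgingFactor = 0.75; // illustrative assumption

    // Existing notion: expired once either limit is exceeded (unusable).
    public static bool IsExpired(DateTime openedAtUtc, DateTime lastUsedAtUtc,
        TimeSpan maxLifeTime, TimeSpan maxIdleTime, DateTime nowUtc) =>
        nowUtc - openedAtUtc > maxLifeTime || nowUtc - lastUsedAtUtc > maxIdleTime;

    // Proposed notion: "old" triggers at 75% of either limit, so the pool
    // could prune and replace the connection while it is still usable.
    public static bool IsOld(DateTime openedAtUtc, DateTime lastUsedAtUtc,
        TimeSpan maxLifeTime, TimeSpan maxIdleTime, DateTime nowUtc) =>
        nowUtc - openedAtUtc > TimeSpan.FromTicks((long)(maxLifeTime.Ticks * AgingFactor)) ||
        nowUtc - lastUsedAtUtc > TimeSpan.FromTicks((long)(maxIdleTime.Ticks * AgingFactor));
}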

Comment by Andreas Knudsen [ 24/Oct/22 ]

We have reproduced the problems in 2.13.3 and 2.15, so the change was introduced between 2.12.5 and 2.13.3.

Comment by Andreas Knudsen [ 24/Oct/22 ]

I agree that IF a connection has become unusable THEN it should be removed as fast as possible. My point is to introduce a new, stricter criterion that is used only for pruning.

With the current setup there is no way to preemptively stock up on new connections; replacement starts only AFTER the old ones have already become unusable.

If this were a refrigerator, would you not buy new milk BEFORE the expiry date? Otherwise there would be guaranteed to be a morning with no milk for the cereal.

-A

 

Comment by Dmitry Lukyanov (Inactive) [ 24/Oct/22 ]

Thanks hans.olav.loftum@nrk.no , we will look at your repro.

to connect to mongo with the well-known

Yes, a timeout exception is well known, but the exception type itself says little; we need the server statuses, including any heartbeat exceptions. So we need the full descriptions of the timeout exceptions (i.e. those triggered not because of a CancellationToken).

and requests will queue up, leading to high memory, CPU usage and thread exhaustion.

This looks unexpected; we will try to get this behavior from your repro.

Hey andreas.knudsen@nrk.no ,

to avoid ending up with very few connections available for a sudden burst of traffic

If a connection is expired (i.e. perished), no one can use it, so at this point it looks unlikely that removing them all at once can make any difference.

Do not use the same criterion for pruning as for discarding connections during Aquire

A connection can become expired in different ways - for example, it becomes expired if the current server is unhealthy - so I don't think that introducing a new criterion can help with your issue.

For now, we will look at your repro to reproduce your issue and we'll get back to you.

Some initial suspicions that you can check to speed up investigating:

1. It might be related to how you use the EventsLogger. It would be good if you could validate this issue without the event subscriber configured.
2. It might be related to pool changes introduced in the 2.16 release, so it would be good if you could check your application behavior with a pre-2.16 driver, such as 2.15.1.

Comment by Andreas Knudsen [ 24/Oct/22 ]

Hi there. 

I believe I know what is happening here: 

There are 2 ways expired (both maxIdleTime and maxLifeTime) connections are removed from the connection pool:

  • ExclusiveConnectionPool.ListConnectionHolder.Prune(..)
    • Prunes 100% of expired connections every MaintenanceInterval
    • After it is done pruning, it rebuilds connections back up to MinConnections
  • ExclusiveConnectionPool.ListConnectionHolder.Acquire(..)
    • Prunes all expired connections it discovers while looking for a connection to acquire

Unfortunately both of these mechanisms use the same criterion to determine whether a connection should be discarded.

There are two problems with this setup:

  • Under heavy load, pruning is liable to sometimes take the connection count all the way down to 0 (since it is possible that all connections will have expired at some point during the maintenance interval)
    • More likely if the maintenance interval is long and the maxLifeTime is short
  • The pruning process risks having nothing to prune, because by the time it starts, the Acquire process might already have removed all the pruning candidates preemptively
    • More likely if the maintenance interval is long
    • If this happens, the connection count may also very well go to 0 for a short while, since the Acquire process does nothing special to prevent it.

I propose the following 2 changes (see the sketch after this list):

1) Do not use the same criterion for pruning as for discarding connections during Acquire

  • A simple way to do this is to say that connections become available for pruning after maxIdleTime/2 and maxLifeTime/2
  • A more flexible way is to introduce a separate config value for this

2) Do not prune every prunable connection in one go (to avoid ending up with very few connections available for a sudden burst of traffic)

  • A simple way to do this is to prune at most Math.Max(1, _settings.MinConnections / 4) at a time before refilling connections
    • This ensures that at least 75% of the minimum capacity is always available
  • A more flexible way is to introduce a separate config value for this
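
A rough sketch of change 2 inside a maintenance pass (a stand-in pool, not the driver's actual internals; PooledConnection, _connections, and _minConnections are illustrative assumptions):

using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in connection type, as in the earlier sketch.
record PooledConnection(bool IsExpired);

class PoolMaintenanceSketch
{
    private readonly List<PooledConnection> _connections = new();
    private readonly int _minConnections = 100; // stands in for _settings.MinConnections

    private void Prune()
    {
        // Never remove more than 25% of MinConnections per maintenance pass,
        // so at least ~75% of the minimum capacity stays available for a burst.
        var maxToPrune = Math.Max(1, _minConnections / 4);
        var pruned = 0;

        foreach (var connection in _connections.Where(c => c.IsExpired).ToList())
        {
            if (pruned >= maxToPrune)
            {
                break; // leave the rest for the next maintenance interval
            }
            _connections.Remove(connection);
            pruned++;
        }
        // ...the pool is then refilled back up to MinConnections, as it already does.
    }
}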

 

I will submit a PR with these two changes made (both done the "simple" way, as I'm not sure the extra flexibility adds much value).

-Andreas

Comment by Hans Olav Loftum [ 24/Oct/22 ]

And yes: Our application uses MongoClient as a single instance.

Comment by Hans Olav Loftum [ 24/Oct/22 ]

Here are a couple of the stacktraces we got.

Our application uses an internal timeout, which will cancel after a certain amount of time. If we don't do that, all incoming requests will spend 15 seconds trying to connect to mongo with the well-known
A timeout occured after 15000ms selecting a server using CompositeServerSelector
and requests will queue up, leading to high memory, CPU usage and thread exhaustion.

Therefore, all exceptions have the message "The operation was canceled".

(Note: Our application has run like this with mongodb 4.2 and mongo driver 2.12.5 for a long time, without big connection issues.)
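
For context, a minimal sketch of the kind of internal timeout described above (illustrative: the 3-second value, connection string, and collection names are assumptions):

using System;
using System.Threading;
using MongoDB.Bson;
using MongoDB.Driver;

// Cancelling this token before the driver's 15s server-selection wait
// completes is what produces "The operation was canceled".
var client = new MongoClient("mongodb://localhost:27017");
var collection = client.GetDatabase("db").GetCollection<BsonDocument>("coll");

using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(3));
var docs = await collection.Find(FilterDefinition<BsonDocument>.Empty)
    .ToListAsync(cts.Token);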

1.

The operation was canceled.
at System.Threading.CancellationToken.ThrowOperationCanceledException()
   at MongoDB.Driver.Core.Clusters.Cluster.WaitForDescriptionChangedHelper.HandleCompletedTask(Task completedTask)
   at MongoDB.Driver.Core.Clusters.Cluster.WaitForDescriptionChangedAsync(IServerSelector selector, ClusterDescription description, Task descriptionChangedTask, TimeSpan timeout, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Clusters.Cluster.SelectServerAsync(IServerSelector selector, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Clusters.IClusterExtensions.SelectServerAndPinIfNeededAsync(ICluster cluster, ICoreSessionHandle session, IServerSelector selector, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Bindings.ReadPreferenceBinding.GetReadChannelSourceAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.RetryableReadContext.InitializeAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.RetryableReadContext.CreateAsync(IReadBinding binding, Boolean retryRequested, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.FindOperation`1.ExecuteAsync(IReadBinding binding, CancellationToken cancellationToken)
   at MongoDB.Driver.OperationExecutor.ExecuteReadOperationAsync[TResult](IReadBinding binding, IReadOperation`1 operation, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.ExecuteReadOperationAsync[TResult](IClientSessionHandle session, IReadOperation`1 operation, ReadPreference readPreference, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.UsingImplicitSessionAsync[TResult](Func`2 funcAsync, CancellationToken cancellationToken)
   at Our code

2.
The operation was canceled.
at System.Threading.CancellationToken.ThrowOperationCanceledException()
   at MongoDB.Driver.Core.Misc.SemaphoreSlimSignalable.IsSignaled(CancellationTokenSource signalTokenSource, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Misc.SemaphoreSlimSignalable.WaitSignaledAsync(TimeSpan timeout, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.ConnectionCreator.CreateOpenedOrReuseAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.AcquireConnectionHelper.AcquireConnectionAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.AcquireConnectionAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Servers.Server.GetChannelAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.RetryableReadContext.InitializeAsync(CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.RetryableReadContext.CreateAsync(IReadBinding binding, Boolean retryRequested, CancellationToken cancellationToken)
   at MongoDB.Driver.Core.Operations.FindOperation`1.ExecuteAsync(IReadBinding binding, CancellationToken cancellationToken)
   at MongoDB.Driver.OperationExecutor.ExecuteReadOperationAsync[TResult](IReadBinding binding, IReadOperation`1 operation, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.ExecuteReadOperationAsync[TResult](IClientSessionHandle session, IReadOperation`1 operation, ReadPreference readPreference, CancellationToken cancellationToken)
   at MongoDB.Driver.MongoCollectionImpl`1.UsingImplicitSessionAsync[TResult](Func`2 funcAsync, CancellationToken cancellationToken)
   at Our code

Comment by Hans Olav Loftum [ 24/Oct/22 ]

Test application here:
https://github.com/nrkno/mongo-connection-test

It is a console app that just idles and prints the number of connections to the console.
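
For reference, a sketch of the connection-counting approach such a test app can use, subscribing to the driver's connection pool events (the linked repository is the authoritative version; the connection string is illustrative):

using System;
using System.Threading;
using MongoDB.Driver;
using MongoDB.Driver.Core.Events;

// Keep a running count of open connections via pool events.
var openConnections = 0;
var settings = MongoClientSettings.FromConnectionString("mongodb://localhost:27017");
settings.ClusterConfigurator = cb =>
{
    cb.Subscribe<ConnectionOpenedEvent>(_ => Interlocked.Increment(ref openConnections));
    cb.Subscribe<ConnectionClosedEvent>(_ => Interlocked.Decrement(ref openConnections));
};
var client = new MongoClient(settings);
Console.WriteLine($"Open connections: {Volatile.Read(ref openConnections)}");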

Comment by Dmitry Lukyanov (Inactive) [ 21/Oct/22 ]

Hey hans.olav.loftum@nrk.no , thank you for your report.

We're investigating your case. Meanwhile can you provide/try the following:

  1. "we get lots of timeouts." - can you provide these exceptions with stacktrace?
  2. It would be good to get a repro if possible, this will help a lot with investigating. Also, can you provide your use case, which operations fail, whether threads are involved and etc.
  3. Does this behavior happen constantly under load? Has it ever happened without load?
  4. You mentioned you configure EventSubscriber, can you provide its code and in general which options you specify during creating MongoClient? Can you check this issue without EventSubscriber configured?
  5.  Do you use MongoClient as singleton?
  6. Also, it would be good if you can validate this behavior in pre-2.16 driver, like in 2.15.1.

Thanks in advance.
