[CSHARP-1532] ConnectionPool death Created: 13/Jan/16  Updated: 20/Jan/16  Resolved: 14/Jan/16

Status: Closed
Project: C# Driver
Component/s: Connectivity
Affects Version/s: 2.2
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Bret Ferrier Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: c#, connection, driver
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

.net 4.5.2 Azure Websites


Issue Links:
Duplicate
duplicates CSHARP-1515 Queries using the Legacy API can leak... Closed

 Description   

I have been having an intermittent issue where the app is no longer able to connect to the database. I first increased the maximum number of connections in the connection pool for a Medium-size Azure Website (2 GB of memory) to 400, which decreased the frequency of the error and mostly made it go away. I opened a help ticket with Microsoft and was able to track the error down to one of the 3 VMs running the site running out of memory.
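For context, here is a minimal sketch of how the pool limits mentioned above map onto the 2.x C# driver's settings. This is not from the original report; the host name and class name are placeholders:

```csharp
// Sketch only: configuring the connection pool cap and the wait-queue
// timeout on the 2.x driver. "example-host" is a placeholder.
using System;
using MongoDB.Driver;

public static class PoolConfig
{
    public static MongoClient Create()
    {
        // Equivalent connection-string form:
        //   mongodb://example-host/?maxPoolSize=400&waitQueueTimeoutMS=2000
        var settings = MongoClientSettings.FromUrl(new MongoUrl("mongodb://example-host"));
        settings.MaxConnectionPoolSize = 400;                // cap on pooled connections per server
        settings.WaitQueueTimeout = TimeSpan.FromSeconds(2); // how long a caller waits for a free connection
        return new MongoClient(settings);
    }
}
```

When the pool is exhausted (for example, because connections are leaking and never returned), callers queue up and then fail with the TimeoutException shown below once the wait-queue timeout elapses.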

After looking over my code, it appears the error is caused by the C# driver not recovering once the connection pool gets into a bad state. Whenever traffic to my site grows, Azure automatically adds new VMs, and I often have about 6 VMs serving the site. When an error occurs connecting to Mongo, it occurs on only one server; that server becomes unresponsive, and the error below is what shows up in the logs. It appears that after some amount of time the ConnectionPool can get into a bad state from which it never recovers until the site is restarted.

Error Message:
System.TimeoutException
Timed out waiting for a connection after 0ms.

System.TimeoutException: Timed out waiting for a connection after 0ms.
at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.AcquireConnectionHelper.EnteredPool(Boolean enteredPool)
at MongoDB.Driver.Core.ConnectionPools.ExclusiveConnectionPool.<AcquireConnectionAsync>d__35.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
at MongoDB.Driver.Core.Servers.ClusterableServer.<GetChannelAsync>d__40.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable`1.ConfiguredTaskAwaiter.GetResult()
at MongoDB.Driver.Core.Operations.FindOperation`1.<ExecuteAsync>d__107.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at MongoDB.Driver.OperationExecutor.<ExecuteReadOperationAsync>d__1`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at MongoDB.Driver.MongoCollectionImpl`1.<ExecuteReadOperationAsync>d__59`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at MongoDB.Driver.IAsyncCursorSourceExtensions.<ToListAsync>d__16`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at CitySpark.Common.Utilities.Retry.<TimesAsAwaitable>d_5`1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
at CitySpark.EventCache.FlatEventRepo.<GetEventsPagedAsync>d__14.MoveNext()



 Comments   
Comment by ravindra [ 20/Jan/16 ]

Yes, I did upgrade during the last week of December and deployed to the servers last Friday, hence the issue. Thanks for the quick patch.

Comment by Bret Ferrier [ 20/Jan/16 ]

It seems this is also a regression in the newer drivers, which weren't out 6 months ago. So at some point you must have upgraded your drivers.

Comment by Craig Wilson [ 19/Jan/16 ]

Presumably, your load got a little higher last week. This is reproducible by simply having a higher load.

Comment by ravindra [ 19/Jan/16 ]

OK, thanks, just saw that. Is there any way to reproduce this issue? The reason I ask is that, as I said, we have been running in production for 6 months and never faced this issue, yet it has happened twice in the last week.

Regards
Ravindra

Comment by Craig Wilson [ 19/Jan/16 ]

Ravindra,

It has been released already. You can get it from NuGet or GitHub.

Craig

Comment by ravindra [ 19/Jan/16 ]

We are running into the same issue. However, our site has been using MongoDB with the legacy API for the past 6 months, and we saw this issue for the first time on Saturday and again today. Recycling the web server process (w3wp.exe) fixes it, but that seems to be a temporary solution.

When can we expect a patch for this?

Ravi

Comment by Bret Ferrier [ 14/Jan/16 ]

Ok, hopefully that will go out to Nuget soon.

Comment by Craig Wilson [ 14/Jan/16 ]

Hi Bret,

Since you have some legacy API involved, I believe the problem is a duplicate of CSHARP-1515, which we found and fixed today. We'll be issuing a patch release (2.2.2) very soon to fix this issue. Basically, the legacy API has a bug that causes connections not to be returned to the pool. So keep an eye on that one and look for 2.2.2 shortly.
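To illustrate the failure mode described above: the actual leak in CSHARP-1515 was internal to the legacy API and required no user error, but until the 2.2.2 patch lands, routing reads through the new async API avoids the affected code path. A hedged sketch, with a hypothetical collection name, showing the new-API cursor pattern that reliably returns its connection to the pool:

```csharp
// Illustrative sketch, not the driver's internal fix: the CSHARP-1515 leak
// lived inside the legacy API. The new API's cursor releases its channel
// back to the pool when disposed, so wrap it in `using`.
// The collection name "events" is a placeholder.
using System.Threading.Tasks;
using MongoDB.Bson;
using MongoDB.Driver;

public static class CursorExample
{
    public static async Task<long> CountAllAsync(IMongoDatabase db)
    {
        var collection = db.GetCollection<BsonDocument>("events");

        long count = 0;
        // FindAsync returns an IAsyncCursor<T>; disposing it returns the
        // underlying connection to the pool even if iteration stops early.
        using (var cursor = await collection.FindAsync(new BsonDocument()))
        {
            while (await cursor.MoveNextAsync())
            {
                foreach (var doc in cursor.Current) // Current is the current batch
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Helpers like `ToListAsync()` follow this same pattern internally, disposing the cursor when the list is fully materialized.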

Craig

Comment by Bret Ferrier [ 14/Jan/16 ]

Craig, our code base contains calls that use both the legacy and the new API. That said, the "hot" paths that are called most often have been converted to use the async version of the new API, and that is where I am seeing the errors, since 95% of the traffic goes to the same few pages, which mainly use the async version. (You can see in the call stack that it is calling ToListAsync.)

I have tried to get a mini/full dump of the server when it was in an error state, but I don't have that kind of access since it's an Azure Website, and there wasn't enough memory left on the machine to take a dump. It really seems like the ConnectionPool gets poisoned and never recovers; new servers are spun up, but that one is a lame duck.

Comment by Craig Wilson [ 13/Jan/16 ]

Hi Bret,

Are you using the legacy API or the new API? And are you using the sync or async versions?

Craig

Generated at Wed Feb 07 21:39:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.