[CSHARP-2648] Connection Reset By Peer - with driver 2.8.0 and mongo 4.0.9 on a k8s cluster Created: 24/Jun/19  Updated: 20/Jul/20  Resolved: 20/Jul/20

Status: Closed
Project: C# Driver
Component/s: Connectivity
Affects Version/s: 2.8.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alok Kumar Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

AWS EKS, Ubuntu base images and Ubuntu host for clients, Mongo 4.0.9 docker image (replica set) hosted on AWS EKS - same cluster as the clients.



 Description   

We have been getting "Connection Reset by Peer" mongo errors in our setup. A description of the setup:

  • mongo running as a replicaset in a k8s cluster on EKS
  • clients (C#) running in the same k8s cluster on EKS
  • mongo 4.0.9
  • C# driver 2.8.0
  • Connection pooling ON
  • max idle time not set (defaults to 10s)
  • max connection lifetime not set (defaults to 10s)

We observed that a burst of calls, say 500 key-based selects in a row, completes with no issue. If we then pause for 5 minutes and repeat the test, the very first call fails with "Connection reset by peer", after which the rest of the test continues normally. This happens every time after a pause.
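
A rough sketch of the reproduction, for illustration only (the connection string, database, collection, and key values below are placeholders):

    using System;
    using System.Threading;
    using MongoDB.Bson;
    using MongoDB.Driver;

    class ReproSketch
    {
        static void Main()
        {
            // Placeholder connection string for the in-cluster replica set.
            var client = new MongoClient(
                "mongodb://mongo-0.mongo,mongo-1.mongo,mongo-2.mongo/?replicaSet=rs0");
            var collection = client.GetDatabase("testdb").GetCollection<BsonDocument>("items");

            for (var round = 0; round < 2; round++)
            {
                // A burst of 500 key-based selects: no errors here.
                for (var i = 0; i < 500; i++)
                {
                    var filter = Builders<BsonDocument>.Filter.Eq("_id", i);
                    collection.Find(filter).FirstOrDefault();
                }

                // After the 5 minute pause, the first call of the next burst
                // fails with "Connection reset by peer".
                Thread.Sleep(TimeSpan.FromMinutes(5));
            }
        }
    }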

The same pattern occurs with real user behavior, where spurts of activity are followed by lulls, so we keep getting "Connection reset by peer" at critical points in the business workflow. On the client side, the workaround is to code defensively and retry the call, but that means changes in many places.
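
The defensive retry we mean is along these lines; a minimal sketch, assuming the failure surfaces as a MongoConnectionException (the RetryOnce helper name is ours, not part of the driver):

    using System;
    using MongoDB.Driver;

    static class RetryHelper
    {
        // Hypothetical helper: retry an operation once if the pooled
        // connection turns out to have been reset by the server.
        public static T RetryOnce<T>(Func<T> operation)
        {
            try
            {
                return operation();
            }
            catch (MongoConnectionException)
            {
                // The broken connection has been removed from the pool,
                // so a single retry runs on a fresh socket.
                return operation();
            }
        }
    }

    // Usage at each call site (collection and filter are placeholders):
    // var doc = RetryHelper.RetryOnce(() => collection.Find(filter).FirstOrDefault());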

Other combinations attempted:

  • mongo 4.0.9
  • C# driver 2.8.0
  • Connection pooling ON
  • max idle time 120s
  • max connection lifetime 60s

However, the behavior did not change.

It appears to us that the TCP connection is closed on the server side while the client still considers it valid and tries to reuse it, leading to this error.
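
If that theory is right, one mitigation is to keep the pool's idle/lifetime limits below whatever closes idle TCP connections between client and server (the server itself, kube-proxy, a load balancer, etc.). A minimal sketch, with illustrative values and placeholder host names:

    using System;
    using MongoDB.Driver;

    // Placeholder connection string for the replica set.
    var settings = MongoClientSettings.FromConnectionString(
        "mongodb://mongo-0.mongo,mongo-1.mongo,mongo-2.mongo/?replicaSet=rs0");

    // Drop pooled connections before anything on the network path can
    // silently close them; the 60 second values are illustrative.
    settings.MaxConnectionIdleTime = TimeSpan.FromSeconds(60);
    settings.MaxConnectionLifeTime = TimeSpan.FromSeconds(60);

    var client = new MongoClient(settings);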

Has anybody else faced such a situation? Any suggestions would be appreciated; we are happy to provide more information if needed.



 Comments   
Comment by Jeffrey Yemin [ 20/Jul/20 ]

Hi alok.kumar@lendfoundry.com

Sorry for losing track of this. Do you have a full stack trace available? I'm surprised this would happen after a pause given that idle connections in the pool should have been pruned in that 5 minute interval.

Also, as this sounds more like a support issue, I wanted to give you some other resources to get this question answered more quickly:

I'm going to close this now, but happy to re-open if you have more information.

Comment by Riaz Ahmad [ 23/Jul/19 ]

This issue looks to be related to CSHARP-2621

Comment by Alok Kumar [ 25/Jun/19 ]

I need to make a correction to this issue report.

Initial settings were:

  • max idle time set to 10 minutes
  • max connection lifetime set to 10 minutes

We used to get "connection reset by peer" errors; not many, but they did occur.

Subsequent settings were:

  • max idle time set to 120 minutes
  • max connection lifetime set to 60 minutes

After this change, the "connection reset by peer" errors increased.

Now we have changed these settings to the following values:

  • max idle time set to 60 seconds
  • max connection lifetime set to 60 seconds

The errors have now dropped significantly.

With larger connection lifetimes, the error count is higher.
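
For reference, the same limits can also be set directly on the connection string; a sketch with placeholder host names, assuming the driver's maxIdleTimeMS / maxLifeTimeMS options:

    using MongoDB.Driver;

    // 60 second idle and lifetime limits expressed as URI options.
    var client = new MongoClient(
        "mongodb://mongo-0.mongo,mongo-1.mongo,mongo-2.mongo/" +
        "?replicaSet=rs0&maxIdleTimeMS=60000&maxLifeTimeMS=60000");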

I could not find a way to edit the original issue; if someone can edit it for me, it would be appreciated.
