[CSHARP-1748] Not catching certain error scenarios from replicaset members Created: 17/Aug/16  Updated: 05/Apr/19  Resolved: 11/Jan/18

Status: Closed
Project: C# Driver
Component/s: Connectivity
Affects Version/s: 2.2.4
Fix Version/s: None

Type: Task Priority: Critical - P2
Reporter: Chad Kreimendahl Assignee: Robert Stam
Resolution: Cannot Reproduce Votes: 0
Labels: question
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-25663 Odd connection timeouts and rejection... Closed

 Description   

We had an issue today where, in a 3-member replica set, when one of the secondaries becomes angry and starts erroring out on simple connectivity, our entire site can go down in a 503 scenario.

SERVER-25663 is the related issue. We are unable to reproduce, because the scenario in which the secondary got into its mess is uncertain.

We are sure, however, that there must be some unhandled exception coming back through the client connectivity layer, causing a complete failure of the site. It may be an unexpected network error.

Simply shutting down the angry secondary immediately fixed the issue. When the angry secondary came back up, it synced and all remained well.



 Comments   
Comment by Robert Stam [ 11/Jan/18 ]

From the information provided it appears that this was a server issue and not a driver issue.

Comment by Chad Kreimendahl [ 18/Aug/16 ]

We did quite a bit more advanced research on this issue and believe it may be a case that the C# driver cannot effectively handle. The major issue here is obviously a problem with SERVER-25663. Based on what I'll describe below, I'm not currently sure it was something that could be handled. (I'll add this to SERVER-25663 as well.)

It appears that the slowness started approximately 20 minutes into a "mongodump" backup being performed on the secondary in question. When these mongodump processes run, they eat up every available ounce of memory, sometimes forcing mongod to use swap (vm.swappiness=1, because swap is bad but OOM is worse). Based on observations, there is either some form of memory leak in mongodump, or some highly unnecessary usage of memory.

In this low-memory situation, with some of mongod's data swapping, we get enormously long queries. Finds that normally take 2-10ms begin to take between 5 and 100 seconds. It was in this scenario that the problem began. The SEND_ERROR we were seeing is likely the client side nuking the connection because it took too long.
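As context for that client-side timeout behavior, below is a minimal sketch, assuming the 2.x C# driver, of tightening client-side timeouts so that requests against a stalled member fail fast instead of holding connections open. The host names and timeout values are illustrative assumptions, not recommendations from this ticket.

using System;
using MongoDB.Driver;

// Sketch: tighter client-side timeouts (illustrative values, hypothetical hosts).
var url = new MongoUrl("mongodb://host1,host2,host3/?replicaSet=rs0");
var settings = MongoClientSettings.FromUrl(url);
settings.ServerSelectionTimeout = TimeSpan.FromSeconds(5); // give up choosing a member quickly
settings.SocketTimeout = TimeSpan.FromSeconds(10);         // abort sends/receives that hang
settings.ConnectTimeout = TimeSpan.FromSeconds(5);         // cap time to establish a connection

var client = new MongoClient(settings);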

Comment by Chad Kreimendahl [ 18/Aug/16 ]

Questions answered... the follow-up comment has some new findings.

1. Yes, reads are always happening from numerous processes, all the time. (As are writes... in this case, there was a specific batch job running against 1 collection [of thousands].)
2. 95% PrimaryPreferred, 5% SecondaryPreferred (by volume); see the sketch after this list.
3. 1 primary, 3 secondaries. One of the secondaries is "hidden" and in our disaster recovery data center.
4. Our problem was that we didn't get any stack traces, because all queries were returning abnormally slow responses (see next comment). We use a technology that attempts to log all exceptions to Mongo, given that Mongo has been our most reliable data store to this point. It also emails errors to a distro list. All the emails are generic w3wp failures.
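A minimal sketch, assuming the 2.x C# driver, of the two read preferences mentioned in answer 2, applied per collection handle. The database and collection names are hypothetical.

using MongoDB.Bson;
using MongoDB.Driver;

// Sketch: per-collection read preferences (hypothetical names).
var client = new MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0");
var db = client.GetDatabase("app");

// Bulk of the traffic: prefer the primary, fall back to a secondary if it is unavailable.
var primaryPreferred = db.GetCollection<BsonDocument>("orders")
    .WithReadPreference(ReadPreference.PrimaryPreferred);

// Remaining traffic: prefer a secondary to keep read load off the primary.
var secondaryPreferred = db.GetCollection<BsonDocument>("orders")
    .WithReadPreference(ReadPreference.SecondaryPreferred);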

Comment by Craig Wilson [ 17/Aug/16 ]

Thanks. Another couple of questions:

1. Was your site doing reads and the massive set of updates mentioned in SERVER-25663 were happening from another process?
2. What read-preference does your site use?
3. What is the configuration of your replica set? i.e., Do you have 1 primary, 1 secondary, and an arbiter? Two secondaries? etc...
4. What makes you think the driver wasn't catching exceptions? Could you provide some of the exceptions and stack traces you caught so we can see them?

Thanks.

Comment by Chad Kreimendahl [ 17/Aug/16 ]

Without restarting the app or doing anything other than shutting down the service on the bad secondary, it started working within 5 seconds. We did have periods when things were just absurdly slow and timed out. Eventually, however, we began to get 503 errors, even though we're catching all exceptions with Application_Error code.
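For context on the hook mentioned above, here is a minimal sketch of an ASP.NET Application_Error handler in Global.asax.cs. It is assumed code, not the reporter's, and the logging helpers named in the comments are hypothetical placeholders.

using System;
using System.Web;

// Sketch: last-chance exception hook of the kind described above.
public class Global : HttpApplication
{
    protected void Application_Error(object sender, EventArgs e)
    {
        var ex = Server.GetLastError();   // last unhandled exception for this request
        if (ex == null)
        {
            return;
        }

        // Hypothetical sinks; if these also depend on the same replica set, they can
        // fail during the incident, which would be consistent with the missing stack traces.
        // LogToMongo(ex);
        // EmailDistro(ex);
    }
}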

Comment by Craig Wilson [ 17/Aug/16 ]

That is certainly odd, and without being able to reproduce it, this is going to be that much harder to track down. In the low-level connection code, we catch every exception. That being said, client code is still required to catch exceptions, which I assume you are already doing.
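To make that expectation concrete, a minimal sketch, assumed rather than taken from the driver team, of application code catching driver exceptions around a read. Host, database, and collection names are illustrative.

using System;
using MongoDB.Bson;
using MongoDB.Driver;

// Sketch: catching driver exceptions in application code (illustrative names).
var client = new MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0");
var orders = client.GetDatabase("app").GetCollection<BsonDocument>("orders");

try
{
    var doc = orders.Find(Builders<BsonDocument>.Filter.Eq("status", "open")).FirstOrDefault();
}
catch (MongoConnectionException)
{
    // Network-level failure talking to a replica set member.
}
catch (TimeoutException)
{
    // e.g. server selection or socket timeout when no suitable member responds in time.
}
catch (MongoException)
{
    // Any other driver-reported error.
}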

You stated this: "Simply shutting down the angry secondary immediately fixed the issue." Do you mean that, without restarting the app or doing anything else at all, shutting down the secondary caused your app to begin working again? If so, when it wasn't working, were there exceptions you were seeing, or were things taking a long time? How did you know something was wrong?
