[CSHARP-1748] Not catching certain error scenarios from replicaset members Created: 17/Aug/16 Updated: 05/Apr/19 Resolved: 11/Jan/18 |
|
| Status: | Closed |
| Project: | C# Driver |
| Component/s: | Connectivity |
| Affects Version/s: | 2.2.4 |
| Fix Version/s: | None |
| Type: | Task | Priority: | Critical - P2 |
| Reporter: | Chad Kreimendahl | Assignee: | Robert Stam |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | question |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Description |
|
We had an issue today where, in a 3-member replica set, one of the secondaries became angry and started erroring out on simple connectivity, and our entire site went down in a 503 scenario.
We are sure, however, that there must be some unhandled exception coming back in the client connectivity, causing outright failure of the site. It may be an unexpected network error. Simply shutting down the angry secondary immediately fixed the issue. When the angry secondary came back up, it synced and all remained well. |
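(For context: client reads only reach a secondary when the read preference allows it. The snippet below is a minimal sketch of such a configuration for the 2.x C# driver, with hypothetical host names and replica set name; it is not the reporter's actual setup.)

```csharp
using MongoDB.Driver;

class ReadPreferenceSketch
{
    static MongoClient Create()
    {
        // Hypothetical three-member replica set, mirroring the deployment described.
        var url = new MongoUrl("mongodb://host1,host2,host3/?replicaSet=rs0");
        var settings = MongoClientSettings.FromUrl(url);

        // SecondaryPreferred routes reads to secondaries when one is available,
        // which is one way a single unhealthy secondary can end up in the
        // read path of an entire site.
        settings.ReadPreference = ReadPreference.SecondaryPreferred;

        return new MongoClient(settings);
    }
}
```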
| Comments |
| Comment by Robert Stam [ 11/Jan/18 ] |
|
From the information provided it appears that this was a server issue and not a driver issue. |
| Comment by Chad Kreimendahl [ 18/Aug/16 ] |
|
We did quite a bit more advanced research on this issue and believe it may be a case the C# driver cannot effectively handle. The major issue here is obviously a server-side problem: it appears that the slowness started approximately 20 minutes into a "mongodump" backup being performed on the secondary in question. When these mongodump processes run, they eat up every available ounce of memory, sometimes forcing mongod to use swap (vm.swappiness=1, because swap is bad but OOM is worse). Based on observations, there is either some form of memory leak in mongodump or some highly unnecessary usage of memory.

In this low-memory situation, with some mongod data swapping, we get enormously long queries: finds that normally take 2-10ms begin to take between 5 and 100 seconds. It was in this scenario that the problem began. The SEND_ERROR we were seeing is likely the client side nuking the connection because it took too long. |
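(Since the SEND_ERROR points at a client-side timeout, the driver's timeout settings are the relevant levers. Below is a minimal sketch of where those knobs live on MongoClientSettings in the 2.x C# driver; the values and host names are illustrative assumptions, not recommendations.)

```csharp
using System;
using MongoDB.Driver;

class TimeoutSettingsSketch
{
    static MongoClient Create()
    {
        // Hypothetical connection string for a three-member replica set.
        var url = new MongoUrl("mongodb://host1,host2,host3/?replicaSet=rs0");
        var settings = MongoClientSettings.FromUrl(url);

        // SocketTimeout bounds how long a single send/receive may block. A
        // swapping secondary that takes 5-100 seconds to answer will blow
        // past this, and the driver tears the connection down (consistent
        // with the SEND_ERROR symptom described above).
        settings.SocketTimeout = TimeSpan.FromSeconds(30);

        // ServerSelectionTimeout bounds how long the driver waits to find a
        // usable server before surfacing a TimeoutException to the caller.
        settings.ServerSelectionTimeout = TimeSpan.FromSeconds(10);

        return new MongoClient(settings);
    }
}
```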
| Comment by Chad Kreimendahl [ 18/Aug/16 ] |
|
Questions answered... the follow-up comment has some new findings.

1. Yes, reads are always happening from numerous processes, all the time. (As are writes... in this case, there was a specific batch job running against 1 collection [of thousands].) |
| Comment by Craig Wilson [ 17/Aug/16 ] |
|
Thanks. Another couple of questions:

1. Was your site doing reads and the massive set of updates mentioned in …?

Thanks. |
| Comment by Chad Kreimendahl [ 17/Aug/16 ] |
|
Without restarting the app or doing anything other than shutting down the service on the bad secondary, it worked within 5 seconds after. We did have circumstances where things were just absurdly slow and timed out. However, we eventually began to get 503 errors, even though we're catching all exceptions in our Application_Error code. |
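(For reference, a minimal sketch of the kind of ASP.NET Application_Error handler described above, assuming classic System.Web hosting in Global.asax.cs; the logging call is a placeholder.)

```csharp
using System;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_Error(object sender, EventArgs e)
    {
        // Last unhandled exception for the current request.
        var ex = Server.GetLastError();

        // Placeholder logging hook; substitute a real logger.
        System.Diagnostics.Trace.TraceError(ex != null ? ex.ToString() : "unknown error");

        // Note: Application_Error only sees exceptions that reach the ASP.NET
        // pipeline. A 503 is typically generated by IIS itself (e.g. a stopped
        // or overwhelmed app pool) before the request reaches the application,
        // which could explain 503s appearing despite this handler.
    }
}
```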
| Comment by Craig Wilson [ 17/Aug/16 ] |
|
That is certainly odd, and without being able to reproduce it, that's going to make it that much harder to find. In the low-level connection code, we catch every exception. That being said, client code is still required to catch exceptions, which I assume you are already doing. You stated this: "Simply shutting down the angry secondary immediately fixed the issue." Do you mean that, without restarting the app or doing anything at all, shutting down the secondary caused your app to begin working again? If so, when it wasn't working, were there exceptions you were seeing, or were things taking a long time? How did you know something was wrong? |
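(To illustrate the point that client code is still required to catch exceptions, here is a minimal sketch of guarding a read with the 2.x C# driver; the empty filter and error handling are illustrative only.)

```csharp
using System;
using MongoDB.Bson;
using MongoDB.Driver;

class GuardedReadSketch
{
    static BsonDocument TryFindOne(IMongoCollection<BsonDocument> collection)
    {
        try
        {
            // Any driver operation can surface connectivity problems.
            return collection.Find(Builders<BsonDocument>.Filter.Empty)
                             .FirstOrDefault();
        }
        catch (MongoConnectionException ex)
        {
            // Network-level failure talking to a member (e.g. the angry secondary).
            Console.Error.WriteLine("Connection failure: " + ex.Message);
            return null;
        }
        catch (TimeoutException ex)
        {
            // Server selection (or the operation itself) exceeded its timeout.
            Console.Error.WriteLine("Timed out: " + ex.Message);
            return null;
        }
    }
}
```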