[SERVER-25663] Odd connection timeouts and rejections when replicaset secondary is lagged Created: 17/Aug/16 Updated: 01/Feb/17 Resolved: 01/Feb/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Stability |
| Affects Version/s: | 3.2.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Chad Kreimendahl | Assignee: | Kelsey Schubert |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Steps To Reproduce: | The scenarios used to make this happen are not easily reproducible. Our setup is a 3 member replicaset in which one of them becomes angry enough to start closing connections. |
| Participants: |
| Description |
|
We were performing a rather massive set of updates. During these updates, one of the secondaries began to show SEND_ERROR messages in the logs and began to lag behind the primary. The CSHARP client was not catching the errors, which made everything inoperable from a client standpoint. The only values that show up in the logfiles are: – I will also file a report on this with the CSHARP crew, as the error appears to come back to the client unhandled. |
| Comments |
| Comment by Kelsey Schubert [ 01/Feb/17 ] |
|
Hi sallgeud, Thanks for confirming you haven't encountered Thank you, |
| Comment by Kelsey Schubert [ 26/Aug/16 ] |
|
Hi sallgeud, Thank you for the information. I've created a secure upload portal for you to use. Kind regards, |
| Comment by Chad Kreimendahl [ 26/Aug/16 ] |
|
1. mongodump -o /backups/repl1 |
| Comment by Kelsey Schubert [ 18/Aug/16 ] |
|
Hi sallgeud, Thanks for the additional information. I have a few questions so we can continue to investigate what is happening when the mongodump is executed.
Thank you again for your help, |
| Comment by Chad Kreimendahl [ 18/Aug/16 ] |
|
It appears that the slowness started approximately 20 minutes into a "mongodump" backup being performed on the secondary in question. When these mongodump processes run, they eventually eat up every available ounce of memory, sometimes forcing mongod to use swap (vm.swappiness=1, because swap is bad but OOM is worse). Based on observations, there is either some form of memory leak in mongodump or some highly unnecessary usage of memory. This appears to remain true whether or not you use any flags (gzip, j, etc). In this low-memory situation, with some of mongod's data swapped out, we get enormously long queries. Finds that typically take 2 - 10ms begin to take between 5 and 100 seconds. It was in this scenario that the problem began. The SEND_ERROR we were seeing is likely the client side nuking the connection because the query took too long. Specifically, we had an empty find returning 25 records from a collection take 100+ seconds. That seems outside the realm of swap alone being the problem. "iostat" on the system at the time showed nearly no disk activity, other than mongodump writes (which appear to happen in bulk). |
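One way to check the swap observation above while a mongodump is running is to read how much of the mongod process is currently swapped out. The following is a minimal sketch only, assuming a Linux host where /proc/<pid>/status exposes a VmSwap line; the function names and the sample text are hypothetical and do not come from this ticket.

```python
import re
from typing import Optional


def parse_vm_swap_kb(status_text: str) -> Optional[int]:
    """Return the VmSwap value (in kB) found in /proc/<pid>/status text.

    Returns None when the text contains no VmSwap line.
    """
    match = re.search(r"^VmSwap:\s+(\d+)\s+kB", status_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


def read_vm_swap_kb(pid: int) -> Optional[int]:
    """Read the swapped-out size of a live process (Linux only, assumption)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_swap_kb(status_file.read())


# Illustrative sample only; these are not values captured from the affected host.
SAMPLE_STATUS = "Name:\tmongod\nVmRSS:\t    1024 kB\nVmSwap:\t    2048 kB\n"
```

Calling `read_vm_swap_kb(pid)` with the mongod process id at intervals during the backup would show whether the swapped-out size grows as the queries slow down.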
| Comment by Chad Kreimendahl [ 17/Aug/16 ] |