[SERVER-25663] Odd connection timeouts and rejections when replicaset secondary is lagged Created: 17/Aug/16  Updated: 01/Feb/17  Resolved: 01/Feb/17

Status: Closed
Project: Core Server
Component/s: Networking, Stability
Affects Version/s: 3.2.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chad Kreimendahl Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-25070 Significant delays when large collect... Closed
Related
related to CSHARP-1748 Not catching certain error scenarios ... Closed
Operating System: ALL
Steps To Reproduce:

The scenarios used to make this happen are not easily reproducible. Our setup is a 3 member replicaset in which one of them becomes angry enough to start closing connections.

Participants:

 Description   

We were performing a rather massive set of updates. During these updates, one of the secondaries began to show SEND_ERROR messages in the logs, and also began to lag behind the primary. The CSHARP client was not catching the errors, causing everything to be inoperable from a client standpoint.

The only values that show up in the logfiles are:
2016-08-17T10:53:57.994-0500 I NETWORK [conn886174] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [webserver:55640]

– I will also file a report on this with the CSHARP crew, as it appears to be unhandled coming back to the client.



 Comments   
Comment by Kelsey Schubert [ 01/Feb/17 ]

Hi sallgeud,

Thanks for confirming you haven't encountered SERVER-25070 since upgrading to MongoDB 3.2.10. My understanding is that this issue has been resolved as well, so I'm updating the ticket. Please let me know if that isn't the case.

Thank you,
Thomas

Comment by Kelsey Schubert [ 26/Aug/16 ]

Hi sallgeud,

Thank you for the information. I've created a secure upload portal for you to use.

Kind regards,
Thomas

Comment by Chad Kreimendahl [ 26/Aug/16 ]

1. mongodump -o /backups/repl1
2. 3.2.8
3. Thousands. ~100
4. yes (a secondary)
5. at first near 0. By the end, it has large spikes. Will get exact numbers when I can test after-hours
6. about 60% (nearly all by mongod)
7. secure upload?

Comment by Kelsey Schubert [ 18/Aug/16 ]

Hi sallgeud,

Thanks for the additional information. I have a few questions so we can continue to investigate what is happening when the mongodump is executed.

  1. Would you please provide the command lines for mongodump that you have executed?
  2. Which version of mongodump are you using?
  3. How many collections are in your system? How many databases?
  4. Is the mongodump being executed on the same machines as the mongod instance?
  5. How much memory is mongodump using?
  6. How much memory is typically used when mongodump isn't running?
  7. Please upload the diagnostic.data of the affected mongod.

Thank you again for your help,
Thomas

Comment by Chad Kreimendahl [ 18/Aug/16 ]

It appears that the slowness started approximately 20 minutes into a "mongodump" backup being performed on the secondary in question. When these mongodump processes run, they eat up every available ounce of memory eventually, sometimes forcing mongod to use swap (vm.swapiness=1 because swap is bad but OOM is worse). Based on observations, their is either some form of memory leak in mongodump, or some highly unnecessary usage of memory. This appears to remain true whether you use any flags (gzip, j, etc) or not.

In this low memory situation, with some data in mongod swapping, we get enormously long queries. Finds that typically take 2 - 10ms begin to take between 5 and 100 seconds. It was in this scenario where the problem began. The SEND_ERROR we were seeing is likely the client side nuking the connection because it took too long.

Specifically we had an empty find that was returning 25 records from a collection take 100+ seconds. That seems outside the realm of swap alone being the problem. "iostat" on the system at the time showed nearly no disk activity, other than mongodump writes (which appear to happen in bulk)

Comment by Chad Kreimendahl [ 17/Aug/16 ]

CSHARP-1748

Generated at Thu Feb 08 04:09:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.