[SERVER-28974] Mongos leak connections to mongods Created: 26/Apr/17 Updated: 27/Oct/23 Resolved: 09/Jun/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking |
| Affects Version/s: | 3.2.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tuomas Silen | Assignee: | Kelsey Schubert |
| Resolution: | Gone away | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Operating System: | ALL |
| Steps To Reproduce: | The exact trigger is unknown, but it has happened multiple times now during high-traffic peaks that caused congestion in the internal network. |
| Participants: | |
| Description |
We ran into an issue where mongos suddenly started creating a lot of connections to the mongods and ran out of file descriptors. This eventually consumed all descriptors on the mongods as well, which then crashed (https://jira.mongodb.org/browse/SERVER-28933). It happened while there was some intermittent network congestion. We saw errors like this when the leaks occurred:
"0 connections" there is particularly interesting. Eventually the leaking lead to running out of descriptors:
The logs indicated that there should have been about 500 open client connections, but inspecting with lsof revealed fewer than 10. lsof also showed 6 connections to the mongocs, 23 connections to the primary, a total of 474 connections to the two secondaries, and 478 "sock" descriptors with "can't identify protocol". Normally there are none of those, and only a few connections to the secondaries. This is currently a somewhat special setup: a single shard (a replica set of 3 nodes) with about 200 mongoses connecting to it. The leaking occurred on all of the mongos instances (all were also affected by the network congestion). This is what connPoolStats showed around that time (although I've understood that it's not entirely reliable):
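As a rough illustration only (not the lsof or connPoolStats output referenced above), here is a minimal sketch of how such per-endpoint socket counts could be tallied; it assumes lsof is installed, and MONGOS_PID is a placeholder for the mongos process id rather than a value from this ticket:

```python
import subprocess
from collections import Counter

# Placeholder (assumption): replace with the actual mongos process id.
MONGOS_PID = "12345"

# -n/-P keep addresses and ports numeric; -p limits output to the mongos process.
lsof_output = subprocess.run(
    ["lsof", "-n", "-P", "-p", MONGOS_PID],
    capture_output=True, text=True,
).stdout

by_remote = Counter()
unidentified = 0
for line in lsof_output.splitlines()[1:]:  # skip the header row
    fields = line.split(None, 8)           # COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
    if len(fields) < 9:
        continue
    fd_type, name = fields[4], fields[8]
    if fd_type in ("IPv4", "IPv6") and "->" in name:
        # NAME looks like "10.0.0.5:53210->10.0.0.9:27017 (ESTABLISHED)"
        remote = name.split("->", 1)[1].split()[0]
        by_remote[remote] += 1
    elif fd_type == "sock" and "can't identify protocol" in name:
        unidentified += 1

for remote, count in by_remote.most_common():
    print(f"{count:5d}  {remote}")
print(f"{unidentified:5d}  sock descriptors with no identifiable protocol")
```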
| Comments |
| Comment by Kelsey Schubert [ 09/Jun/17 ] |

Hi devastor,

Thank you for the update. Since this is no longer an issue for you, I'm resolving this ticket. Please let us know if you encounter similar behavior in the future, and we will continue to investigate.

Kind regards,
| Comment by Tuomas Silen [ 30/May/17 ] |

Hi Thomas,

Unfortunately, we have not seen it after upgrading. Our setup has also changed a bit, so we don't have that many mongoses any more, which might affect it too.
| Comment by Kelsey Schubert [ 24/May/17 ] |

Hi devastor,

Have you encountered this issue again since upgrading?

Thanks,
| Comment by Tuomas Silen [ 03/May/17 ] |

Thanks Thomas,

Just to update, we are now running 3.2.13 and periodically logging the output of connPoolStats, so let's see what happens.
| Comment by Kelsey Schubert [ 27/Apr/17 ] |

Hi devastor,

Thank you for reporting this behavior. We've completed a large amount of work in MongoDB 3.2.13 to improve the diagnostics available for investigating issues such as this. The release candidate, MongoDB 3.2.13-rc0, is available now, and the general availability release is scheduled for early next week. In particular, MongoDB 3.2.13 includes:
Would you please upgrade and periodically collect the output of connPoolStats? Please feel free to increase the delay (in seconds) to a larger value that would still let the issue emerge, based on the timelines you've observed previously.
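A minimal sketch of such periodic collection, assuming pymongo is available; the mongos URI and the 60-second delay below are placeholders rather than values from this ticket:

```python
import json
import time

from pymongo import MongoClient

# Placeholders (assumptions): the mongos address and the sampling delay.
MONGOS_URI = "mongodb://localhost:27017"
DELAY_SECONDS = 60

client = MongoClient(MONGOS_URI)
while True:
    # connPoolStats is an admin command; the fields below are the headline
    # totals, and the full document also contains per-host pool details.
    stats = client.admin.command("connPoolStats")
    print(json.dumps({
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "totalInUse": stats.get("totalInUse"),
        "totalAvailable": stats.get("totalAvailable"),
        "totalCreated": stats.get("totalCreated"),
    }))
    time.sleep(DELAY_SECONDS)
```

Logging the full stats document instead of just the totals would also preserve the per-host breakdown, at the cost of larger log lines.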
If you encounter this issue again, would you please upload the following diagnostic information:
I've created a secure upload portal for you to provide these files. Files uploaded to this portal are only visible to MongoDB employees and are routinely deleted after some time.

Thank you for your help,