[SERVER-28392] Memory leak and sluggish performance after tens of hours of uptime Created: 20/Mar/17 Updated: 31/May/17 Resolved: 28/Mar/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance |
| Affects Version/s: | 3.2.10, 3.2.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tudor Aursulesei | Assignee: | Kelsey Schubert |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
|
After two days of uptime, some of my shards start responding slower and slower to queries. Even if I stop all the workers loading data into the database and wait for all operations to finish, the mongod instances keep responding very slowly once the workers are restarted. Queries that took 0.1 seconds to run take 40-50 seconds or more. The operations that seem to trigger this behaviour are bulk updates to a collection ($pull a "job") and insertion into another collection in a different database. The WiredTiger cache fills up on those machines, and the mongod instance eats up all available RAM and CPU. Running
doesn't help. I am pretty sure this is a bug, because after I restart all the mongod instances I have no problem whatsoever for 2-3 days. I have very fast storage, so I don't mind reloading the hot data. How can I investigate this problem? What metrics should I monitor? I've tried creating a dummy stress test script to run against a 3.2 instance to see if I can trigger the bug and compare with a 3.4 instance, but I haven't succeeded yet. Running too many bulk inserts and updates on the same collection seems to be the culprit. The operations start waiting one after the other and the yields pile up on each other. This is when RAM and CPU usage spike and performance degrades until the restart. |
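For illustration only, a minimal sketch of the kind of dummy stress script described above: bulk $pull updates against one collection combined with inserts into a collection in a second database. All database, collection, and field names are hypothetical placeholders, not taken from the reporter's deployment.

```bash
#!/bin/bash
# Hypothetical stress workload: bulk $pull updates on one collection plus
# inserts into a collection in a different database, against a 3.2 mongod.
# All names (jobsdb, queue, resultsdb, results, jobs, worker) are placeholders.
mongo --quiet localhost:27017 <<'EOF'
var jobs = db.getSiblingDB("jobsdb").queue;
var results = db.getSiblingDB("resultsdb").results;
for (var round = 0; round < 1000; round++) {
    // Bulk update: $pull a finished "job" id out of many documents at once.
    var bulk = jobs.initializeUnorderedBulkOp();
    for (var i = 0; i < 500; i++) {
        bulk.find({ worker: i % 50 }).update({ $pull: { jobs: "job-" + round } });
    }
    bulk.execute();
    // Load on a second database: plain inserts.
    for (var j = 0; j < 500; j++) {
        results.insert({ round: round, seq: j, payload: new Array(100).join("x") });
    }
}
EOF
```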
| Comments |
| Comment by Kelsey Schubert [ 28/Mar/17 ] | |||||||
|
Hi thestick613, Thanks for the update confirming that the root cause was outside of MongoDB. I'm glad you were able to resolve this issue. Kind regards, | |||||||
| Comment by Tudor Aursulesei [ 28/Mar/17 ] | |||||||
|
We think we got to the root of the problem. We disabled LRO on all our network interfaces (https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1027511) and the problem went away. The CPU spikes were generated by the Linux kernel, which tried to reassemble packets that were too large. This isn't a MongoDB server issue. | |||||||
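For reference, a common way to check and disable LRO on a Linux guest is with ethtool; the interface name eth0 is a placeholder, and the VMware KB article linked above covers the vSphere-side settings.

```bash
# Check the current offload settings for the interface (eth0 is a placeholder).
ethtool -k eth0 | grep -i large-receive-offload

# Disable LRO on that interface.
ethtool -K eth0 lro off
```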
| Comment by Tudor Aursulesei [ 24/Mar/17 ] | |||||||
|
You are right - I am opening a lot of cursors sometimes; unfortunately, I think that this server is a hot shard. | |||||||
| Comment by Kelsey Schubert [ 24/Mar/17 ] | |||||||
|
Hi thestick613, I've reviewed the data you've provided. mongod appears to be increasing its memory consumption outside of the WiredTiger cache. We've been able to rule out a number of common explanations for this behavior, such as an increase in connections or a large number of open cursors, but we do not have enough information to conclusively determine the root cause. To help us continue to investigate, would you be willing to restart an affected mongod with the heap profiler enabled and run it for a couple of days?
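A plausible sketch of what enabling the heap profiler involves, assuming the heapProfilingEnabled server parameter available in MongoDB 3.2 builds that use tcmalloc; the exact instructions given in the original comment may have differed.

```bash
# Restart the affected mongod with the heap profiler enabled, keeping all of
# your existing startup options (the config file path is a placeholder).
# heapProfilingEnabled is assumed here; it is a diagnostic-only parameter.
mongod --config /etc/mongod.conf --setParameter heapProfilingEnabled=true
```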
This will allow us to identify which threads are accumulating memory. Please note that this is a diagnostic setting, and we have observed a 10-30% performance impact on typical CPU-bound workloads. After reviewing the diagnostic.data and iostat.log, I believe that enabling this setting should not significantly harm your performance, and it will provide the data we need to determine the root cause. However, to be safe, I would recommend monitoring the mongod more carefully after enabling the heap profiler. I've created a secure upload portal for you to use. After collecting the additional diagnostic.data, please upload the following:
For your information, in MongoDB 3.2 diagnostic.data periodically collects the output of the following commands, which you are welcome to execute yourself to examine the output.
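Rough mongo shell equivalents of the commands FTDC samples on a 3.2 mongod are shown below; the exact list in the original comment may have differed.

```bash
# Approximate equivalents of the commands FTDC collects on a 3.2 mongod.
mongo --quiet --eval 'printjson(db.serverStatus())'
mongo --quiet --eval 'printjson(rs.status())'                                # replSetGetStatus
mongo --quiet --eval 'printjson(db.getSiblingDB("local").oplog.rs.stats())'  # collStats on local.oplog.rs
```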
Please let me know if you have any questions or concerns about the steps outlined above. Thank you, | |||||||
| Comment by Tudor Aursulesei [ 24/Mar/17 ] | |||||||
|
I only wanted to share the files with the mongod dev team, just to be extra safe. I'd rather not share the files publicly, only with you. I haven't found any information on what is inside the diagnostic.data folder, as it is a binary format. | |||||||
| Comment by Kelsey Schubert [ 24/Mar/17 ] | |||||||
|
Hi thestick613, Thanks for uploading the diagnostic data. I see you've marked your last comment with a nonstandard visibility setting. Please note that the comment visibility in JIRA does not affect the visibility of attachments. Therefore, while the requested information does not contain any user information, I've preemptively removed the attachment from the ticket in case it was not your intention to publicly share the files. Kind regards, | |||||||
| Comment by Kelsey Schubert [ 20/Mar/17 ] | |||||||
|
Hi thestick613, Thanks for reporting this issue. So that we can investigate, would you please upload an archive of the diagnostic.data directory? Additionally, would you please execute the following shell script:
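A representative sketch of the kind of iostat capture typically requested in these investigations; the exact script in the original comment may have differed.

```bash
#!/bin/bash
# Capture extended disk and CPU statistics once per second, with timestamps,
# into iostat.log until the issue reproduces (requires the sysstat package).
iostat -xmt 1 > iostat.log
```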
After encountering the issue again, please reupload the diagnostic.data along with iostat.log. This will allow us to identify whether there is a system bottleneck that is contributing to this behavior. Thank you, | |||||||
| Comment by Tudor Aursulesei [ 20/Mar/17 ] | |||||||
|
These two tickets seem to describe the same issue: |