[SERVER-42662] Memory leak in 3.6.13? Created: 07/Aug/19 Updated: 29/Oct/19 Resolved: 29/Oct/19
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.6.13 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Adrien Jarthon | Assignee: | Danny Hatcher (Inactive) |
| Resolution: | Done | Votes: | 3 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
Hi, we're currently trying to upgrade from 3.4.16 to 3.6.13 to hopefully improve the performance issues we've seen in 3.4 (e.g. the linked tickets). This is the comparison between our 2 secondary servers (we have two replica sets). The workload didn't change of course; it's just taking more and more memory since 3.6.13.

3.4.16: [memory graph attachment]

3.6.13: [memory graph attachment]

The difference is most notable in "Bytes in page heap freelist": why aren't those bytes released to the OS? Could this also be related to the linked ticket? Thank you. |
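For reference, the counters quoted above come from the tcmalloc section of serverStatus. A minimal sketch for pulling just that section in the mongo shell; the "Bytes in page heap freelist" line should correspond to the pageheap_free_bytes counter, though the exact nesting varies a little between versions:

```
// Sketch: fetch the allocator stats from serverStatus.
// pageheap_free_bytes is memory tcmalloc holds in its page heap
// freelist without returning it to the OS.
var ss = db.serverStatus({ tcmalloc: 1 });
printjson(ss.tcmalloc);
```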
| Comments |
| Comment by Adrien Rey-Jarthon [ 28/Oct/19 ] | |
|
Ok, we'll see how it goes | |
| Comment by Danny Hatcher (Inactive) [ 28/Oct/19 ] | |
|
As we're fairly confident that this is not a memory leak and is simply part of your larger eviction issues (tracked in the linked ticket), I'm going to close this ticket; we can continue the investigation there. | |
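For anyone following along, the cache/eviction pressure referred to here can be checked from the same serverStatus output; a sketch using WiredTiger stat names from the 3.6 era:

```
// Sketch: sustained "bytes currently in the cache" near the configured
// maximum, together with high dirty bytes, points at eviction pressure
// rather than an allocator leak.
var c = db.serverStatus().wiredTiger.cache;
print("max:   " + c["maximum bytes configured"]);
print("used:  " + c["bytes currently in the cache"]);
print("dirty: " + c["tracked dirty bytes in the cache"]);
```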
| Comment by Adrien Jarthon [ 12/Aug/19 ] | |
|
Ok, I now have about 2 days of logs with the instrumentation; not sure if they're of any help, but I uploaded them. | |
| Comment by Adrien Jarthon [ 09/Aug/19 ] | |
|
Mmm ok, let's see how it goes then | |
| Comment by Danny Hatcher (Inactive) [ 09/Aug/19 ] | |
|
bigbourin@gmail.com, you beat my comment by a few minutes. I've discussed this potential memory leak with some coworkers and the general conclusion is that this is not a leak at all but a symptom of the cache-related issues that you've been having. In that vein, we do not believe that the heap profiler will be necessary in this case. You can safely re-restart the node without the heap profiler option while we investigate the linked ticket. | |
| Comment by Adrien Jarthon [ 09/Aug/19 ] | |
|
I spent more time doing tests for the linked ticket. | |
| Comment by Danny Hatcher (Inactive) [ 08/Aug/19 ] | |
|
Ah, I apologize, I should have realized that you were using PSA. I think it's reasonable then for you to downgrade back to 3.4 for now until we are more confident about the answer to the linked ticket. | |
| Comment by Adrien Jarthon [ 08/Aug/19 ] | |
|
Well, we don't have another secondary as we're in PSA, so not really. I understand it is hard, and it's also annoying for us to spend half of our time fighting MongoDB performance issues. We can't spend thousands more on servers (for PSS) just to help you fix your issues; we're not paid for that, unfortunately ☹ Maybe once we know why the linked issue happens. | |
| Comment by Danny Hatcher (Inactive) [ 08/Aug/19 ] | |
|
Would it be possible to change the priority of the 3.6 node to 0 so that it is never eligible to become primary? I understand if that isn't possible or sufficient, but it will be very hard for us to diagnose a root cause without data from a node running 3.6. | |
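For reference, demoting a member so it can never be elected is a standard reconfig; a sketch, assuming the 3.6 node is members[1] in the replica set config (adjust the index to match your own config):

```
// Sketch: set priority 0 on the 3.6 node so it stays a secondary.
cfg = rs.conf();
cfg.members[1].priority = 0;  // index of the 3.6 node is an assumption
rs.reconfig(cfg);
```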
| Comment by Adrien Jarthon [ 08/Aug/19 ] | |
|
You're welcome. Unfortunately I'm not sure we'll be able to do this, as we're currently discussing reverting the secondary to 3.4.16. Indeed we don't want to risk leaving this version in production if it may die instantly when elected primary (cf. the linked ticket). | |
| Comment by Danny Hatcher (Inactive) [ 08/Aug/19 ] | |
|
Thanks for the diagnostics. I can confirm that I do see a rise in resident, page_heap_free, and page_heap_committed over time. As mentioned in my last comment, the next step will be to turn on the heap profiler on the 3.6 node and let it run for 1-2 days. After that period, you can restart the node without the heap profiler and provide both the full logs and diagnostics from the timeframe. There shouldn't be any other commands for you to run at this time. | |
| Comment by Adrien Jarthon [ 08/Aug/19 ] | |
|
Sure, I just uploaded the two diagnostic files (they don't go back to the start date but should show the increase) for the 2 secondaries (mongo2.s1 = 3.4.16, mongo1.s2 = 3.6.13). As for the option, we'll try adding it on the 3.6.13 secondary later. | |
| Comment by Danny Hatcher (Inactive) [ 07/Aug/19 ] | |
|
Additionally, as mentioned in Bruce's comment on the linked ticket, the next step will likely be to enable the heap profiler on the 3.6 node.
It will impact server performance a small amount but will help us see where exactly memory is being used. If you'd prefer I look at your existing data first I can absolutely do that, but I wanted to let you know ahead of time what will likely be the next step. | |
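For context, the heap profiler discussed here is controlled by the startup-only heapProfilingEnabled server parameter, so enabling or disabling it requires a mongod restart; a minimal config-file sketch:

```
# mongod.conf sketch: heapProfilingEnabled cannot be toggled at runtime,
# so plan a restart both to turn it on and later to turn it off.
setParameter:
  heapProfilingEnabled: true
```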
| Comment by Danny Hatcher (Inactive) [ 07/Aug/19 ] | |
|
Thanks for the in-depth report bigbourin@gmail.com, it's greatly appreciated. As a first step, can you please provide the "diagnostic.data" folder from both servers via our Support Uploader? It will give us the tcmalloc data and other metrics on a timeline so we can see how they change over time. I'd like to compare that information to the tickets you linked. | |
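For anyone locating that folder: the FTDC metrics live in a diagnostic.data directory under the server's dbPath; a sketch for finding the path from the shell:

```
// Sketch: print the storage options the running mongod was started with;
// diagnostic.data sits under the dbPath shown here (if storage is absent,
// the build's default dbPath applies).
printjson(db.serverCmdLineOpts().parsed.storage);
```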
| Comment by Adrien Jarthon [ 07/Aug/19 ] | |
|
I also attached the complete serverStatus output for both servers in case you need it. |