[SERVER-42662] Memory leak in 3.6.13? Created: 07/Aug/19  Updated: 29/Oct/19  Resolved: 29/Oct/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.13
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Adrien Jarthon Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2019-08-07-13-16-20-281.png     PNG File memory.png     Text File status-3.4.16.txt     Text File status-3.6.13.txt    
Operating System: ALL
Participants:

 Description   

Hi, we're currently trying to upgrade from 3.4.16 to 3.6.13 in the hope of improving the performance issues we've seen in 3.4 (e.g. SERVER-39355, SERVER-42256, SERVER-42062), but so far, after 1 week running as a secondary, we've only seen worse cache eviction performance and a memory leak. This ticket is for the memory leak issue.

This is the comparison between our 2 secondary servers (we have two replica sets):

yellow is 3.4.16
green is 3.4.16 until 7/31 and then 3.6.13

The workload didn't change, of course; the node has just been taking more and more memory since 3.6.13.
Here is the interesting difference between the two:

3.4.16:

dimelo:SECONDARY> db.serverStatus().tcmalloc.tcmalloc.formattedString
------------------------------------------------
MALLOC:   108775455752 (103736.4 MiB) Bytes in use by application
MALLOC: +   5002575872 ( 4770.8 MiB) Bytes in page heap freelist
MALLOC: +   2222075272 ( 2119.1 MiB) Bytes in central cache freelist
MALLOC: +       107520 (    0.1 MiB) Bytes in transfer cache freelist
MALLOC: +    165397104 (  157.7 MiB) Bytes in thread cache freelists
MALLOC: +    584966400 (  557.9 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: = 116750577920 (111342.0 MiB) Actual memory used (physical + swap)
MALLOC: +  24346746880 (23218.9 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: = 141097324800 (134560.9 MiB) Virtual address space used
MALLOC:
MALLOC:        5777478              Spans in use
MALLOC:           2860              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

3.6.13:

dimelo_shard:SECONDARY> db.serverStatus().tcmalloc.tcmalloc.formattedString
------------------------------------------------
MALLOC:   118992583216 (113480.2 MiB) Bytes in use by application
MALLOC: +  25706909696 (24516.0 MiB) Bytes in page heap freelist
MALLOC: +   1208287368 ( 1152.3 MiB) Bytes in central cache freelist
MALLOC: +        56768 (    0.1 MiB) Bytes in transfer cache freelist
MALLOC: +    862328712 (  822.4 MiB) Bytes in thread cache freelists
MALLOC: +    648044800 (  618.0 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: = 147418210560 (140589.0 MiB) Actual memory used (physical + swap)
MALLOC: +  13225304064 (12612.6 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: = 160643514624 (153201.6 MiB) Virtual address space used
MALLOC:
MALLOC:        5899545              Spans in use
MALLOC:            882              Thread heaps in use
MALLOC:           4096              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

The most notable difference is in "Bytes in page heap freelist": why aren't those bytes released to the OS? Could this be related to SERVER-37795? Could it also be related to the poorer cache eviction performance we're seeing in SERVER-42256?
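In case it helps, here is the kind of snippet we can run periodically from the mongo shell to track the raw counters behind the formatted output above (field names taken from the serverStatus tcmalloc section; this is just a quick sampling example, not something we run in production):

// sample the raw page heap counters from serverStatus
var t = db.serverStatus().tcmalloc.tcmalloc;
print(new Date().toISOString(),
      "pageheap_free_bytes:", t.pageheap_free_bytes,
      "pageheap_unmapped_bytes:", t.pageheap_unmapped_bytes);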

Thank you.



 Comments   
Comment by Adrien Rey-Jarthon [ 28/Oct/19 ]

Ok, we'll see how it goes

Comment by Danny Hatcher (Inactive) [ 28/Oct/19 ]

As we're fairly confident that this is not a memory leak and is simply part of your larger eviction issues (SERVER-42698), I'm going to close this ticket. If we do discover a memory leak through the investigation in SERVER-42698, we can always re-open this ticket or create a new one.

Comment by Adrien Jarthon [ 12/Aug/19 ]

OK, I now have about 2 days of logs with the instrumentation; not sure if they're of any help, but I uploaded them.
The data won't be relevant around Sunday 10:30 am (UTC+2) because we tried putting 3.6 as primary for a bit (SERVER-42698).

Comment by Adrien Jarthon [ 09/Aug/19 ]

Mmm ok, let's see how it goes then

Comment by Danny Hatcher (Inactive) [ 09/Aug/19 ]

bigbourin@gmail.com, you beat my comment by a few minutes. I've discussed this potential memory leak with some coworkers, and the general conclusion is that this is not a leak at all, but rather part of the cache-related issues that you've been having. In that vein, we do not believe that the heap profiler will be necessary in this case. You can safely restart the node again without the heap profiler option while we investigate SERVER-42698 to confirm what is going on there.

Comment by Adrien Jarthon [ 09/Aug/19 ]

I spent more time doing tests for SERVER-42698, which didn't reveal any obvious performance issue for 3.6.13 standalone, so we'll try running it as a secondary for a bit longer; we've restarted it today with the option you asked for.
We'll probably also test a stepdown on Sunday to make it primary and see whether or not it dies.

Comment by Danny Hatcher (Inactive) [ 08/Aug/19 ]

Ah, I apologize, I should have realized that you were using PSA. I think it's reasonable then for you to downgrade back to 3.4 for now, until we are more confident about the answer to SERVER-42698.

Comment by Adrien Jarthon [ 08/Aug/19 ]

Well, we don't have another secondary since we're running PSA, so not really. I understand it is hard, but it's also frustrating for us to spend half of our time fighting MongoDB performance issues. We can't spend thousands more on servers (for PSS) just to help you fix your issues; we're not paid for that, unfortunately ☹

Maybe once we know why SERVER-42698 happens and have proof that 3.6.13 won't kill us as primary, we can afford to take this risk for longer. We've had enough downtime because of Mongo already.

Comment by Danny Hatcher (Inactive) [ 08/Aug/19 ]

Would it be possible to change the priority of the 3.6 node to 0 so that it is never eligible to become primary? I understand if that isn't possible or sufficient, but it will be very hard for us to diagnose a root cause without that information.
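For reference, the change would look something like the sketch below (rs.reconfig is the standard mechanism; the member index is just a placeholder, so please check rs.conf() for the entry matching the 3.6 node):

// mark the 3.6 node as never electable (index 1 is an assumption; verify cfg.members[n].host first)
cfg = rs.conf()
cfg.members[1].priority = 0
rs.reconfig(cfg)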

Comment by Adrien Jarthon [ 08/Aug/19 ]

You're welcome. Unfortunately, I'm not sure we'll be able to do this, as we're currently discussing reverting the secondary to 3.4.16: we don't want to risk leaving this version in production if it may die instantly when elected primary (cf. SERVER-42698).

Comment by Danny Hatcher (Inactive) [ 08/Aug/19 ]

Thanks for the diagnostics. I can confirm that I do see a rise in resident, page_heap_free, and page_heap_committed over time. As mentioned in my last comment, the next step will be to turn on the heap profiler on the 3.6 node and let it run for 1-2 days. After that period, you can restart the node without the heap profiler and provide both the full logs and diagnostics from the timeframe. There shouldn't be any other commands for you to run at this time.

Comment by Adrien Jarthon [ 08/Aug/19 ]

Sure, I just uploaded the two diagnostic archives (they don't go back to the start date but should show the increase) for the 2 secondaries (mongo2.s1 = 3.4.16, mongo1.s2 = 3.6.13). As for the option, we'll try adding it on the 3.6.13 secondary later.
If you need any commands run or information gathered before mongod restarts, let us know before we do so.

Comment by Danny Hatcher (Inactive) [ 07/Aug/19 ]

Additionally, as mentioned in Bruce's comment on SERVER-37795, would it be possible to turn on the heap profiler on the problem node?

--setParameter heapProfilingEnabled=true

It will impact server performance a small amount, but it will help us see where exactly memory is being used. If you'd prefer I look at your existing data first, I can absolutely do that, but I wanted to let you know ahead of time what will likely be the next step.
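Note that heapProfilingEnabled is a startup-only parameter, so it takes effect after a restart. Once the node is back up, something like the following (using the standard getParameter command; the exact output shape may vary by version) should confirm the flag took effect:

// verify the heap profiler flag on the restarted node
db.adminCommand({ getParameter: 1, heapProfilingEnabled: 1 })
// expected to include { "heapProfilingEnabled" : true, "ok" : 1 } when active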

Comment by Danny Hatcher (Inactive) [ 07/Aug/19 ]

Thanks for the in-depth report, bigbourin@gmail.com; it's greatly appreciated. As a first step, can you please provide the "diagnostic.data" folder from both servers via our Support Uploader? It contains the tcmalloc data and other metrics on a timeline, so we can see how they change over time. I'd like to compare that information to the tickets you linked.
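If it helps in locating the folder, it lives under the node's dbPath (assuming the default full-time diagnostic data capture settings), which can be checked from the shell, for example:

// show where mongod keeps its data; diagnostic.data is a subdirectory of this path
db.adminCommand({ getCmdLineOpts: 1 }).parsed.storage.dbPath
// e.g. /var/lib/mongodb/diagnostic.data (path shown here is illustrative)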

Comment by Adrien Jarthon [ 07/Aug/19 ]

I also attached the complete serverStatus output for both servers in case you need it.
