[SERVER-3911] memory leak with journaling in windows Created: 20/Sep/11  Updated: 15/Aug/12  Resolved: 04/Mar/12

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 2.0.0
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Pete Brumm Assignee: Tad Marshall
Resolution: Cannot Reproduce Votes: 1
Labels: Windows
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

windows 64bit


Attachments: File mongo_memory.rb    
Issue Links:
Related
is related to SERVER-5194 Windows version of mongod should mana... Closed
Operating System: Windows
Participants:

 Description   

I created a simple test to just insert into a 1mb capped collection
windows
with journaling
memory growth after inserting 2m items is around 4gb and if I let it continue it will use all available memory

without journaling
memory quickly stops growing no mater how long test runs.

on linux
with journaling
memory quickly stops growing no matter how long test is run.
without journaling
memory quickly stops growing no matter how long test is run.

I tried in 1.8.3 version with same problem.



 Comments   
Comment by Tad Marshall [ 04/Mar/12 ]

I did not find evidence of a memory leak after running this test twice for days each time, but Windows's poor working set management explains all the problems that I did see. In an overcommit situation, Windows will give mongod.exe's memory mapped files enough memory to cripple the system.

Comment by Tad Marshall [ 14/Feb/12 ]

I left the Ruby program running for three days, inserting into the capped collection on a primary with two secondaries, all with limited-size oplogs (333 MB each). Memory consumption after three days was the same as it was after a few hours. I do not believe that there is an ongoing cumulative memory leak. There may be excessive memory consumption and the behavior in overcommit situations may indeed cause all available memory to be exhausted, but it isn't a "leak" in the sense that I understand that word.

Comment by Pete Brumm [ 10/Feb/12 ]

Great response. Thanks for the detail.

I wrote the ruby script the way I did because my expectation is that oplog size + mongodb instance memory would be the max size that I would expect memory to grow. After that it would stabilize. I just never saw that stabilize and either of your potential issues would show the same behavior.

and a memory leak or oplog/journaling leak would show the same as that is where the only data is being created in this example.

My guess is the second that it is a memory map release issue because it went away without journaling.

Comment by Tad Marshall [ 10/Feb/12 ]

In case I've confused things, I'm in complete agreement with you that Windows behaves quite differently when journaling is used compared to no journaling, and also that Windows and Linux behave quite differently when journaling is used. I'd like to improve what I can, but some of this behavior is built into the products by design. When journaling is used, we memory map the database twice and virtual memory usage is doubled on both Windows and Linux. In Linux, one copy of the database is mapped copy-on-write, but we can't do this on Windows because Windows wants to reserve page file space for copy-on-write memory mapped files: a 500 GB database would require 500 GB of page file if we tried this. So, on Windows, we set pages to copy-on-write individually instead of all at once. Once they are flushed to disk, they are set back to read/write. In theory, this should return the memory to its initial state: memory mapped twice but only in memory once. Where this isn't working right, I need to fix it.

An actual memory leak would behave differently. Even with limited data, it would consume memory forever. What I'm saying is that I don't think this is happening (or at least I haven't been able to make it happen yet).

That still leaves the first problem, where copy-on-write memory isn't converted back into read/write memory the way it is supposed to be. I'm trying to figure out if I have one problem or two here. I'll let my test run over the weekend and see what it looks like after that. Right now, it's at 7 million insertions and still at 3.09 GB RAM, the same amount it was using at 3.75 million.

Thanks for your patience.

Comment by Pete Brumm [ 10/Feb/12 ]

When I ran it I am not sure what the oplog was set too or whether it was larger than ram.

Free memory was consumed very in with a very steady growth. If I disabled journaling on the two set replica the issue went away.

When I ran the same script against a similar setup on linux with or without journaling the memory usage didn't show this pattern.

We ran it on a test box with 48gb of ram and a 16gb opslog. and it definitely grew past the 16gb (where I expected it to stop) and eventually filled up the 48gb.

Once your test completes, rerun without journaling and see if the memory usage changes.

Comment by Tad Marshall [ 10/Feb/12 ]

Is it possible that what is seen here is not a memory leak but just a very high (and stable, if enough RAM is available) usage of memory? In putting together a test with one primary and two secondaries (replica set) in a 4 GB VirtualBox instance of Windows Server 2008 R2, I had to limit the oplog sizes to get performance that was better than terrible. Once I did that, memory usage is very high (800+ MB in each mongod) but stable. Without limiting the oplog size it might also have tried to stabilize, but at a level well beyond physical RAM, leading to out-of-memory failures before anything like "stable" could have been achieved.

Running Pete's Ruby test with oplog sizes set to 333 MB in each instance, memory usage rose steadily to just over 3 GB (in a 4 GB VirtualBox instance) and has not budged in the last hour. It's not acting like a leak.

I'll let it run for a few days (if it doesn't hang again) and see what it looks like then. It's at 3.75 million insertions right now.

Comment by Tad Marshall [ 09/Feb/12 ]

Overnight test hung (Ruby program stuck after 231 million insertions) and I need to install code to debug it ... more info once I have a full debug environment set up.

Comment by Tad Marshall [ 08/Feb/12 ]

Starting a new test using Pete's Ruby program (writes to capped collection) on Windows Server 2008 R2 with a primary and two secondaries all running in the same instance, and I'll leave it running overnight. Testing this before, without the replica set, I was unable to duplicate the memory leak.

Comment by chris rozacki [ 07/Feb/12 ]

Is anybody working on it now ?

Comment by Tad Marshall [ 09/Jan/12 ]

Sorry about the delay. I will retest with replica sets and see if I can get this to happen.

Comment by Pete Brumm [ 23/Dec/11 ]

I don't have access to the environment to provide the output.

I was also watching the performance monitor for free memory on the box. that was the main thing that was shrinking. Eventually it would run out, but only if journaling was enabled.

try adding a replica set if you have the time.

I had an 8gb box with 4gb free and it would quickly consume that 4gb.

I had a win7 box with a 2 instance replica set. we also reproduced on a production 48gb 3 server replica set running win 2008 server.

There was a difference running with journaling enabled or disabled.

Comment by Tad Marshall [ 23/Dec/11 ]

@Pete, yes, I have journaling enabled (by default in this version) but no replica sets. Can you post the output from db.serverStatus() and db.stats() to help me duplicate the bug? Thanks!

Comment by kobevaliant [ 23/Dec/11 ]

I believe I encountered the same bug. SERVER-4522

Comment by Pete Brumm [ 22/Dec/11 ]

are you running with journaling enabled?
That was the only time it would grow.

It is also possible that both the environments had a 2 or 3 server replica set.

Comment by Tad Marshall [ 22/Dec/11 ]

I retested using 2.0.0 (still on Windows 7 64-bit) and let it run longer. I'm still not seeing the memory growth you describe. I let it go to 42,000,000 iterations and memory usage was flat after the first 1,000,000.

Comment by Pete Brumm [ 21/Dec/11 ]

It was a pretty quick growth. resource monitor showed a steady growth from the beginning and never stopped. I usually stopped at a count of 2M

your env is fine. try with the older 2.0.0. Maybe it was fixed by another commit.

Comment by Tad Marshall [ 21/Dec/11 ]

@Pete, I'm running your Ruby program on Windows 7 64-bit against our current development version and I'm not seeing the continually growing memory usage you see. How long do you need to let it run to see problems? The displayed counter is now at 5000000 and memory usage has not increased at all from what it was at 1000000. Process Explorer, Task Manager, Resource Monitor and VMMap are all showing flat (horizontal) graphs for all memory components for the mongod.exe process. I can try again on Server 2008 R2 and I can try your version (2.0.0) instead of the 2.1.0-pre I'm testing, but I had expected to be able to reproduce it on Windows 7.

Comment by Pete Brumm [ 18/Oct/11 ]

try running the attached script.

in it the database doesn't grow.

it is inserting on a capped collection.

the resident memory is growing quickly

Comment by Dwight Merriman [ 18/Oct/11 ]

virtual memory should equal the size of the datafiles (or bigger) - and that is benign.

what stastic is growing? what is Private Bytes value? can you post a perfmon output?

are you getting a specific error message in the mongo log?

Comment by Mathias Stearn [ 17/Oct/11 ]

@Dwight: Could you take a look at this? Not sure if this is just an artifact of how mem usage is displayed on windows or if it is due to the way the private view is handled differently on windows

Comment by Pete Brumm [ 28/Sep/11 ]

any update on this?

Generated at Thu Feb 08 03:04:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.