[SERVER-3013] New DB file created every few seconds Created: 28/Apr/11  Updated: 29/May/12  Resolved: 02/Sep/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Martin Chlupac Assignee: Dwight Merriman
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows Server 2008, 8GB RAM, DB size: 240GB, number of collections: 2000, number of documents: 1 000 000 000


Attachments: File anatomy_of_a_disaster.rar     Zip Archive mongolog.zip     Zip Archive mongolog.zip    
Operating System: ALL
Participants:

 Description   

We started inserting new data into more or less random collections but the lock ratio was very high. (>100% with just 60 inserts/s)

After restart of mongo we started with inserts again and after few minutes the following error occurred in the log and mongo started allocating new db files like crazy. I managed to stop it before it filled the whole disk but it allocated some 50 new files. The process was stopped correctly.

log:

Thu Apr 28 17:15:05 [conn4] insert rcware.values330 140ms
Thu Apr 28 17:15:08 [conn4] insert rcware.values321 187ms
Thu Apr 28 17:15:08 [conn4] insert rcware.values323 187ms
Thu Apr 28 17:15:09 [conn4] insert rcware.values302 109ms
Thu Apr 28 17:15:14 [conn4] VirtualProtect failed 66616814000000 4000000 errno:1455 The paging file is too small for this operation to complete.
Thu Apr 28 17:15:14 [conn4] rcware.values1850 Assertion failure false db\mongommf.cpp 72
Thu Apr 28 17:15:14 [conn4] rcware.values1850 caught assertion update unindex rcware.values1850.$r_1_t_1_g_1
Thu Apr 28 17:15:14 [conn4] VirtualProtect failed 66616814000000 4000000 errno:1455 The paging file is too small for this operation to complete.
Thu Apr 28 17:15:14 [conn4] rcware.values1850 Assertion failure false db\mongommf.cpp 72
Thu Apr 28 17:15:14 [conn4] rcware.values1850 caught assertion update index rcware.values1850.$r_1_t_1_g_1 0 assertion db\mongommf.cpp:72
Thu Apr 28 17:15:14 [conn4] VirtualProtect failed 45494714000000 4000000 errno:1455 The paging file is too small for this operation to complete.
Thu Apr 28 17:15:14 [conn4] rcware.values1850 Assertion failure false db\mongommf.cpp 72



 Comments   
Comment by Martin Chlupac [ 02/Sep/11 ]

We actually stopped testing on windows because it was taking too much time and had to move back to Postgresql. I cannot provide you any more information than what was said till now, sorry.

Comment by Eliot Horowitz (Inactive) [ 02/Sep/11 ]

Please let us know if this is still happening

Comment by Martin Chlupac [ 11/May/11 ]

I will turn off the journaling and try again... but it takes some time to restore the data.

Interesting thing is that when I am importing data - the number of inserts per second is quite ok (limited by network speed) ~ several thousands records per second (but they all go into one collection). When I switch to regular data traffic, which consists mostly of "random" inserts/updates to "random" collections - the speed drops to numbers that you see in the log.

Comment by Martin Chlupac [ 11/May/11 ]

See the last attachment - there is a complete mongo log and mongostat log. There are also some screenshots which seem to show that it is not memory allocation problem though...

I have not recorded the journal file size but it was after fresh start and there were only few hundreds or thousands of inserts at most so I would expect it rather small.

Comment by Testo [ 11/May/11 ]

Okay. Now we know that it can only allocate maximum 27.4 GB for some reason and MongoDB might have gone over that. Did you manage to capture db.serverStatus or mongostat before it starts throwing error? The total size of journal files would be helpful too.

Comment by Martin Chlupac [ 11/May/11 ]

You did not specify what exactly should I test so picked two tests that seemed to make sense:

C:\Programs\testLimit>testlimit64.exe -r

Testlimit v5.04 - test Windows limits
By Mark Russinovich - www.sysinternals.com

Reserving private bytes (MB)...
Leaked 8388549 MB of reserved memory (8388549 MB total leaked). Lasterror: 8
Not enough storage is available to process this command.

C:\Programs\testLimit>testlimit64.exe -m 100

Testlimit v5.04 - test Windows limits
By Mark Russinovich - www.sysinternals.com

Leaking private bytes 100 MB at a time...
Leaked 23800 MB of private memory (23800 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 2700 MB of private memory (26500 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 200 MB of private memory (26700 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 100 MB of private memory (26800 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (26800 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
......
Leaked 0 MB of private memory (27400 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.

I also attached a complete mongo log, mongostat log and screenshots from this test.

> db.stats()
{
"db" : "rcware",
"collections" : 2007,
"objects" : 1053105968,
"avgObjSize" : 79.80527079113467,
"dataSize" : 84043406948,
"storageSize" : 118674339072,
"numExtents" : 18813,
"indexes" : 4007,
"indexSize" : 101771298176,
"fileSize" : 246773972992,
"ok" : 1
}

Comment by Martin Chlupac [ 06/May/11 ]

16GB is preallocated by OS by default and it depends on the size of RAM and number of disks....

It takes few hours to restore everything from backups and replicate the error.. anything else you would like to see when I am doing this?

Comment by Martin Chlupac [ 04/May/11 ]

The paging file is managed by OS so it can grow freely. OS reports that it has allocated some 16GB right now but it's also default size so I would not say it is a swap size issue.

What I have not said is that the server is master in master-slave pair and the same problem is replicated to the slave but without the error messages. The slave is real hw.

Comment by Dwight Merriman [ 04/May/11 ]

I suspect that mongod is using too much page file reservation currently. I will investigate. In the meantime making your paging file bigger may mitigate. How big is it currently configured for?

Comment by Testo [ 02/May/11 ]

It will help find out the problem if you can attach your os detail, db.stats() and mongostat when the problem starts happening. File mapping error is a bit hard to diagnose as the behavior is quite vary depending on OS and its configuration

Sent via PocketJIRA - www.pocketjira.com

Comment by Martin Chlupac [ 30/Apr/11 ]

swap is managed by OS so it does not have fixed size.

I will add page files to all drives so there will be hundreds of GB just for swap...

There was very high lock ratio before the issue occurred in the log - could that be related to small RAM?

Comment by Testo [ 30/Apr/11 ]

If you have only 4gb available on the os disk, it's very likely that you run out of swap space. For each file mapping, windows needs to maintain mapping table in memory. It can use a lot of memory especially when the data size is several hundred GBs.

Can you free up disk space and make sure the swap setting is set to dynamic instead of fixed size

Sent via PocketJIRA - www.pocketjira.com

Comment by Martin Chlupac [ 29/Apr/11 ]

Swap is managed by OS and it should be limited only by the free space on the disk. OS reports that there are currently 8GB allocated for page file and there are still some 4GB lefts on that disk. DB is on a different disk.

DB was at about 240GB when the log I posted started.

How is swap involved in mongo memory management anyway?

Comment by Testo [ 29/Apr/11 ]

Sounds like you run out of swap space. How much swap did you set? How big was the database before it died?

Comment by Eliot Horowitz (Inactive) [ 29/Apr/11 ]

Can you send more of the log (ideally the whole thing)?
Is this running the 64 bit version?

Generated at Thu Feb 08 03:01:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.