[SERVER-10044] Repeated Server Crash. VirtualProtect failure in mongo::makeChunkWritable (Windows 7 x64). Created: 27/Jun/13  Updated: 19/Sep/15  Resolved: 07/May/15

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.4.6, 2.6.5
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Maxime Beucher Assignee: Geert Bosch
Resolution: Won't Fix Votes: 18
Labels: VirtualProtect, Windows, crash, makeChunkWritable, server
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB 2.4.4 (2008plus build), Windows 7 64-bit, 8GB and 16GB RAM, 500GB SSD, i5-3550 CPU (3.30GHz).
Overall DB file size (C:\Mongo\db) is 68GB.
Several .NET 4.0 applications run on the same host, side by side with mongod, using up to 4GB of RAM combined. The page file size is set to 16GB.


Issue Links:
Documented
Duplicate
is duplicated by SERVER-17833 MongoD Quit with Page file error Closed
Related
is related to SERVER-17745 Improve dirty page estimation in mmap... Closed
Backwards Compatibility: Fully Compatible
Operating System: Windows
Steps To Reproduce:

Standard usage of the server with our configuration. Happens every day.

Sprint: Quint Iteration 3.1.2, Quint Iteration 3
Participants:

 Description   

Our mongo servers keep crashing several times a day, often (but not always) when RAM usage on the server is high. Our production server sometimes crashes 6 times a day. This issue has been happening on two different hosts, so a hardware cause is unlikely.

We are monitoring our servers using MMS (we could provide a temporary user account if necessary).

Here are some examples of the log entries we encountered:

mongo.27018.log	291963	Thu Jun 27 08:03:39.185 [conn711] VirtualProtect for C:/Mongo/db/27018/Forex/Forex.3 chunk 6358 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 43778048, address is 6358000000) in mongo::makeChunkWritable, terminating
...
mongo.27018.log	295678	Thu Jun 27 11:51:03.275 [conn254] VirtualProtect for C:/Mongo/db/27018/Forex/Forex.0 chunk 6806 failed with errno:1455 Unknown error (chunk size is 16777216, address is 6a58fd0000) in mongo::makeChunkWritable, terminating



 Comments   
Comment by Geert Bosch [ 11/May/15 ]

Actually, you will be able to use the Azure instances; you just need to specify a secondary pagefile on the remote volume.

With Windows you can have multiple page files, and it is fine to have pagefiles on slow volumes.

Windows will prefer using the faster files, and MongoDB will never actually need the pagefile in normal operation. Windows just requires it to be present in order for MongoDB to memory map files for writing.
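
For example, one common way to add a secondary pagefile on another volume is from an elevated command prompt with wmic (the drive letter and sizes below are placeholders, and a reboot is needed for the change to take effect):

wmic computersystem where name="%computername%" set AutomaticManagedPagefile=False
wmic pagefileset create name="E:\pagefile.sys"
wmic pagefileset where name="E:\\pagefile.sys" set InitialSize=32768,MaximumSize=32768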

Comment by Nathan Arthur [ 08/May/15 ]

Geert,

Thanks for the detailed update on this. It helps to have a better understanding of it.

Our production systems (2.6.x) are routinely affected by this. So far we've figured out that we should set the min and max page file sizes to the same value, and we have empirically found that 16,000 MB minimizes the risk. (Based on earlier comments, we assumed that bigger wouldn't help.) But it still happens.

So reading your comments - our system nearly has the behavior for which you suggest the fix might be useful (single thread / lots of small writes / different memory regions). We don't actually use a single thread, but we do all our work in small batches, and we write to many different collections concurrently. (It also seems like this problem tends to happen when there are large queries running - have you seen that?) The point is that blocking all our threads at once would be fine; the entire batch would just end up taking longer. Our system wouldn't create the "more active write operations" scenario.

On the other hand, we do tend to see other memory-intensive services on the host also die when this happens, so maybe your fourth bullet is really the major issue.

Still, though - can you reconsider whether to make this fix, maybe even if it had to be enabled explicitly somehow? Right now, we just have to admit to our customers that we chose a database that won't reliably stay up on Windows. That's a hard thing to say, and it certainly puts our customers off from MongoDB, as well. It would be much better if the operation just failed - we already have automatic recovery code for that type of thing.

Then, looking at the Production Notes, they suggest a 32GB (or larger) page file. We do most of our hosting on Azure, and we've been experimenting with some of their new instances that provide higher disk performance - but the "D" drive (i.e. the locally-attached ephemeral storage) on those instances is very small - less than 32GB. So if we really do have to have at least a 32GB page file, that means we either have to put it on a slow (i.e. remote) volume, or that we can't use that type of instance. So here the message becomes "we can't use that type of environment because MongoDB prevents it".

Finally, it seems like the underlying issue here is essentially a limitation of Windows (vs. Linux). There's nothing in the system that actually uses all that allocated storage, and yet we still have to allocate it to make it work. Have you done anything to push on Microsoft to change the limitation, in the long term? It would be nice to have some hope that eventually this can be truly fixed.

Thank you!

Comment by Geert Bosch [ 27/Apr/15 ]

The fix in SERVER-17745 addresses an issue where we would underestimate the amount of memory made writable and not remap memory mapped files early enough. However, even with this fix it is possible with a default page file configuration to run into VirtualProtect failures due to overcommitment.

We investigated the possibility of retrying operations and recovering from such errors after any dirty pages have been journaled and files have been remapped. This approach turned out to have a few fundamental problems. The real fix is to follow the newly added recommendations from the Production Notes on configuring the pagefile.

Retrying has the following issues:

  • The retry needs to wait until memory has been remapped
  • In the meantime, there is very little memory available
  • Other threads will not be able to allocate more memory, so will fail with WCE as well
  • There is a good chance that allocations outside of `makeChunkWritable` start to fail and bring the system down
  • Even once remapping has been done and enough memory is available, there are now more active write operations.
  • Even when avoiding the thundering herd problem by exponential back-off, writes that retry typically don't release locks and thus block readers

So, the case in which retrying with WCE is an actual solution is where there is a single thread with lots of small writes in many different memory regions. The downsides for the more realistic cases with multiple clients are such that they outweigh the limited benefits. In all cases, provisioning larger pagefiles and, most importantly, avoiding dynamic resizing, is a far better solution.

Comment by Githook User [ 23/Apr/15 ]

Author:

{u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}

Message: SERVER-10044: Simplify DurableInterface

Remove Remove writingPtr(), writing() and declareWriteIntent(). These
are confusing and should not be used. Only DataFileHeader needs direct
access besides the RecoveryUnit and that's a special case, so just make
it use the existing declareWriteIntents function.
Branch: master
https://github.com/mongodb/mongo/commit/1f24d43e419b674bfe866bdebf8224d44fa163b0

Comment by Githook User [ 23/Apr/15 ]

Author:

{u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}

Message: Revert "SERVER-10044: Simplify DurableInterface"

This reverts commit 878f4e1a3138ddfc0d7226e003bd48e56c8bbf8a.
Branch: master
https://github.com/mongodb/mongo/commit/c160c0f85cdb1a36f68bc6512928a7a9f040b951

Comment by Githook User [ 22/Apr/15 ]

Author:

{u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}

Message: SERVER-10044: Remove confusing message that is no longer applicable with rollback
Branch: master
https://github.com/mongodb/mongo/commit/db96b5908d2da1be53b05a957fc63ed13d6365bd

Comment by Githook User [ 22/Apr/15 ]

Author:

{u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}

Message: SERVER-10044: Simplify DurableInterface

Remove Remove writingPtr(), writing() and declareWriteIntent(). These
are confusing and should not be used. Only DataFileHeader needs direct
access besides the RecoveryUnit and that's a special case, so just make
it use the existing declareWriteIntents function.
Branch: master
https://github.com/mongodb/mongo/commit/878f4e1a3138ddfc0d7226e003bd48e56c8bbf8a

Comment by Geert Bosch [ 17/Apr/15 ]

I have verified that the initial page file size is important, as calls that allocate virtual memory will fail with ERROR_COMMITMENT_LIMIT (1455) while the page file is being expanded. So, in the common and default case where the initial page file is far smaller than its maximum size, the error is transient and just waiting makes sense.

C:\Users\Administrator>d:\testlimit64.exe -m 64
 
Testlimit v5.04 - test Windows limits
By Mark Russinovich - www.sysinternals.com
 
Leaking private bytes 64 MB at a time...
Leaked 6912 MB of private memory (6912 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 832 MB of private memory (7744 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 1216 MB of private memory (8960 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 1344 MB of private memory (10304 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 1472 MB of private memory (11776 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 1728 MB of private memory (13504 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (13568 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (13568 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (13632 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (13632 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (13696 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (13696 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (13760 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (13760 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (13824 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (13824 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (13824 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (13888 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (13888 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (13952 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (13952 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (14016 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14016 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (14080 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14080 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (14144 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (14208 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14208 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (14272 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14272 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (14336 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14336 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14336 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (14400 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14400 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (14464 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14464 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (14528 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14528 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 64 MB of private memory (14592 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14592 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14592 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14592 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14592 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
Leaked 0 MB of private memory (14592 MB total leaked). Lasterror: 1455
The paging file is too small for this operation to complete.
(...)

Comment by Andy Schwerin [ 20/Mar/15 ]

Addressing the VirtualProtect failures may be possible in 3.0 and master by allowing the individual write operations to fail via an exception. We can soak test this by adding a fail point to makeChunkWritable, much as we've done with WriteConflictException in WiredTiger. Actually, we might just be able to use WriteConflictException for this. No fix is possible on 2.6 or 2.4, I'm afraid. It is only the storage engine work from 3.0 that makes any solution tractable.

Comment by Nick Judson [ 20/Mar/15 ]

@Andrei Saprykin, what happens if you put your code in a retry loop - say retry X times after sleeping for Y? I'm not a C/C++ programmer otherwise I'd try it myself. Just wondering how long it takes for the virtual memory to be allocated.
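
For what it's worth, a minimal standalone sketch of such a retry loop (not from this ticket; the allocation size, retry count, and sleep interval are arbitrary choices):

#include <windows.h>
#include <iostream>

int main() {
    const SIZE_T nBytes = 36ULL * 1024 * 1024 * 1024;  // ~36GB on x64, as in the experiment below
    LPVOID base = NULL;

    // Retry a handful of times, sleeping between attempts, to see whether the
    // ERROR_COMMITMENT_LIMIT (1455) condition clears once the page file has grown.
    for (int attempt = 0; attempt < 10 && base == NULL; attempt++) {
        base = VirtualAlloc(NULL, nBytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        if (base == NULL) {
            DWORD err = GetLastError();
            std::cout << "attempt " << attempt << " failed, error " << err << std::endl;
            if (err != ERROR_COMMITMENT_LIMIT)
                break;                 // some other failure; retrying won't help
            Sleep(5000);               // give Windows time to expand the page file
        }
    }

    if (base == NULL) {
        std::cout << "allocation never succeeded" << std::endl;
        return 1;
    }
    std::cout << "allocation succeeded" << std::endl;
    VirtualFree(base, 0, MEM_RELEASE);
    return 0;
}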

Comment by Andrei Saprykin [ 20/Mar/15 ]

https://github.com/mongodb/mongo/blob/eaa338b6668e0d7e1a9e1b7f20f04c2f33815d3c/src/mongo/db/storage/mmap_v1/aligned_builder.cpp calls the Windows VirtualAlloc function:
void AlignedBuilder::_malloc(unsigned sz) {
    _p._size = sz;
#if defined(_WIN32)
    void *p = VirtualAlloc(0, sz, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    _p._allocationAddress = p;
    _p._data = (char *) p;
#elif defined(__linux__)

I experimented with this function a little bit using the following code (it tries to allocate 36GB):

#include <windows.h>
#include <iostream>
using namespace std;

int main() {
    long long N = 1024LL * 1024 * 128 * 36;
    long long nBytes = N * 8;    // ~36GB

    LPVOID base = VirtualAlloc(NULL, nBytes, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (!base) {
        cout << "Virtual allocation failed\n";
        cout << "Error: " << GetLastError();
        return 1;
    }
    return 0;
}

It was tested on a system with 32GB RAM and a 3GB (min) - 8GB (max) page file. I expected that I could allocate something close to 40GB without problems.
The first run fails (the error code is exactly the same, 1455). Windows shows that the page file size remains 3GB.
If I increase the page file minimum to 8GB, the test works with no problems. Now Windows meets my expectations.
So the recommendation from the previous post to increase the page file minimum looks like the only option.

Comment by Nick Judson [ 18/Mar/15 ]

I also experienced this under MMAPv1 with a system-set minimum swap size. Since increasing my swap, I haven't seen it. I have since switched to WiredTiger and haven't had the problem at all (without needing to change virtual memory settings). I'd recommend trying that out if possible.
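
For reference, on MongoDB 3.0 and later the storage engine can be selected at startup, for example:

mongod --storageEngine wiredTiger --dbpath C:\data\db

(the dbpath above is just an example; an existing MMAPv1 data directory cannot be opened by WiredTiger, so the data has to be resynced or dumped and restored first).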

Comment by sam Jones [ 18/Mar/15 ]

Another day, another dead mongodb server.... lost data. This is very old, folks...

Comment by ievgen [ 17/Mar/15 ]

I can reproduce this issue only when smallfiles is set to true.

Comment by Rahim Alizada [ 17/Mar/15 ]

Update: 3.0 with mmapv1 is affected too

Comment by AndrewK [ 17/Mar/15 ]

I am truly disappointed by the lack of attention this issue has received. If the Windows build had any credibility within the team, this issue would have been addressed by now, as it's critical to the general stability of the product and seems to be affecting a good number of people. It's been marked as Critical for almost 2 years with nothing more than lip service being paid during that time. Mark Benvenuto, Andy Schwerin, Mathias Stearn, Ramon Fernandez et al: Gentlemen, there comes a time where you either need to pony up the time to address things like this, or publicly state "we don't intend to fix it" and close it - and then put a big fat sticker on the download page alongside the Windows links saying "may cause indigestion / is not production ready". I understand the fact that at the end of the day, Mongo is free, but please don't just leave tickets languishing like this. It does no good to your community. Address them or close them.

Comment by Rahim Alizada [ 26/Feb/15 ]

I have been experiencing this issue on my dev machine all day while testing my app.
Here are some details I think might be relevant:

Win 8.1 x64, physical machine (not a VM)
8GB RAM
Page file min=max=6GB
MongoDB 2.6.7 x64 2008plus version
DB size approx 22GB
SSD, free space more than 10GB

I have a collection with approx 70,000 entries with binary data fields (image data, 150-300KB per entry), no additional indexes; the _id field is a String.
The DB is accessed via Morphia.

Here are the operations that cause exception in my case:

  • restart MongoDB
  • MongoDB uses very little memory;
  • start the application
  • app queries that collection and retrieves only _id fields;
  • app checks and compares retrieved _id values and comes up with _ids that need to be deleted (initially around 800 entries);
  • app tries to delete entries by _ids (bulk) and MongoDB throws that exception.
  • if the app tries to delete entries one by one, a couple dozen entries get deleted, but then the exception is thrown again.
  • during the delete operation, the mongod working set increases to 6GB.

Same on 2.6.8

Comment by Yan [ 26/Jan/15 ]

Reproduced on 2.6.7, Win Server 2008 Ent, x64

Comment by ievgen [ 06/Dec/14 ]

When I set 16GB as both the initial and maximum size of the swap, it never crashed again.
But I saw a significant deterioration in performance - about 200 ms for inserts.

Comment by Nick Judson [ 05/Dec/14 ]

Even increasing the swap size only delayed the error. It appears that high-volume concurrent bulk inserts cause this in my case.

Comment by Nick Judson [ 05/Dec/14 ]

I get this same error on Windows 8.1 with an SSD (no VM) on 2.6.4; 2.8.0-rc1 panics and becomes unusable. As mentioned above, I would expect mongo to slow down, but not crash.

I can repro this consistently within 30 seconds of starting mongod. I'm about to try changing the swap settings.

Comment by Andrei Saprykin [ 25/Nov/14 ]

Question for ievgen: when you increased the swap size, did you increase the initial or the max?

Comment by ievgen [ 25/Nov/14 ]

My server is a VM with Windows Server 2008 R2, 4GB RAM, and a custom swap size (initial 6142MB, max 16384MB).

My application restores data from a backup (mass removal and insertion).
It reads data from a file (22GB) and then replaces the corresponding data in the database (30GB).
After about 1 hour of intensive operations mongod crashes.
Increasing the swap size up to 25GB does not help.

This error from the production system: MongoDB v2.6.5 - http://pastebin.com/raw.php?i=zjCbgTyP

As requested in this message: https://jira.mongodb.org/browse/SERVER-10044?focusedCommentId=735943&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-735943
I have reproduced the same error in the latest version v2.8.0-rc1 (log http://pastebin.com/raw.php?i=4ivTrE9n )
Here is a Process Explorer performance graph: http://i57.tinypic.com/69os9y.png
In the picture, high memory usage begins at 14:21:23 and lasts 23 minutes (until 14:44:00), then mongod crashes.

Comment by sam Jones [ 18/Nov/14 ]

This has been hitting me over and over.
Single server, Win2012

2014-09-20T22:32:52.482-0500 [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2014-09-20T22:32:52.482-0500 [initandlisten] db version v2.6.4
2014-09-20T22:32:52.482-0500 [initandlisten] git version: 3a830be0eb92d772aa855ebb711ac91d658ee910
2014-09-20T22:32:52.482-0500 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
2014-09-20T22:32:52.482-0500 [initandlisten] allocator: system

This is killing us. We spend hours a month dealing with it, and we have lost data.
When is a fix coming?

Comment by Andrei Saprykin [ 18/Nov/14 ]

Hi,

I have the same problem running MongoDB 2.6.4.
Both servers are VMs running Windows 7 SP1 64-bit. 8GB RAM on each.
server1: mongos, config server, mongod
server2: mongod
Both shards occupy 42GB of disk and it has not changed for many days.
Shard on server 2 dies after approx 24 hours. It happened at least 3-4 days in a row.
Page file on server 2 was 16GB min/24 GB max. For server 1 it is 12GB min/24GB max.

mongod died with:
2014-11-17T22:04:53.851+0100 [conn2] request split points lookup for chunk xdm.hit { : -4936549356921932270 } -->> { : -4918894120284504670 }
2014-11-17T22:04:59.335+0100 [conn8] VirtualProtect for C:/data/db/xdm.17 chunk 7489 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is 7504000000) in mongo::makeChunkWritable, terminating
2014-11-17T22:04:59.335+0100 [conn8] xdm.hit Fatal Assertion 16362
After the death, the page file remained at 16GB (surprise!!!). The page file is also at its minimum on server 1. Disks have more than 20GB of free space on both systems.

Workload: massive insert every 10 min. Massive delete every 10 min, shifted by 7 min from the insert. Occasional long-running queries.

Comment by Sergio Jiménez Benito [ 13/Nov/14 ]

Hi,

I have prepared a non virtualized 2-node replica set with the following characteristics:

Intel Xeon E5-2407 (2.20Ghz)
32GB RAM
2TB HDD RAID 5 (15k)
Windows Server 2012 R2 Standard
MongoDB 2.6.5

I still have the same problem after 24-48 hours.
I tried to resynchronize a node from scratch and it stopped with the same error after 5 minutes.

2014-11-07T13:12:02.061Z [repl writer worker 13] VirtualProtect for D:/MongoDB/Data/XXX/XXX.190 chunk 16679 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is 1049c000000) in mongo::makeChunkWritable, terminating

Comment by Chad Kreimendahl [ 08/Oct/14 ]

It would be interesting to know why this impacts Windows specifically and not Unix systems. We've got both, and on both we typically provide substantial physical memory and shut off swap almost entirely... and on Unix set the swappiness to 1. Why would a swap-less Unix system not do the same thing, given the same hardware and databases?

Comment by Sergio Jiménez Benito [ 08/Oct/14 ]

Hi!

I have had the same problem for a few months now with a virtual machine. I prepared a physical machine to try to resolve the problem, but the database still fails after working for some time. Virtual memory usage grows until the MongoDB process dies.

http://s15.postimg.org/dyziwaknf/Memory.png

PS: I am not using SSDs.

Comment by Ben Carrier [X] [ 08/Oct/14 ]

@AndrewK we are using SSDs.

Comment by AndrewK [ 08/Oct/14 ]

@Ben, are you using SSDs? I ask only because I've never seen this issue on non-virtualised environments (which have always had SSDs) - even when the swap file has been seriously constrained. Not to say that means it shouldn't happen, but might shed more light on where the issue is.

Comment by Ben Carrier [X] [ 08/Oct/14 ]

Hi Ramon, I reported in an earlier comment that this happens to us as well. We can report it happens in Windows 7 and Windows Server 2012 on physical boxes. Not virtualised systems. So your assumption is wrong that it only happens on virtualised systems.

Some haste given to this Major issue would be greatly appreciated. I have been following this bug for quite a while now and all you've done is add some extra error messages.

Best regards,

Ben

Comment by sam Jones [ 08/Oct/14 ]

Ramon and team,

Allow me to make a small "from a customer" observation.

I can handle it if Mongo slows down. I can handle it (in my application) if Mongo drops some writes because it is too busy (if this were to happen, it would be nice if it were logged). I can handle it if Mongo feels it needs more resources than I have given it (please give a detailed log entry explaining why resources are thin, what element is consuming the most RAM,... help me help myself).

But I can't handle a persistence layer that just.... dies.

Nothing in my stack dies. Not mysql. Not sql server. Not lucene. Not elasticsearch. Not IIS. Not CentOS. Not Ubuntu. Not Windows. Not Ncache. Not redis. Not VMWare.

But mongo... it dies.

And look at it from my chair. My old mongo server (running whatever was current a few years ago) ran in a VM on a single hard drive (not even mirrored). It actually corrupted the db once in a while (which, in my app, I didn't care about), but it never went down. Now I run the current release, on a VM (everything I have is a VM, please don't ask for metal, I don't have any!), on a big RAID10 spindle set, with triple the RAM I had before (and, admittedly, a DB that is 2x or 3x larger), and Mongo dies. Mongo is fast now (it used to be horribly slow). But sometimes it dies.

When I first set up this server, it was dying within the first hours. Then, with a large paging file, it only dies every couple of months.

And when it dies, some sysadmin gets an alert, and goes digging around to deal with it, and an incident report gets written. And 10 or more people know "Mongo terminated for no – to us – sensible reason, but it is running again now."

Thank you for understanding.

Comment by Ramon Fernandez Marina [ 07/Oct/14 ]

All,

as explained above by Mathias and Mark, this issue is most likely caused by a page file that's not sufficiently large to hold all the pending writes in the current working set. Since this issue has so far only been observed on virtual machines, it is very possible that the source of the problem is that growing the page file to accommodate mongod's memory needs takes too long in virtualized environments, thus prompting the OS to kill applications consuming large amounts of memory.

Increasing the size of the page file should help as long as the new size can hold the working set, so the page file does not need to be grown at any time. Lowering the value of journalCommitInterval or storage.journal.commitIntervalMs will help reduce the amount of memory the OS must set aside for dirty memory pages.
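
In the 2.6 YAML configuration file format that setting looks like the following (the 50 ms value is only an example; the default is 100 ms, and lower values mean Windows must hold fewer dirty pages between journal commits):

storage:
  journal:
    commitIntervalMs: 50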

Starting on version 2.7.1, mongod logs additional information when killed by the operating system on out-of-memory conditions. If someone can reproduce the issue on a testing environment with 2.7.1 the additional logs may help clarify the situation.

Regards,
Ramón.

Comment by sam Jones [ 05/Oct/14 ]

We had this happen just now:
v2.6.4 on Win2012 with 30GB RAM; paging file is 12GB min / 25GB max

Not pretty!

>>>>>>>>>>>
2014-10-01T00:24:05.664-0500 [conn3421] insert dotnetwrapper_mongodb.LogItem ninserted:1 keyUpdates:0 numYields:0 locks(micros) w:1814629 1818ms
2014-10-01T00:24:05.914-0500 [conn3419] VirtualProtect for c:/MongoDB/Data/dotnetwrapper_mongodb/dotnetwrapper_mongodb.309 chunk 46821 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is 2db94000000) in mongo::makeChunkWritable, terminating
2014-10-01T00:24:05.914-0500 [conn3419] dotnetwrapper_mongodb.LogItem Fatal Assertion 16362
2014-10-01T00:24:08.082-0500 [conn3419] mongod.exe ???

Comment by AndrewK [ 24/Jul/14 ]

Ok - was hoping you were going to say much less RAM - as that is the only way I've been able to repro this, with about 4GB. As I said before, we're back to non-VM and haven't had issues.

Comment by Christos Manios [ 24/Jul/14 ]

@AndrewK - Our virtual machines run with 10GB of RAM and 5 CPUs, on Windows 2012 R2 64-bit using Hyper-V as the virtualisation environment. MongoDB is configured in replica set mode with GridFS enabled. Pagefile size is currently set to "system managed" and has allocated 30482MB on the HDD.

Also I noticed the following while observing memory usage via SysInternals Process Explorer (procexp.exe):
1. When mongod.exe is started with large pages enabled (Windows default), system commit is very low, while physical memory progressively reaches limits depending on database usage.
2. When mongod.exe is started with large pages disabled, system commit progressively reaches limits and after that physical memory progressively reaches limits depending on database usage.

Comment by AndrewK [ 24/Jul/14 ]

@christos - how much RAM did this VM have?

Comment by Christos Manios [ 24/Jul/14 ]

I experience the same problem running MongoDB 2.6.0 on a Windows 2012 R2 host using Hyper-V.
I tried the following, but none of it seems to solve the issue:

1. Set pagefile to custom size.
2. Set pagefile size to system managed.
3. Disabling large pages on the Windows host using gpedit.msc (Computer Configuration > Windows Settings > Security Settings > Local Policies > User Rights Assignment > Lock pages in memory)

The error shown in mongo.log is the following:
2014-07-23T15:20:18.295+0300 [conn20430] VirtualProtect for C:/mongodb/data/db/artifact2.5 chunk 186214 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is b5d98000000) in mongo::makeChunkWritable, terminating
2014-07-23T15:20:18.326+0300 [conn20430] artifact2.docs Fatal Assertion 16362
2014-07-23T15:20:21.951+0300 [conn20445] end connection 192.168.1.1:54482 (780 connections now open)
2014-07-23T15:20:21.967+0300 [conn20444] end connection 192.168.1.1:54481 (779 connections now open)
2014-07-23T15:20:22.232+0300 [conn20430] mongod.exe ???
2014-07-23T15:20:22.232+0300 [conn20430]

***aborting after fassert() failure

Comment by Ben Carrier [X] [ 05/Jun/14 ]

We are experiencing the same issue:

[conn6550] VirtualProtect for C:/data/db/SignalDB.4 chunk 4149 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is 40d4000000) in mongo::makeChunkWritable, terminating

Our issue is experienced when running MongoDB 2.4.8 on Windows 7.

Comment by Githook User [ 22/May/14 ]

Author:

{u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}

Message: SERVER-10044: Add more diagnostic information on OOM
Branch: master
https://github.com/mongodb/mongo/commit/7a85cae38fba537980d6c9fc573ef077fd2df74e

Comment by AndrewK [ 21/May/14 ]

Thanks for the info Mark. A few things:

1. increasing the pagefile, massively in some cases, has not resolved the issue for us
2. having an initially large page file may make a difference - I can't remember if we ever had it set to its maximum size and saw a crash
3. we will consider trying this

during your tests:
4. how many files did you have to create before the crash?
5. how much physical RAM was available to this test environment?
6. did you see the crash at the same file number each time? We don't tend to see the crash until around 150GB of files have been allocated. As we only have 4GB RAM, I wonder if Windows is running out of RAM with regard to memory that can't be paged to disk (since we don't seem to see any improvement even with a massive page file)

Comment by Mark Benvenuto [ 14/May/14 ]

I was able to repro your issue on my Windows 8.1 machine. I created a test case that emulates mongod behavior by creating N read-only 256 MB memory-mapped files and then calling VirtualProtect to mark them as read-write. This triggered the same 1455 error, and when I checked GlobalMemoryStatusEx for free page file memory, Windows reported 227 MB, which was not enough to fit a 256 MB file.
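
A rough standalone sketch of that pattern (this is not the actual test harness; the file path and sizes are placeholders, and the exact flags mongod uses in src/mongo/util/mmap_win.cpp may differ). The idea is to map a file read-only so no page file space is committed up front, then upgrade the view to copy-on-write with VirtualProtect, which is where the commit charge is taken and where error 1455 shows up:

#include <windows.h>
#include <cstdio>

int main() {
    const unsigned long long kSize = 256ULL * 1024 * 1024;   // one 256 MB file, as in the test above

    // Create and size a scratch file (path is a placeholder).
    HANDLE file = CreateFileW(L"C:\\temp\\cowtest.dat", GENERIC_READ | GENERIC_WRITE,
                              0, NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) {
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }
    LARGE_INTEGER size;
    size.QuadPart = kSize;
    SetFilePointerEx(file, size, NULL, FILE_BEGIN);
    SetEndOfFile(file);

    // Map the file, but view it read-only first: no page file commit is charged yet.
    HANDLE mapping = CreateFileMappingW(file, NULL, PAGE_READWRITE, 0, 0, NULL);
    void* view = mapping ? MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0) : NULL;
    if (view == NULL) {
        printf("mapping failed: %lu\n", GetLastError());
        return 1;
    }

    // Upgrading the view to copy-on-write is where Windows takes the commit
    // charge; with an exhausted page file this is the call that fails with 1455.
    DWORD oldProtection;
    if (!VirtualProtect(view, (SIZE_T)kSize, PAGE_WRITECOPY, &oldProtection)) {
        MEMORYSTATUSEX ms = { sizeof(ms) };
        GlobalMemoryStatusEx(&ms);
        printf("VirtualProtect failed: %lu (available page file: %llu MB)\n",
               GetLastError(), ms.ullAvailPageFile / (1024 * 1024));
        return 1;
    }
    printf("chunk made writable\n");
    return 0;
}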

I would recommend a few things:

  1. Increase the page file size
  2. Ensure the page file has a higher initial size, as jdschepper suggested.
  3. Reduce --syncDelay to 45 seconds, as this will reduce MongoDB's commit usage. The downside of this change is that mongod will pause more often as it flushes in the background.

Comment by Sergio Jiménez Benito [ 12/May/14 ]

The database stopped again this weekend (10/05/2014 15:00). These are the MMS graphs: http://oi57.tinypic.com/2yyw682.jpg

I have just activated logman to record the next stop.

Comment by Jacques De Schepper [ 12/May/14 ]

In case it helps, I will relay our experience with this exact same issue.

We experienced this on a customer install (VMWare with Windows Server 2008 R2 Enterprise). The machine had tons of disk, though only 8GB of RAM. The Virtual Memory settings were custom set to:

  • Initial Size 2GB
  • Max Size 64GB

The problem kept occurring daily. Reading up a bit it was conjectured that the error might occur while resizing the page file. So I changed the VM settings to:

  • Initial Size 32GB
  • Max Size 64GB

And we haven't had the problem since, even though the database has grown quite a bit. We locally run a test setup to simulate their configuration and never had a problem there... The local setup doesn't run in a VM, though. Which makes me believe the crash occurs (as suggested somewhere earlier in this thread) when the system temporarily refuses allocation because it is resizing the pagefile. I am by no means an expert on the Windows OS, but that's the behaviour we observed. If you want additional info or want me to run additional tests, feel free to let me know.

Comment by Mark Benvenuto [ 09/May/14 ]

My running hypothesis is still that you are running out of memory, and need to increase the physical memory and/or page file size on the machine.

logman delete mongodb_log
logman create counter mongodb_log -c "\Memory\*" "\Process(mongod)\*" -si 1  -f bincirc -ow -max 10
logman start mongodb_log

This will create a circular 10MB log file that will capture memory info and hopefully give me more insight into what is happening. Send me the log file after mongod.exe crashes; it should show the memory usage at the time of the crash. The log file will never grow above 10MB in this case.

Comment by Sergio Jiménez Benito [ 09/May/14 ]

That is all. I don't understand why it doesn't return all values. I am using Windows Server 2008 R2.

This is the server configuration: http://oi60.tinypic.com/2vxf7rk.jpg

Comment by AndrewK [ 09/May/14 ]

Sergio, you're missing values in that output. Can you rerun the script?

Comment by Sergio Jiménez Benito [ 09/May/14 ]

Physical Mem: 5119 MB
PageFile Location:
PageFile Allocated Size: MB
PageFile Current Size: MB
PageFile Peak Size: MB

I'm using MMS.

Comment by AndrewK [ 07/May/14 ]

We ran this on a VM just after the process had crashed (see below for output from a non-crashed node).

Also, these crashes only ever happen on a secondary - never a primary.

Secondary shortly after crash:

PS C:\Users\Administrator> $server = ".";
PS C:\Users\Administrator>
PS C:\Users\Administrator> $PhysicalMem = gwmi -computer $server Win32_ComputerSystem | % {[Math]::round($_.TotalPhysicalMemory/1MB,0)}

PS C:\Users\Administrator> $PF = gwmi -computer $server Win32_PageFileUsage
PS C:\Users\Administrator>
PS C:\Users\Administrator> Write-Host "Physical Mem: $PhysicalMem MB"
Physical Mem: 4096 MB
PS C:\Users\Administrator> Write-Host "PageFile Location: $($PF.Name)"
PageFile Location: C:\pagefile.sys
PS C:\Users\Administrator> Write-Host "PageFile Allocated Size: $($PF.AllocatedBaseSize) MB"
PageFile Allocated Size: 10240 MB
PS C:\Users\Administrator> Write-Host "PageFile Current Size: $($PF.CurrentUsage) MB"
PageFile Current Size: 322 MB
PS C:\Users\Administrator> Write-Host "PageFile Peak Size: $($PF.PeakUsage) MB"
PageFile Peak Size: 837 MB

-------------

Secondary which has been active for a while:

Windows PowerShell
Copyright (C) 2012 Microsoft Corporation. All rights reserved.

PS C:\Users\Administrator> $server = ".";
PS C:\Users\Administrator>
PS C:\Users\Administrator> $PhysicalMem = gwmi -computer $server Win32_ComputerSystem | % {[Math]::round($_.TotalPhysicalMemory/1MB,0)}

PS C:\Users\Administrator> $PF = gwmi -computer $server Win32_PageFileUsage
PS C:\Users\Administrator>
PS C:\Users\Administrator> Write-Host "Physical Mem: $PhysicalMem MB"
Physical Mem: 4096 MB
PS C:\Users\Administrator> Write-Host "PageFile Location: $($PF.Name)"
PageFile Location: C:\pagefile.sys
PS C:\Users\Administrator> Write-Host "PageFile Allocated Size: $($PF.AllocatedBaseSize) MB"
PageFile Allocated Size: 10000 MB
PS C:\Users\Administrator> Write-Host "PageFile Current Size: $($PF.CurrentUsage) MB"
PageFile Current Size: 452 MB
PS C:\Users\Administrator> Write-Host "PageFile Peak Size: $($PF.PeakUsage) MB"
PageFile Peak Size: 867 MB

Comment by Mark Benvenuto [ 06/May/14 ]

Can you run the following script in powershell, and send me the output:

$server = ".";
 
$PhysicalMem = gwmi -computer $server Win32_ComputerSystem | % {[Math]::round($_.TotalPhysicalMemory/1MB,0)} 
$PF = gwmi -computer $server Win32_PageFileUsage 
 
Write-Host "Physical Mem: $PhysicalMem MB"
Write-Host "PageFile Location: $($PF.Name)"
Write-Host "PageFile Allocated Size: $($PF.AllocatedBaseSize) MB"
Write-Host "PageFile Current Size: $($PF.CurrentUsage) MB"
Write-Host "PageFile Peak Size: $($PF.PeakUsage) MB"

Comment by AndrewK [ 06/May/14 ]

Sergio, are you running gridfs?

Comment by AndrewK [ 06/May/14 ]

Sorry, more info, page file has been various sizes up to 20GB, physical was 200GB (under the VM) and then 800GB on the host. Database was about 140GB if I remember correctly. RAM was 4GB.

Comment by AndrewK [ 06/May/14 ]

Mark, increasing the page file has no effect. As per the comments we've got up to 10GB and since then we've tried 20GB. It doesn't fix the problem. Again, I've run systems where the size of the paging file has been well under 10GB without issues - but this was not on a VM.

Comment by Mark Benvenuto [ 06/May/14 ]

What is the size of your page file, your physical RAM, and your database? You should be able to fix this by simply increasing the size of your page file in this case.

http://windows.microsoft.com/en-us/windows/change-virtual-memory-size#1TC=windows-7

Background
This error occurs when mongod calls VirtualProtect to prepare for a write, which forces Windows to commit memory that was previously backed only by a file mapping, causing the system commit charge to exceed available memory. For Windows to acknowledge a virtual memory commit, it needs to find enough free memory between physical RAM and the page file to satisfy the request, which it failed to do here. Note that even though Windows asks for memory from the page file, that does not mean it will actually use the page file; it reserves it just in case it may. Windows is simply conservative in how it commits copy-on-write pages, since it does not want to deal with out-of-memory errors when creating page copies during writes.

When copy-on-write access is specified, the system and process commit charge taken is for the entire view because the calling process can potentially write to every page in the view, making all pages private.

See FILE_MAP_COPY in http://msdn.microsoft.com/en-us/library/windows/desktop/aa366761(v=vs.85).aspx

Comment by AndrewK [ 06/May/14 ]

Bad news, we were using Hyper-V

Comment by Sergio Jiménez Benito [ 06/May/14 ]

What virtualization software were you using? I'm going to try my luck with Hyper-V on Windows Server 2012 as a last option.

Comment by AndrewK [ 06/May/14 ]

Sergio, we've resorted to running outside of VMs because mongod.exe simply wasn't stable. All issues are now resolved. It's highly frustrating that this hasn't been resolved as running outside of VMs has real cost and management implications.

Comment by Sergio Jiménez Benito [ 05/May/14 ]

Hi AndrewK, I have exactly the same problem with a Windows Server 2008 R2 x64 running on a VMWare ESXi cluster.

Sat May 03 07:55:46.758 [conn171428] VirtualProtect for D:/db/MOURA_RT/MOURA_RT.34 chunk 4715 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 19660800, address is 49aed40000) in mongo::makeChunkWritable, terminating
Sat May 03 07:55:46.773 [conn171428] MOURA_RT.1f66f9e3-08b7-e211-99e1-005056be00c4 Fatal Assertion 16362
Sat May 03 07:55:48.273 [conn171428] mongod.exe ???
Sat May 03 07:55:48.273 [conn171428] mongod.exe ???
Sat May 03 07:55:48.273 [conn171428] mongod.exe ???

Comment by AndrewK [ 08/Nov/13 ]

Mathias, thanks for the details. How many private views are active at a time? What is the recommended page file size in this case?

I've been running mongodb since 2009 with numerous configurations and database sizes (GB to TB) and have never had this issue before. The only difference is that this cluster's nodes are running in VMs.

Comment by Mathias Stearn [ 07/Nov/13 ]

andrewk, mongod asks Windows for a "private view" of portions of the files that it is writing to. This is intended to allow us to make changes in memory without the risk that any intermediate state will be written to the file before we have written that change to the journal. On Windows, this means that the OS needs to reserve enough room to page out the entire view (a subset of the file) to the swap file. Linux is less strict about that due to its ability to over-commit memory, with the associated advantages and disadvantages.

Comment by AndrewK [ 07/Nov/13 ]

@Maxime.Beucher - were you running in a VM guest when you had this issue originally?

We have had more crashes across a number of instances in our replica set where we have a 10GB page file. At the time of the crash, we see the following in the event logs:

This is logged at the exact same time as the crash:

Application popup: Windows - Virtual Memory Minimum Too Low : Your system is low on virtual memory. Windows is increasing the size of your virtual memory paging file. During this process, memory requests for some applications may be denied. For more information, see Help.

This is logged 2 seconds later:

Log Name: System
Source: Microsoft-Windows-Resource-Exhaustion-Detector
Date: 07/11/2013 03:31:41
Event ID: 2004
Task Category: Resource Exhaustion Diagnosis Events
Level: Warning
Keywords: Events related to exhaustion of system commit limit (virtual memory).
User: SYSTEM
Computer: mongo02
Description:
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: mongod.exe (1600) consumed 15959318528 bytes

There is nothing else running on these boxes, and at the time of the crash the largest memory-consuming process had 42MB of RAM.

It appears Mongo is asking the OS for more RAM than is physically available and is thus getting paged - which is silly since the memory it needs is mapped to a file, and thus paged already, no?

Comment by AndrewK [ 17/Oct/13 ]

Dan, we've had another crash. What should I be looking at in VMMap, remembering that you need a running process to see what is going on? Mongo can run for days before it crashes, which makes it hard to catch just before it goes down.

Also, we did increase the page file significantly. What's odd is that the other replica members have the same settings and are not crashing.

Comment by Daniel Pasette (Inactive) [ 28/Aug/13 ]

Hi andrewk, as mentioned earlier, have you tried increasing your page file size or looked at VMMap to determine what is causing high page file consumption?

Comment by AndrewK [ 27/Aug/13 ]

Andy - can you please comment on where you are with this bug? thanks!

Comment by AndrewK [ 22/Aug/13 ]

We've had another crash. Please can someone respond to this as it's now making things rather unstable.

Thu Aug 22 21:21:44.323 [repl writer worker 3] VirtualProtect for C:/xxxx/data/xxxx.8 chunk 9982 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 32374784, address is 9bfa120000) in mongo::makeChunkWritable, terminating
Thu Aug 22 21:21:44.323 [repl writer worker 3] xxxx.fs.chunks Fatal Assertion 16362
Thu Aug 22 21:21:44.416 [rsBackgroundSync] Assertion failure md src\mongo\util\net\message_port.cpp 195
Thu Aug 22 21:21:44.713 [initandlisten] connection accepted from 10.56.198.6:51846 #9 (3 connections now open)
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] mongod.exe ???
Thu Aug 22 21:21:44.838 [repl writer worker 3] KERNEL32.DLL ???
Thu Aug 22 21:21:44.838 [repl writer worker 3]

***aborting after fassert() failure

Comment by AndrewK [ 12/Aug/13 ]

I've resynced the replica from scratch and there have been no issues so far.

Comment by AndrewK [ 12/Aug/13 ]

We're still seeing almost instant crashes even with a larger paging file.

Comment by AndrewK [ 12/Aug/13 ]

I'm getting the same issue in a replica set. Windows 2012 Standard, 8GB RAM, running as a VM under Hyper-V. MongoDB 2.4.5 as a secondary in a replica set. Machine has 4GB RAM (yes, I know this is constrained) with an OS-managed swapfile - which for some reason was only 786MB. This has now been set to 4GB and we'll see if there is another crash.

Mon Aug 12 09:00:10.301 [repl writer worker 1] VirtualProtect for xxxxxxxxxxx.18 chunk 10873 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is a9e4000000) in mongo::makeChunkWritable, terminating
Mon Aug 12 09:00:10.301 [repl writer worker 1] xxxxxxxxx.fs.chunks Fatal Assertion 16362
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::printStackTrace+0x3e
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::fassertFailed+0x43
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::makeChunkWritable+0x342
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::MemoryMappedFile::makeWritable+0x56
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::dur::DurableImpl::declareWriteIntent+0x69
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::dur::DurableImpl::writingPtr+0xf
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::NamespaceDetails::addDeletedRec+0x3e
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::DataFileMgr::_deleteRecord+0x309
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::DataFileMgr::deleteRecord+0x1c2
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::DataFileMgr::deleteRecord+0x14d
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe __ImageBase+0xffffffff
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe __ImageBase+0xffffffff
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe __ImageBase+0xffffffff
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe __ImageBase+0xffffffff
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe mongo::threadpool::Worker::loop+0x94
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe boost::`anonymous namespace'::thread_start_function+0x21
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe endthreadex+0x3b
Mon Aug 12 09:00:11.238 [repl writer worker 1] mongod.exe endthreadex+0xcf
Mon Aug 12 09:00:11.238 [repl writer worker 1] KERNEL32.DLL BaseThreadInitThunk+0x1a
Mon Aug 12 09:00:11.238 [repl writer worker 1]

***aborting after fassert() failure

Comment by Maxime Beucher [ 18/Jul/13 ]

It all makes sense to me now, thanks Tad.
I'll check our memory usage using VMMap as you advise.

Comment by Tad Marshall [ 17/Jul/13 ]

When journaling is enabled, mongod maps each database file into memory twice. Initially, these two mapped memory ranges are identical and do not require page file space because they are mapped to an existing disk file. When writes to the database take place, one of the mapped memory ranges is written to and the memory pages are copied and made private. On Linux, the "private view" is implicitly copy-on-write. On Windows, the "private view" is selectively made writable as pages are written to; this is the VirtualProtect call made by makeChunkWritable in the stack trace above. These pages about-to-be-modified require page file space. When the modified pages are later flushed to disk, the database file is remapped and the page file space is released to the system.

On Windows, you can use the VMMap program from Sysinternals to view the virtual memory of the mongod.exe process and see what type of memory usage is contributing to high page file consumption.

Comment by Maxime Beucher [ 17/Jul/13 ]

I was referring to the Windows requirement. The fact that every page of RAM needs to be backed by a file on disk is understandable.
What I don't get is why MongoDB's memory-mapped files need the system pagefile so much.
Well, I don't know the Windows memory mapping system well enough to make assumptions, but I still think that needing 51GiB of system pagefile for an overall DB size of 60GiB is a waste.

Comment by Yury Michurin [ 17/Jul/13 ]

Last night our server crashed with the same error. The machine is Win2008 R2 x64 with 8GB RAM, running MongoDB 2.4.1 for a month or so, and this is the first time it has crashed. Maybe related: recently the machine has been a bit low on disk, with ~10GB of free disk space. I'll try the suggestion of doubling the page file's size.

From the service.log:

Tue Jul 16 21:34:07.381 [conn8879732] VirtualProtect for C:/data/db/SMB_Statistics.8 chunk 11417 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is b264000000) in mongo::makeChunkWritable, terminating
Tue Jul 16 21:34:07.396 [conn8879732] SMB_Statistics.StatisticsData Fatal Assertion 16362
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\util\stacktrace.cpp(189) mongo::printStackTrace+0x3e
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\util\assert_util.cpp(131) mongo::fassertFailed+0x43
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\util\mmap_win.cpp(264) mongo::makeChunkWritable+0x342
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\util\mmap.h(267) mongo::MemoryMappedFile::makeWritable+0x56
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\dur_commitjob.cpp(127) mongo::dur::DurableImpl::declareWriteIntent+0x69
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\dur.cpp(235) mongo::dur::DurableImpl::writingAtOffset+0x18
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\btree.cpp(403) mongo::BucketBasics<mongo::BtreeData_V1>::basicInsert+0xd6
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\btree.cpp(1311) mongo::BtreeBucket<mongo::BtreeData_V1>::insertHere+0xa0
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\index_insertion_continuation.h(66) mongo::IndexInsertionContinuationImpl<mongo::BtreeData_V1>::doIndexInsertionWrites+0xda
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\index_update.cpp(148) mongo::indexRecordUsingTwoSteps+0x77a
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\pdfile.cpp(1767) mongo::DataFileMgr::insert+0xe3b
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\pdfile.cpp(1328) mongo::DataFileMgr::insertWithObjMod+0x55
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\instance.cpp(803) mongo::insertMulti+0x14a
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\instance.cpp(867) mongo::receivedInsert+0xaff
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\instance.cpp(441) mongo::assembleResponse+0x57a
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\db\db.cpp(194) mongo::MyMessageHandler::process+0xfa
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\mongo\util\net\message_server_port.cpp(207) mongo::PortMessageServer::handleIncomingMsg+0x56a
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180) boost::`anonymous namespace'::thread_start_function+0x21
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314) _callthreadstartex+0x17
Tue Jul 16 21:34:13.387 [conn8879732] mongod.exe f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292) _threadstartex+0x7f
Tue Jul 16 21:34:13.387 [conn8879732]

***aborting after fassert() failure

Comment by Andy Schwerin [ 05/Jul/13 ]

Are you referring to the Windows requirement that every page of memory have identified storage space on disk, or to the volume of memory required by MongoDB?

Comment by Maxime Beucher [ 05/Jul/13 ]

So far, so good. We haven't experienced a single crash since we doubled the page file size on Monday. Thanks.
Still, I think that this requirement is a real problem.

Comment by Maxime Beucher [ 01/Jul/13 ]

Alright, we'll try to double the page file size.

Comment by Andy Schwerin [ 28/Jun/13 ]

Inspecting the "non-mapped virtual memory" graph for the host in question during the June 27 period of frequent crashing, from 6:00 GMT to 15:00 GMT, there is evidence of page file exhaustion. Using the 5-minute data view, which will only be available for another several hours for the time period in question, one can watch the "non-mapped virtual memory" climb to what looks like the process's stable level of 2 GB, then crash, and climb again. This might be aggravated by periods of updating data that is distributed over your whole data set, growing the writable working set.

Because Windows requires that most pages of RAM be backed by a page on disk, you can actually run out of swap space during periods when there is little actual swapping occurring. If this hypothesis is correct, doubling the page file size should ameliorate the issue. Is that an experiment you can safely conduct in your environment?

Comment by Daniel Pasette (Inactive) [ 28/Jun/13 ]

This should have been fixed with associated fixes around SERVER-5663.
The mongod is shutting down by design when it hits this though, see: SERVER-6132.
