[SERVER-10044] Repeated Server Crash. VirtualProtect failure in mongo::makeChunkWritable (Windows 7 x64). Created: 27/Jun/13 Updated: 19/Sep/15 Resolved: 07/May/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 2.4.6, 2.6.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Maxime Beucher | Assignee: | Geert Bosch |
| Resolution: | Won't Fix | Votes: | 18 |
| Labels: | VirtualProtect, Windows, crash, makeChunkWritable, server | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
MongoDB 2.4.4.2008+, Windows 7 64Bits, 8GB and 16GB RAM, SSD 500Go, i5-3550 CPU (3.30GHz). |
||
| Issue Links: |
|
||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||
| Operating System: | Windows | ||||||||||||||||||||
| Steps To Reproduce: | Standard usage of the server with our configuration. Happens every days. |
||||||||||||||||||||
| Sprint: | Quint Iteration 3.1.2, Quint Iteration 3 | ||||||||||||||||||||
| Participants: |
Andrei Saprykin, AndrewK, Andy Schwerin, Ben Carrier [X], Chad Kreimendahl, Christos Manios, Daniel Pasette, Geert Bosch, Githook User, ievgen, Jacques De Schepper, Mark Benvenuto, Mathias Stearn, Maxime Beucher, Nathan Arthur, Nick Judson, Rahim Alizada, Ramon Fernandez Marina, sam Jones, Sergio Jiménez Benito, Tad Marshall, Yan, Yury Michurin
|
||||||||||||||||||||
| Description |
|
Our mongo servers keep crashing several times a day, often when the RAM usage on the server is high but not always. Our production server sometimes crash 6 times a day. This issue has been happening on two different hosts (a hardware cause is unlikely). We are monitoring our servers using MMS (We could provide a temporary user account if necessary). Here are some example of logs we encountered:
|
| Comments |
| Comment by Geert Bosch [ 11/May/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Actually you will be able to use the Azure instances, but you just need to specify a secondary pagefile on the remote volume. With Windows you can have multiple page files, and it is fine to have pagefiles on slow volumes. Windows will prefer using the faster files, and MongoDB will never actually need the pagefile in normal operation. Windows just requires it to be present in order for MongoDB to memory map files for writing. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nathan Arthur [ 08/May/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Geert, Thanks for the detailed update on this. It helps to have a better understanding of it. Our production systems (2.6.x) are routinely affected by this. So far we've already figured out to set the min and max page file size to the same thing, and we have empirically found that 16,000 MB minimizes the risk. (Based on earlier comments, we assumed that bigger wouldn't help.) But it still happens. So reading your comments - our system nearly has the behavior for which you suggest the fix might be useful (single thread / lots of small writes / different memory regions). We don't actually use a single thread, but we do all our work in small batches, and we write to many different collections concurrently. (It also seems like this problem tends to happen when there are large queries running - have you seen that?) The point is that blocking all our threads at once would be fine; the entire batch would just end up taking longer. Our system wouldn't create the "more active write operations" scenario. On the other hand, we do tend to see other memory-intensive services on the host also die when this happens, so maybe your fourth bullet is really the major issue. Still, though - can you reconsider whether to make this fix, maybe even if it had to be enabled explicitly somehow? Right now, we just have to admit to our customers that we chose a database that won't reliably stay up on Windows. That's a hard thing to say, and it certainly puts our customers off from MongoDB, as well. It would be much better if the operation just failed - we already have automatic recovery code for that type of thing. Then, looking at the Production Notes, they suggest a 32GB (or larger) page file. We do most of our hosting on Azure, and we've been experimenting with some of their new instances that provide higher disk performance - but the "D" drive (i.e. the locally-attached ephemeral storage) on those instances if very small - less than 32GB. So if we really do have to have at least a 32GB page file, that means we either have to put it on a slow (i.e. remote) volume, or that we can't use that type of instance. So here the message becomes "we can't use that type of environment because MongoDB prevents it". Finally, it seems like the underlying issue here is essentially a limitation of Windows (vs. Linux). There's nothing in the system that actually uses all that allocated storage, and yet we still have to allocate it to make it work. Have you done anything to push on Microsoft to change the limitation, in the long term? It would be nice to have some hope that eventually this can be truly fixed. Thank you! | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Geert Bosch [ 27/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The fix in We investigated the possibility to retry operations and recover from such errors after any dirty pages have been journaled and files have been remapped. This approach turned out to have a few fundamental problems. The real fix is to follow the newly added recommendations from the Production Notes on configuring the pagefile. Retrying has the following issues:
So, the case in which retrying with WCE is an actual solution is were there is a single thread with lots of small writes in many different memory regions. The downsides for the more realistic cases with multiple clients are such that they outweigh the limited benefits. In all cases, provisioning larger pagefiles and, most importantly, avoiding dynamic resizing, is a far better solution. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 23/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}Message: Remove Remove writingPtr(), writing() and declareWriteIntent(). These | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 23/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}Message: Revert " This reverts commit 878f4e1a3138ddfc0d7226e003bd48e56c8bbf8a. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 22/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}Message: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 22/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'GeertBosch', u'name': u'Geert Bosch', u'email': u'geert@mongodb.com'}Message: Remove Remove writingPtr(), writing() and declareWriteIntent(). These | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Geert Bosch [ 17/Apr/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I have verified that the initial page file size is important, as calls that allocate virtual memory will fail with ERROR_COMMITMENT_LIMIT (1455) while the page file is being expanded. So, in the common and default case where the initial page file is far smaller than its maximum size, the error is transient and just waiting makes sense.
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andy Schwerin [ 20/Mar/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Addressing the VirtualProtect failures may be possible in 3.0 and master by allow the individual write operations to fail via an exception. We can soak test this by adding a fail point to makeChunkWritable, much as we've done with WriteConflictException in WiredTiger. Actually, we might just be able to use WriteConflictException for this. No fix is possible on 2.6 or 2.4, I'm afraid. It is only the storage engine work from 3.0 that makes any solution tractable. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 20/Mar/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
@Andrei Saprykin, what happens if you put your code in a retry loop - say retry X times after sleeping for Y? I'm not a C/C++ programmer otherwise I'd try it myself. Just wondering how long it takes for the virtual memory to be allocated. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Saprykin [ 20/Mar/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
https://github.com/mongodb/mongo/blob/eaa338b6668e0d7e1a9e1b7f20f04c2f33815d3c/src/mongo/db/storage/mmap_v1/aligned_builder.cpp calls Windows VirtualAlloc function I experimented with this function a little bit using the following code ( tries to allocate 36GB ) long long* array; LPVOID base = VirtualAlloc( NULL , nBytes , MEM_COMMIT | MEM_RESERVE , PAGE_READWRITE ); if ( !base ) { cout << "Virtual allocation is failed\n"; cout << "Error: " << GetLastError(); return 1; }It was tested on the system having 32GB RAM + 3GB (min) - 8GB(max) page file. I expected that I could allocated something close to 40GB without problems. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 18/Mar/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I also experienced this under MMapV1 with a system-set minimum swap size. Since increasing my swap, I haven't seen it. I've since switched to wiredTiger and haven't had the problem since (without needing to change vm settings). I'd recommend, if possible, trying that out. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by sam Jones [ 18/Mar/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Another day, another dead mongodb server.... lost data. This is very old, folks... | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by ievgen [ 17/Mar/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I can reproduce this issue only when smallfiles is set to true. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rahim Alizada [ 17/Mar/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Update: 3.0 with mmapv1 is affected too | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 17/Mar/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I am truly disappointed by the lack of attention this issue has received. If the Windows build had any credibility within the team, this issue would have been addressed by now as it's critical to the general stability of the product and seems to be affecting a good number of people. It's been marked as Critical for almost 2 years with nothing more than lip service being paid during that time. Mark Benvenuto, Andy Schwerin, Mathias Stearn, Ramon Fernandez et al: Gentlemen, there comes a time where you either need to pony up the time to address things like this, or publicly state "we don't intend fixing it" and close it - and then put a bit fat sticker on the download page alongside the Windows links saying "may cause indigestion / is not production ready". I understand the fact that at the end of the day, Mongo is free, but please don't just leave tickets languishing like this. It does no good to your community. Address them or close them. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Rahim Alizada [ 26/Feb/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I have been experiencing this issue on my dev machine all day while testing my app. Win 8.1 x64, physical machine (not a VM) I have a collection with approx 70_000 entries with binary data fields (image data, 150-300Kb per each entry), no additional indexes, the _id filed is a String. Here are the operations that cause exception in my case:
Same on 2.6.8 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yan [ 26/Jan/15 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Reproduced on 2.6.7, Win Server 2008 Ent, x64 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by ievgen [ 06/Dec/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
When I set 16 gig as the initial and maximum size of the swap it never crashing again. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 05/Dec/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Even increasing the swap size only delayed the error. It appears that high-volume concurrent bulk inserts causes this in my case. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Nick Judson [ 05/Dec/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I get this same error windows 8.1 SSD (no VM) on 2.6.4. 2.8rc1 panics and becomes unusable. As mentioned above, I would expect mongo to slow down, but not crash. I can repro this consistently within 30 seconds of starting mongod. I'm about to try changing the swap settings. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Saprykin [ 25/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Question to levgen. When you increased swap size did you increase initial or max? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by ievgen [ 25/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
My server is VM with Windows Server 2008 R2, has 4 Gig RAM and custom swap size (initial 6142, max 16384). My application restores data from a backup (makes mass removal and insertion). This error from the production system: MongoDB v2.6.5 - http://pastebin.com/raw.php?i=zjCbgTyP As requested in this message: https://jira.mongodb.org/browse/SERVER-10044?focusedCommentId=735943&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-735943 | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by sam Jones [ 18/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This has been hitting me over an over. 2014-09-20T22:32:52.482-0500 [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2 This is killing us. We spend hours a month, and we have lost data. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Saprykin [ 18/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, I have the same problem running MongoDB 2.6.4. After mongod death with: Workload: massive insert each 10 min. Massive delete each 10 min shifted by 7 min from insert. Occasional long time running querries. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergio Jiménez Benito [ 13/Nov/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, I have prepared a non virtualized 2-node replica set with the following characteristics: Intel Xeon E5-2407 (2.20Ghz) I still have the same problem after 24-48 hours 2014-11-07T13:12:02.061Z [repl writer worker 13] VirtualProtect for D:/MongoDB/Data/XXX/XXX.190 chunk 16679 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is 1049c000000) in mongo::makeChunkWritable, terminating | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Chad Kreimendahl [ 08/Oct/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
It would be interesting to know what this impacts windows specifically and not unix systems. We've got both, and in both we typically provide substantial physical memory and shut off swap almost entirely... and in unix set the swappiness to 1. Why would a swap-less unix system not do the same thing, given the same hardware and databases? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergio Jiménez Benito [ 08/Oct/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi! I have the same problem since few months ago with a virtual machine. I prepared a physical machine to try to resolve the problem but the database is still failing after working during some time. The virtual memory usage grows until MongoDB process dies. http://s15.postimg.org/dyziwaknf/Memory.png PD: I am not using SDDs. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ben Carrier [X] [ 08/Oct/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
@AndrewK we are using SSDs. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 08/Oct/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
@Ben, are you using SSDs? I ask only because I've never seen this issue on non-virtualised environments (which have always had SSDs) - even when the swap file has been seriously constrained. Not to say that means it shouldn't happen, but might shed more light on where the issue is. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ben Carrier [X] [ 08/Oct/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Ramon, I reported in an earlier comment that this happens to us as well. We can report it happens in Windows 7 and Windows Server 2012 on physical boxes. Not virtualised systems. So your assumption is wrong that it only happens on virtualised systems. Some haste given to this Major issue would be greatly appreciated. I have been following this bug for quite a while now and all you've done is add some extra error messages. Best regards, Ben | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by sam Jones [ 08/Oct/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Ramon and team, Allow me to make a small "from a customer" observation. I can handle it if Mongo slows down. I can handle it (in my application) if Mongo drops some writes because it is too busy (if this were to happen, it would be nice if it were logged). I can handle it if Mongo feels it needs more resources than I have given it (please give a detailed log entry explaining why resources are thin, what element is consuming the most RAM,... help me help myself). But I can't handle a persistence layer that just.... dies. Nothing in my stack dies. Not mysql. Not sql server. Not lucene. Not elasticsearch. Not IIS. Not CentOS. Not Ubuntu. Not Windows. Not Ncache. Not redis. Not VMWare. But mongo... it dies. And look at it from my chair. My old mongo server (running whatever was current a few years ago) ran in a VM on a single hard drive (not even mirrored). It actually corrupted the db once in awhile (which, in my app, I didn't care about), but it never went down. Now I run the current release, on a VM (everything I have is a VM, please don't ask for metal, I don't have any!), on a big RAID10 spindle set, with triple the RAM I had before (and, admittedly, a DB that is 2x or 3x larger), and Mongo dies. Mongo is fast now (it used to be horribly slow). But sometimes it dies. When I first set up this server, it was dying within the first hours. Then, with a large paging file, it only dies every couple of months. And when it dies, some sysadmin gets an alert, and goes digging around to deal with it, and an incident report gets written. And 10 or more people know "Mongo terminated for no – to us – sensible reason, but it is running again now." Thank you for understanding. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 07/Oct/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
All, as explained above by Mathias and Mark, this issue is most likely caused by a page file that's not sufficiently large to hold all the pending writes in the current working set. Since this issue is only observable in virtual machines so far, it is very possible that the source of the problem is that growing the page file to accommodate mongod's memory needs takes too long on virtualized environments, thus prompting the OS to kill applications consuming large amounts of memory. Increasing the size of the page file should help as long as the new size can hold the working set, so the page file does not need to be grown at any time. Lowering the value of journalCommitInterval or storage.journal.commitIntervalMs will help reduce the amount of memory the OS must set aside for dirty memory pages. Starting on version 2.7.1, mongod logs additional information when killed by the operating system on out-of-memory conditions. If someone can reproduce the issue on a testing environment with 2.7.1 the additional logs may help clarify the situation. Regards, | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by sam Jones [ 05/Oct/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We had this happen just now: Not pretty! >>>>>>>>>>> | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 24/Jul/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Ok - was hoping you were going to say much less RAM - as that is the only way I've been able to repro this, with about 4GB. As I said before, we're back non-VM and haven't had issues. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Christos Manios [ 24/Jul/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
@AndrewK - Our virtual machines run with 10GB of RAM with 5 CPU, in Windows 2012 R2 64bit using Hyper V as virtualisation environment. MongoDB is configured in replica set mode with GridFS enabled. Pagefile size is currently set to "system managed" and has allocated 30482MB of HDD. Also I noticed the following while observing memory usage via SysInternals Process Explorer (procexp.exe): | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 24/Jul/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
@christos - how much RAM did this VM have? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Christos Manios [ 24/Jul/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I experience the same problem running MongoDB 2.6.0 on Windows 2012 R2 host using Hyper V. 1. Set pagefile to custom size. The error shown in mongo.log is the following: ***aborting after fassert() failure | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ben Carrier [X] [ 05/Jun/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We are experiencing the same issue: [conn6550] VirtualProtect for C:/data/db/SignalDB.4 chunk 4149 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is 40d4000000) in mongo::makeChunkWritable, terminating Our issue is experiences when running MongoDB 2.4.8 and Windows 7. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 22/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'markbenvenuto', u'name': u'Mark Benvenuto', u'email': u'mark.benvenuto@mongodb.com'}Message: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 21/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks for the info Mark. A few things: 1. increasing the pagefile, massively in some cases, has not resolved the issue for us during your tests: | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Mark Benvenuto [ 14/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I was able to repro your issue on my Windows 8.1 machine. I created a test case that emulates MongoD behavior by trying to create N Read-Only 256 MB memory mapped files in memory, and then trying to call VirtualProtect to mark them as Read Write. This triggered the same 1455 error, and I was able to investigate the GlobalMemoryStatusEx for page file free memory, and Windows reported 227 MB which was not enough to fit a 256 MB file. I would recommend a few things:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergio Jiménez Benito [ 12/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The database stopped again this weekend (10/05/2014 15:00). These are the MMS graphs: http://oi57.tinypic.com/2yyw682.jpg I just activate logman to register the next stop. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jacques De Schepper [ 12/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
If this may help you, I will relay our experiences with this exact same issue. We experienced this on a customer install (VMWare with Windows Server 2008 R2 Enterprise). The machine had tons of disk, though only 8GB of RAM. The Virtual Memory settings were custom set to:
The problem kept occurring daily. Reading up a bit it was conjectured that the error might occur while resizing the page file. So I changed the VM settings to:
And we haven't had the problem since even though the database has grown quite a bit since. We locally run a test setup to simulate their configuration, never had a problem there... The local setups doesn't run a VM though. Which makes me believe the crash occurs (as suggested somewhere earlier in this thread) when the system is temporarily refusing allocation because he is resizing the pagefile. I am by no means an expert on Windows OS, but that's the behaviour we observed. If you want additional info or want me to run additional tests, feel free to let me know. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Mark Benvenuto [ 09/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
My running hypothesis is still that your are running out of memory, and need to increase the physical memory and/or page file size on the mahcine.
This will create a circular 10MB log file that will capture memory info, and hopefully give me more information into what is happening. Send me the log file after mongod.exe crashes, and it should show the memory usage when it crashes. The log file will never grow about 10MB in this case. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergio Jiménez Benito [ 09/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
That is all. I don't understand why it doesn't return all values. I am using Windows Server 2008 R2. This is the server configuration: http://oi60.tinypic.com/2vxf7rk.jpg | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 09/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Serigo, you're missing values in that output. Can you rerun the script? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergio Jiménez Benito [ 09/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Physical Mem: 5119 MB I'm using MMS. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 07/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We ran this on a VM just after the process had crashed (see below for output from a non-crashed node). Also, these crashes only ever happen on a secondary - never a primary. Secondary shortly after crash: PS C:\Users\Administrator> $server = "."; PS C:\Users\Administrator> $PF = gwmi -computer $server Win32_PageFileUsage ------------- Secondary which has been active for a while: Windows PowerShell PS C:\Users\Administrator> $server = "."; PS C:\Users\Administrator> $PF = gwmi -computer $server Win32_PageFileUsage | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Mark Benvenuto [ 06/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Can you run the following script in powershell, and send me the output:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 06/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sergio, are you running gridfs? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 06/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sorry, more info, page file has been various sizes up to 20GB, physical was 200GB (under the VM) and then 800GB on the host. Database was about 140GB if I remember correctly. RAM was 4GB. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 06/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Mark, increasing the page file has no effect. As per the comments we've got up to 10GB and since then we've tried 20GB. It doesn't fix the problem. Again, I've run systems where the size of the paging file has been well under 10GB without issues - but this was not on a VM. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Mark Benvenuto [ 06/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
What is the size of your page file, your physical ra, and your database? You should be able to fix this by simply increasing the size of your page file in this case. http://windows.microsoft.com/en-us/windows/change-virtual-memory-size#1TC=windows-7 Background
See FILE_MAP_COPY in http://msdn.microsoft.com/en-us/library/windows/desktop/aa366761(v=vs.85).aspx | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 06/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Bad news, we were using Hyper-V | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergio Jiménez Benito [ 06/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
What virtualization softare were using? I'm going to try luck with Hyper-V over Windows Server 2012 as a last option. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 06/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Sergio, we've resorted to running outside of VMs because mongod.exe simply wasn't stable. All issues are now resolved. It's highly frustrating that this hasn't been resolved as running outside of VMs has real cost and management implications. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergio Jiménez Benito [ 05/May/14 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi AndrewK, I have exactly the same problem with a Windows Server 2008 R2 x64 running on a VMWare ESXi cluster. Sat May 03 07:55:46.758 [conn171428] VirtualProtect for D:/db/MOURA_RT/MOURA_RT.34 chunk 4715 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 19660800, address is 49aed40000) in mongo::makeChunkWritable, terminating | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 08/Nov/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Mathias, thanks for the details. How many private views are active at a time? What is the recommended page file size in this case? I've been running mongodb since 2009 with numerous configurations and database sizes (GB to TB) and have never had this issue before. The only difference is that this cluster's nodes are running in VMs. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Mathias Stearn [ 07/Nov/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
andrewk, Mongod asks windows for a "private view" of portions of files that it is writing to. This is intended to allow us to make changes in memory without the risk that any intermediate state will be written to the file before we have written that change to the journal. On windows, this means that the OS needs to reserve enough room to page out the entire view (a subset of the file) to the swap file. Linux is less strict about that due to the ability to over-commit memory with the associated advantages and disadvantages. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 07/Nov/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
@Maxime.Beucher - were you running in a VM guest when you had this issue originally? we have had more crashes across a number of instances in our replica set where have a 10GB page file. at the time of the crash, we see the following in the event logs: This is logged at the exact same time as the crash:
This is logged 2 seconds later:
There is nothing else running on these boxes and at the time of the crash, the largest memory consuming process had 42mb of RAM. It appears Mongo is asking the OS for more RAM than is physically available and thus getting paged - which is silly since the memory it needs is mapped to file - and thus paged already, no? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 17/Oct/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Dan, we've had another crash. What should I be looking at in VMMap - remembering that you need a running process to see what is going on. Mongo can run for a days before it crashes which makes it hard to catch just before it goes down. Also, we did increase the page file significantly. What's odd is that the other replica members have the same settings and are not crashing. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Pasette (Inactive) [ 28/Aug/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi andrewk, as mentioned earlier, have you tried increasing your page file size or looked at VMMap to determine what is causing high page file consumption? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 27/Aug/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Andy - can you please comment on where you are with this bug? thanks! | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 22/Aug/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We've had another crash. Please can someone respond to this as it's now making things rather unstable. Thu Aug 22 21:21:44.323 [repl writer worker 3] VirtualProtect for C:/xxxx/data/xxxx.8 chunk 9982 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 32374784, address is 9bfa120000) in mongo::makeChunkWritable, terminating ***aborting after fassert() failure | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 12/Aug/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I've resynced the replica from scratch and there have been no issues so far. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 12/Aug/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We're still seeing almost instant crashes even with a larger paging file. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by AndrewK [ 12/Aug/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I'm getting the same issue in a replica set. Windows 2012 Standard, 8GB RAM, running as a VM under Hyper-V. MongoDB 2.4.5 as a secondary in a replica set. Machine has 4GB RAM (yes I know this is constrained) with an OS managed swapfile - which for some reason was only 786MB. This has now been set to 4GB and we'll see if there is a another crash. Mon Aug 12 09:00:10.301 [repl writer worker 1] VirtualProtect for xxxxxxxxxxx.18 chunk 10873 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is a9e4000000) in mongo::makeChunkWritable, terminating ***aborting after fassert() failure | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Maxime Beucher [ 18/Jul/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
It all makes sense to me now, thanks Tad. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Tad Marshall [ 17/Jul/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
When journaling is enabled, mongod maps each database file into memory twice. Initially, these two mapped memory ranges are identical and do not require page file space because they are mapped to an existing disk file. When writes to the database take place, one of the mapped memory ranges is written to and the memory pages are copied and made private. On Linux, the "private view" is implicitly copy-on-write. On Windows, the "private view" is selectively made writable as pages are written to; this is the VirtualProtect call made by makeChunkWritable in the stack trace above. These pages about-to-be-modified require page file space. When the modified pages are later flushed to disk, the database file is remapped and the page file space is released to the system. On Windows, you can use the VMMap program from Sysinternals to view the virtual memory of the mongod.exe process and see what type of memory usage is contributing to high page file consumption. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Maxime Beucher [ 17/Jul/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I was referring to the windows requirement. The fact that every page of RAM needs to be backed by a file on disk is understandable. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yury Michurin [ 17/Jul/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Last night our server crashed with the same error, the machine is Win2008 R2 x64 with 8G ram running mongodb 2.4.1 for a month or so and that's the first time it crashes, maybe related: recently the machine is a bit low on disk, has ~10GB free disk space. I'll try the suggestion doubling the page file's size. From the service.log: Tue Jul 16 21:34:07.381 [conn8879732] VirtualProtect for C:/data/db/SMB_Statistics.8 chunk 11417 failed with errno:1455 The paging file is too small for this operation to complete. (chunk size is 67108864, address is b264000000) in mongo::makeChunkWritable, terminating ***aborting after fassert() failure | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andy Schwerin [ 05/Jul/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Are you referring to the Windows requirement that every page of memory have identified storage space on disk, or to the volume of memory required by MongoDB? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Maxime Beucher [ 05/Jul/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
So far, so good. We didn't experience a single crash since we doubled the page file size on Monday. Thanks. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Maxime Beucher [ 01/Jul/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Alright, we'll try to double the page file size. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Maxime Beucher [ 01/Jul/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Alright, we'll try to double the page file size. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andy Schwerin [ 28/Jun/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Inspecting the "non-mapped virtual memory" graph for the host in question during the June 27 period of frequent crashing, from 6:00 GMT to 15:00 GMT, there is evidence of page file exhaustion. Using the 5-minute data view, which will only be available for another several hours for the time period in question, one can watch the "non-mapped virtual memory" climb to what looks like the process's stable level of 2 GB, then crash, and climb again. This might be aggravated by periods of updating data that is distributed over your whole data set, growing the writable working set. Because Windows requires most pages of RAM be backed by a page on disk, you can actually run out of swap space during periods where there is little actual swapping occurring. If this hypothesis is correct, doubling the page file size should ameliorate the issue. Is that an experiment you can safely conduct in your environment? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Daniel Pasette (Inactive) [ 28/Jun/13 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
This should have been fixed with associated fixes around |