[SERVER-5244] core suite fails with "not enough storage" error - Windows 32 bit Created: 07/Mar/12 Updated: 11/Jul/16 Resolved: 01/May/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | 2.1.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ian Whalen (Inactive) | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 0 |
| Labels: | buildbot | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Nightly Windows 32-bit |
||
| Operating System: | Windows | ||||||||||||||||||||
| Description |
http://buildbot.mongodb.org/builders/Nightly%20Windows%2032-bit/builds/791/steps/test_1/logs/stdio |
| Comments |
| Comment by auto [ 30/Apr/12 ] | ||||||||
|
Author: Eric Milkie (login: milkie, email: milkie@10gen.com) Message: The 32-bit Windows builder runs out of virtual address space | ||||||||
| Comment by Ian Whalen (Inactive) [ 23/Apr/12 ] | ||||||||
|
Appears that this problem is back: http://buildbot.mongodb.org/builders/Nightly%20Windows%2032-bit/builds/838/steps/test_1/logs/stdio | ||||||||
| Comment by Eric Milkie [ 29/Mar/12 ] | ||||||||
|
The core suite is now passing. I just made sure that the tests with larger datasets didn't leave anything behind. | ||||||||
| Comment by auto [ 28/Mar/12 ] | ||||||||
|
Author: Eric Milkie (login: milkie, email: milkie@10gen.com) Message: | ||||||||
| Comment by auto [ 26/Mar/12 ] | ||||||||
|
Author: Eric Milkie (login: milkie, email: milkie@10gen.com) Message: | ||||||||
| Comment by Eric Milkie [ 26/Mar/12 ] | ||||||||
|
I watched a run of smokeJS with VMMap. It fails when it attempts to map in a 5th database file for db "test":
As VMMap shows, the 5th file alone consumes half a gigabyte of virtual address space. Combined with the rest of the files, it's no wonder we are hitting this error on the 32-bit build. I don't know why this isn't problematic on Linux or OS X (perhaps we are getting close to running out there as well?). Should we drop the "test" database after each js script is run for smokeJS? | ||||||||
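To put the "half a gig" figure in context, here is a rough back-of-the-envelope sketch of how quickly a handful of data files can eat a 2 GB user address space. The thread states only that the 5th file takes about half a gigabyte; the 64 MB first file, the doubling rule and the 512 MB cap used below are assumptions chosen to match that figure, and `data_file_sizes` is a hypothetical helper, not a server function.

```python
MB = 1024 * 1024


def data_file_sizes(count, first=64 * MB, cap=512 * MB):
    """Sizes of the first `count` data files, doubling from `first` up to `cap`.

    The defaults are illustrative assumptions, not values taken from the ticket.
    """
    sizes = []
    size = first
    for _ in range(count):
        sizes.append(min(size, cap))
        size *= 2
    return sizes


sizes = data_file_sizes(5)
total = sum(sizes)
for number, size in enumerate(sizes, start=1):
    print(f"file #{number}: {size // MB} MB")
print(f"total mapped: {total // MB} MB of a 2048 MB user address space")
```

Under these assumptions the five mappings already need well over half of the user address space, before counting DLLs, heaps and thread stacks.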
| Comment by Ian Whalen (Inactive) [ 26/Mar/12 ] | ||||||||
|
Nightly 32-bit build still failing on the MapViewOfFile issue: http://buildbot.mongodb.org/builders/Nightly%20Windows%2032-bit/builds/807/steps/test_1/logs/stdio | ||||||||
| Comment by Eric Milkie [ 12/Mar/12 ] | ||||||||
|
I think all the collections are being dropped when we're done with them, but I could be mistaken. I think we're just getting hit by memory fragmentation. I've now hacked up the tests enough to make all of the core tests pass again on 32-bit Windows. | ||||||||
| Comment by auto [ 12/Mar/12 ] | ||||||||
|
Author: Eric Milkie (login: milkie, email: milkie@10gen.com) Message: | ||||||||
| Comment by auto [ 10/Mar/12 ] | ||||||||
|
Author: Eric Milkie (login: milkie, email: milkie@10gen.com) Message: | ||||||||
| Comment by Tad Marshall [ 09/Mar/12 ] | ||||||||
|
I like your idea of improving the handling of a MapViewOfFile() failure. It's hard to say exactly what the "best" improvement would be, but the current tactic of reporting the error and failing the immediate action is not good enough. If an exception returned control to a place where things were cleaned up properly, we could keep running, and a dropDatabase() on the database that triggered the exception might allow new database files to be mapped.

I spotted three places where we use this API: 1) util/mmap_win.cpp, MemoryMappedFile::createReadOnlyMap(), line 71; I think the third one is the one used for journaling.

We should definitely not be returning bad data ... it would be better to do a fatal shutdown than let that happen.

For the core suite, are we dropping collections when we are done with them, and is the space being reused by later tests? If collections are being dropped and the space is not being reused, is that another bug we need to look at? If they are not being dropped, maybe we should drop them; otherwise we have inter-test dependencies, and behavior that changes when tests are added, removed, moved or renamed. | ||||||||
| Comment by Eric Milkie [ 09/Mar/12 ] | ||||||||
|
We're further along now: we make it to the u's before it runs out of storage. Should we consider throwing an exception when MapViewOfFile() fails? Right now we just blindly continue after logging what happened, which makes me nervous. Note that the unit test that is now failing is actually failing due to unexpected data! This could result in queries returning wrong answers. | ||||||||
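The "throw instead of continuing" idea can be sketched in a few lines. This is a language-neutral illustration in Python, not the server's C++ code: `MapFailure`, `map_file_or_raise`, `map_all` and the `mapper` callable are all hypothetical names. The only fact borrowed from Windows is that MapViewOfFile() signals failure by returning NULL, which the stand-in `mapper` models by returning `None`.

```python
class MapFailure(Exception):
    """Raised when a data file cannot be mapped into the address space."""


def map_file_or_raise(mapper, path, length):
    """Map `path`, turning a silent failure (None) into an exception."""
    view = mapper(path, length)  # hypothetical mapper: returns None on failure
    if view is None:
        raise MapFailure(f"could not map {length} bytes of {path}")
    return view


def map_all(mapper, paths, length):
    """Map every file, or release what was mapped so far and re-raise."""
    views = []
    try:
        for path in paths:
            views.append(map_file_or_raise(mapper, path, length))
    except MapFailure:
        views.clear()  # stand-in for unmapping the views already created
        raise
    return views


def make_budget_mapper(budget):
    """Fake mapper that fails once `budget` bytes of 'address space' are used."""
    remaining = [budget]

    def mapper(path, length):
        if length > remaining[0]:
            return None
        remaining[0] -= length
        return (path, length)

    return mapper
```

The point of the design is that the caller can no longer read from a view that was never created; it either gets a valid mapping or has to handle `MapFailure` at a place where cleanup is possible.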
| Comment by auto [ 09/Mar/12 ] | ||||||||
|
Author: Eric Milkie (login: milkie, email: milkie@10gen.com) Message: | ||||||||
| Comment by Tad Marshall [ 09/Mar/12 ] | ||||||||
|
Assigning to Eric since he is working on it; reassign to me if you need to. Thanks! | ||||||||
| Comment by auto [ 09/Mar/12 ] | ||||||||
|
Author: Eric Milkie (login: milkie, email: milkie@10gen.com) Message: This is in hopes of making all the core js tests pass on Windows 32-bit | ||||||||
| Comment by Tad Marshall [ 09/Mar/12 ] | ||||||||
|
Thinking about the problem some more, I suspect that what is happening is that we do not have enough contiguous virtual address space in the 32-bit mongod.exe process to map the next file in the test database set. The failure is on test.5, so we already have test.0 through test.4, but the push2.js test is trying to create a BSON object that is too large and so keeps creating bigger and bigger objects until it gets a failure. When this causes mongod.exe to need a new extent in a new file, it tries to create the file and map it, and there isn't a block of contiguous address space big enough to hold the mapping.

The big issue with 32-bit processes is not simply "memory", but address space. Windows reserves the top half of the address space for the kernel, so there is only 2 GB of address space available for user processes. But all of the DLLs used by a process, and a whole bunch that may not even be used, are mapped into the user's half of the address space, and they are not necessarily placed optimally. If you look at your address space with VMMap or vadump.exe you can see that they are scattered around, and all of our memory mapped files have to fit into whatever contiguous blocks of address space are available. Even with lots of unused address space, there may not be a single contiguous block of address space large enough to hold a new memory mapped file.

I think that the right thing to do now is to disable the push2.js test for 32-bit Windows. pushall.js is not the problem and it should pass on 32-bit Windows as-is if push2.js is not run. | ||||||||
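The difference between "free address space" and "contiguous free address space" is easy to show with a toy model. The sketch below is not Windows or MongoDB code, and the region layout is invented for illustration: seven hypothetical 20 MB regions (think DLLs or heaps) scattered every 300 MB across a 2 GB user address space.

```python
MB = 1024 * 1024
USER_SPACE = 2048 * MB  # 2 GB user half of a 32-bit address space


def free_gaps(used, limit=USER_SPACE):
    """Return (start, size) for each free gap, given used (start, size) regions."""
    gaps = []
    cursor = 0
    for start, size in sorted(used):
        if start > cursor:
            gaps.append((cursor, start - cursor))
        cursor = max(cursor, start + size)
    if cursor < limit:
        gaps.append((cursor, limit - cursor))
    return gaps


def can_map(used, request, limit=USER_SPACE):
    """A mapping needs one gap at least as large as the whole request."""
    return any(size >= request for _, size in free_gaps(used, limit))


# Hypothetical layout: a 20 MB region every 300 MB.
used = [(i * 300 * MB, 20 * MB) for i in range(7)]
gaps = free_gaps(used)
total_free = sum(size for _, size in gaps)
largest = max(size for _, size in gaps)
print(f"free: {total_free // MB} MB, largest contiguous block: {largest // MB} MB")
print("can map 512 MB:", can_map(used, 512 * MB))
```

In this made-up layout most of the address space is free, yet no single gap can hold a 512 MB file, which is the failure mode described above.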
| Comment by auto [ 08/Mar/12 ] | ||||||||
|
Author: Tad Marshall (login: tadmarshall, email: tad@10gen.com) Message: Remove the diagnostic logging I added for trying to debug this issue. | ||||||||
| Comment by Tad Marshall [ 08/Mar/12 ] | ||||||||
|
We do not seem to be out of memory when the MapViewOfFile() API fails. Total mapped is 384 MB when we start push2.js and 768 MB after push2.js when we start pushall.js. We are barely touching the page file with 77 MB of it used.

It is somewhat possible that the test against BSON size of 16 MB isn't working but that seems unlikely given that there are far fewer slow updates logged before the failing case than we get past in the succeeding case. The failing case displays "info DFM::findAll(): extent 2:6787000 was empty, skipping ahead. ns:test.push2" 7 times before the failure, but the succeeding case prints it 8 times.

The push2.js test kicks virtual memory usage from 1148 MB up to 1533 MB and on a machine with 1738 MB RAM that's interesting, but in theory it shouldn't break Windows APIs. Next step may be to add diagnostics to the MapViewOfFile() failure and see what's going on there.

We may be getting stuck on a slow disk subsystem: watching the code run by Remote Desktop into the AWS instance, CPU usage spends most of its time in the single digits ... we are waiting for the disk almost all the time. If waiting for memory-mapped file I/O can give a MapViewOfFile() error, maybe sleeping and retrying would get us past the error. Not solved yet. | ||||||||
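The "sleep and retry" idea floated here could look roughly like the sketch below. `retry` and its parameters are hypothetical, and the `operation` callable stands in for whatever performs the mapping and returns `None` on failure. Note that the later comments in this thread attribute the failure to address-space exhaustion, which a retry of this kind would not cure; it only helps if the failure is transient.

```python
import time


def retry(operation, attempts=3, delay=0.5, sleep=time.sleep):
    """Call operation() until it returns something other than None.

    Sleeps `delay` seconds between attempts and returns None if every
    attempt fails. `sleep` is injectable so callers can avoid real waits.
    """
    for attempt in range(1, attempts + 1):
        result = operation()
        if result is not None:
            return result
        if attempt < attempts:
            sleep(delay)
    return None
```

A bounded number of attempts matters: an unbounded loop would hang the build instead of failing it when the underlying condition is permanent.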
| Comment by Tad Marshall [ 08/Mar/12 ] | ||||||||
|
It seems like something earlier in the tests must have put us in a bad state. I can log into the BuildBot machine and run pushall.js by hand and it works fine.

Microsoft TechNet says that "Not enough storage is available to process this command" could be memory, page file, or disk space; Google suggests it could be lack of IRP (I/O request packet) stack space. I raised the page file to 5 GB and added diagnostics (db.hostInfo, db.serverStatus and db.stats) to see how bad the memory and test database size look on the next run.

We're hitting test.5 when the MapViewOfFile() fails while pushall.js is pushing almost nothing, so it's not the pushing itself; it's the size of the memory-mapped database that's killing us. Also, push2.js is failing with the same MapViewOfFile() error, but the test doesn't distinguish between the desired BSON size error and a Windows API failing. | ||||||||
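One way a test could tell the error it wants from an environment failure is to check the message before accepting it. The sketch below is illustrative Python rather than the actual JavaScript test; `EnvironmentFailure` and `check_expected_failure` are hypothetical names, and the `expected_marker` text is whatever the test author expects, since the exact BSON size error message is not given in this ticket. The two environment markers are the strings quoted in this comment.

```python
# Substrings that indicate the platform failed, not the operation under test.
ENVIRONMENT_MARKERS = ("MapViewOfFile", "Not enough storage is available")


class EnvironmentFailure(Exception):
    """The error came from the OS or build machine, not from the behavior under test."""


def check_expected_failure(message, expected_marker):
    """Return True only if `message` is the error the test was waiting for.

    Raises EnvironmentFailure if the message looks like a platform problem,
    so the test fails loudly instead of treating it as the desired outcome.
    """
    lowered = message.lower()
    for marker in ENVIRONMENT_MARKERS:
        if marker.lower() in lowered:
            raise EnvironmentFailure(f"environment error, not the expected one: {message}")
    return expected_marker.lower() in lowered
```

With a check like this, a loop that grows an object "until it gets a failure" would stop and report a broken environment instead of passing for the wrong reason.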
| Comment by auto [ 08/Mar/12 ] | ||||||||
|
Author: Tad Marshall (login: tadmarshall, email: tad@10gen.com) Message: The 32-bit Windows BuildBot is showing signs of being out of