[SERVER-15070] Unable to restart Windows mongod when filesize is close to virtual address space limit Created: 28/Aug/14 Updated: 14/Dec/15 Resolved: 14/Dec/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 2.4.0-rc1, 2.4.11, 2.6.4, 2.7.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kevin Pulo | Assignee: | Mark Benvenuto |
| Resolution: | Done | Votes: | 3 |
| Labels: | community-team | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | Windows | ||||||||
| Sprint: | Platform 8 (08/28/15), Platform 7 (08/10/15), Platform 9 (09/18/15), Platform A (10/09/15), Platform B (10/30/15), Platform C (11/20/15), Platform D (12/11/15) | ||||||||
| Description |
|
In Windows, when the size of the data files is close to half the virtual address space limit, the files can initially be opened, mapped and used (collection created and extents allocated) just fine. However, when the server is merely stopped and restarted (where extents have been allocated for a collection), it crashes because it is unable to map the files.

I've narrowed this issue down to a change between 2.4.0-rc0 and 2.4.0-rc1, though it still exists in 2.4.11, 2.6 and 2.7 (where it presents slightly differently than it does in 2.4). It looks like some of the data files might somehow be getting mapped multiple times? Some of them are certainly unmapped several times. Maybe related to

In Windows 2008 R2, the virtual address space limit for 64-bit user processes is 8TB. I've done all of this testing without journalling to simplify things, but when I was previously looking at it with journalling on, the situation was similar but with an effective limit of 4TB instead. The results are the same whether the "2008plus" or "legacy" win32 x64 builds are used.

A workaround is to use Windows 2012 R2 instead of 2008 R2, where the limit is 128TB instead of 8TB. However, this problem will still affect Windows 2012 R2 for datasets around the 32TB mark (with journalling).

By contrast, in Linux, if I use "ulimit -v 10485760" to limit the virtual address space to 10GB, then all of these versions have the expected behaviour, i.e. they are able to
Very verbose logfiles are attached. They show the results for
The Windows logfiles do not show any file allocation messages. This is because the files were allocated using an external tool that used (the Windows equivalent of) fast_allocate. (Otherwise allocating TBs of data files on Windows takes hours instead of seconds, even on SSDs. Any fast allocation bugs don't matter, since this is only testing the ability to mmap files.) You can tell when the dbpath has been cleared out by when local.ns gets allocated. A useful command to see the main timeline in each log is something like:
Some of the smaller tests were done on an i2.8xlarge instance with 8x 800GB local SSDs in RAID0 (~6TB). The tests above this size used a hs1.8xlarge with 16x 2TB local disks in RAID0. The results of the tests are:
Where failures occur, the messages are:
|
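The address-space figures quoted in the description can be sanity-checked with a little arithmetic. The sketch below is only an illustration under two assumptions read off the description: that journalling halves the effective limit (8TB becomes 4TB), and that trouble starts when the data files approach half of that effective limit. The helper names are hypothetical and are not part of MongoDB.

```python
TIB = 1024 ** 4  # bytes per TiB (the ticket writes "TB")


def effective_limit_tib(va_limit_tib, journalling):
    """Effective limit, assuming journalling halves it (8TB -> 4TB in the description)."""
    return va_limit_tib / 2 if journalling else va_limit_tib


def risky_size_tib(va_limit_tib, journalling):
    """Data size at which the restart problem is described: about half the effective limit."""
    return effective_limit_tib(va_limit_tib, journalling) / 2


# User-mode virtual address space limits stated in the description.
for name, limit_tib in (("Windows 2008 R2", 8), ("Windows 2012 R2", 128)):
    for journalling in (False, True):
        print(name, "journalling" if journalling else "no journalling",
              "-> risky around", risky_size_tib(limit_tib, journalling), "TB")

# bash's "ulimit -v" takes its argument in 1024-byte units.
ULIMIT_V_KIB = 10485760
ulimit_bytes = ULIMIT_V_KIB * 1024
print("ulimit -v", ULIMIT_V_KIB, "=", ulimit_bytes / 1024 ** 3, "GB")
```

Under these assumptions, the 128TB limit of Windows 2012 R2 with journalling gives a risky size of 32TB, which matches the figure in the description.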
| Comments |
| Comment by Mark Benvenuto [ 14/Dec/15 ] | |||||||||||
|
Fixed with | |||||||||||
| Comment by Mark Benvenuto [ 20/Nov/15 ] | |||||||||||
|
Here is the repro I used:

Operating System: Windows 2008 R2

In the data directory (z:\data\db), I pre-allocated a large number of files:
I started mongodb with the following command:
I ran the following command on mongo:
I also set the page file size to min & max 5000 MB on C:, and 50192 MB on Y:, to give the OS as much VM space as possible.

With a capped collection of 8246337208320 bytes (7.5 TB), I could not repro the issue. I believe the reason why this does not repro any longer is because of the fixes:

The error "errno:487 Attempt to access invalid address." was addressed by

I believe 3.0.6 and later will also pass this test. |
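As a quick check of the size used in this repro, the snippet below converts the capped collection size to TB and compares it with the 8TB user-mode limit that the description gives for Windows 2008 R2. It is only a sketch; the variable names are illustrative.

```python
TIB = 1024 ** 4  # bytes per TiB (the ticket writes "TB")

CAPPED_SIZE_BYTES = 8246337208320  # capped collection size from the repro
VA_LIMIT_BYTES = 8 * TIB           # Windows 2008 R2 limit quoted in the description

size_tib = CAPPED_SIZE_BYTES / TIB
headroom_tib = (VA_LIMIT_BYTES - CAPPED_SIZE_BYTES) / TIB

print("collection size:", size_tib, "TB")
print("headroom below the 8TB limit:", headroom_tib, "TB")
```

The collection therefore sits within half a terabyte of the address space limit, so the repro exercises the near-limit case.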