[SERVER-5427] assertion 10085 can't map file memory ns:CrawlQueue.system.namespaces query:{} Created: 27/Mar/12 Updated: 15/Aug/12 Resolved: 19/May/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | 2.0.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Huy Nguyen | Assignee: | Tad Marshall |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Dedicated non-VM Windows 2008 server with 56 GB of RAM, 16 cores. Single instance (not sharded or replicated) with journaling enabled |
||
| Operating System: | Windows |
| Participants: |
| Description |
|
We've been running this database for a little over 6 months now without any issue. Yesterday I got an alert that the server only had about 2 GB of disk space left, so I logged on to the server, shut the mongod process down (it was running 2.0.2), deleted the 10 GB of logs, and restarted it. The server started up fine. I am able to access most of the databases on the server just fine except for 2 databases (CrawlMeta and CrawlQueue). Every time I try to run any type of query against these databases, I get errors like so:

Tue Mar 27 14:15:44 [conn1] CreateFileMapping failed d:/data/db/CrawlQueue/CrawlQueue.141 errno:1006 The volume for a fi

CrawlMeta has about 160 GB of data and the CrawlQueue db has about 275 GB, which is fairly large but isn't a whole lot (not as large as our Linux clusters). MongoDB is running as a single instance (not sharded or replicated) with journaling enabled. The server has 56 GB of RAM.

Since the initial startup errors, we've attached an additional hard drive with about 1.6 TB to the machine & moved all the data over to this new drive D:\ (the data was initially on drive C:). However, after moving the data over, the same error still comes up. I noticed the OS only had 1 GB of swap enabled, so I bumped the swap size to 300 GB. That didn't fix it.

Please see the link for client errors and server-side logs: https://gist.github.com/2220521

I'm not sure how to get this back up and running again. I've tried version 2.0.4 with the same issue. |
| Comments |
| Comment by Huy Nguyen [ 10/May/12 ] |
|
Sorry I didn't get a chance to do this yet. When I do, I will let you know. Sorry for the very late reply; I'm swamped with projects. |
| Comment by Ian Whalen (Inactive) [ 10/May/12 ] |
|
@huy, just checking if you were able to run chkdsk or provide any other updates? |
| Comment by Tad Marshall [ 04/Apr/12 ] |
|
Can you run "chkdsk /f" on the volume and see if it reports any issues? It would help to eliminate one consideration from this issue and possibly help us resolve this. Thanks! |
| Comment by Tad Marshall [ 28/Mar/12 ] |
|
It would be great to figure out what happened, but it sounds like there isn't much to work with to figure that out. I am suspicious that some inconsistency snuck into the NTFS metadata for the file: perhaps the file extents didn't match what the NTFS $BITMAP metadata said they should be, or some error along those lines. If you see this problem again, it might be worthwhile to run "chkdsk /f" on the NTFS volume before you try a mongod.exe repair operation. The error message you got (error 1006) is Windows itself telling us that something is wrong with the file.

Depending on your RAID 1 solution, there might be a RAID-level repair function that could be tried. It might be possible to split the RAID volume and examine the two copies independently. They might both show the same problem, but perhaps one copy is good and the other copy has the problem.

Thanks for the information and I'm glad that the data isn't irreplaceable! |
| Comment by Huy Nguyen [ 28/Mar/12 ] |
|
Tad, I'm afraid you are correct. The CrawlQueue.141 file got deleted when I repaired the DB. I did not back up the data set. I should have! We are running proper RAID 1 on the server. This should prevent most, if not all, data corruption at the hardware level, since the corruption would have to occur on both disks. It doesn't, however, rule out software-level corruption in the write instructions. The only application accessing these files should be mongod and nothing else. If this comes up again, I will make a backup. The software that uses this system is able to recover and continue; it just needs to re-do some of the crawling. Not a big deal in our case. |
| Comment by Tad Marshall [ 28/Mar/12 ] |
|
Do you still have the original files that were giving you the error 1006? I have not seen that error before, but searching the web suggests that it is related to disk corruption. As your mongod.exe log is saying, the problem occurs when we try to create a file mapping object in order to map the file into memory. In other words, mongod has not even begun to look at the file. If the original files are still available, we might at least be able to figure out what is going wrong.

Since the error mentions d:/data/db/CrawlQueue/CrawlQueue.141 explicitly, it is likely that files 0 through 140 are valid and unaffected. So, possibly some manual intervention on this one file would bring back the entire database.

The main thing that repair does is create new files using the data that it is able to read from the old files. So, if it is unable to read the old data, then that data will not be copied and will be lost. Repair is very helpful in removing fragmentation from a database and in producing a usable database that was unusable before because of errors in the files, but if errors prevent it from reading most of your data then most of your data will be lost.

Thanks for the report. If you have the original CrawlQueue.141 file and it can be read and copied, it might help us figure out what is wrong with it. |
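For readers hitting the same error: the sketch below is not the MongoDB source, just a minimal illustration of the Windows API sequence described above (CreateFile, then CreateFileMapping, then MapViewOfFile). Error 1006 is ERROR_FILE_INVALID ("The volume for a file has been externally altered so that the opened file is no longer valid"), and in the reported logs it is CreateFileMapping that fails, so the rejection comes from the OS/filesystem layer before mongod reads a single byte of the file's contents.

```cpp
// Minimal sketch, not MongoDB source: memory-mapping a data file on Windows.
// In this ticket, CreateFileMapping fails with GetLastError() == 1006
// (ERROR_FILE_INVALID), i.e. the OS rejects the file based on filesystem
// state alone -- which is why running chkdsk was the first suggestion.
#include <windows.h>
#include <cstdio>

int main() {
    // Path taken from the error in the logs; adjust for your own setup.
    const char* path = "d:/data/db/CrawlQueue/CrawlQueue.141";

    HANDLE file = CreateFileA(path, GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) {
        std::printf("CreateFile failed errno:%lu\n", GetLastError());
        return 1;
    }

    // The call that fails in this ticket: create a file mapping object covering
    // the whole file (passing 0,0 for the size means "use the file's current size").
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READWRITE, 0, 0, NULL);
    if (mapping == NULL) {
        std::printf("CreateFileMapping failed %s errno:%lu\n", path, GetLastError());
        CloseHandle(file);
        return 1;
    }

    // Only after the mapping object exists is the file actually mapped and read.
    void* view = MapViewOfFile(mapping, FILE_MAP_ALL_ACCESS, 0, 0, 0);
    if (view == NULL) {
        std::printf("MapViewOfFile failed errno:%lu\n", GetLastError());
    } else {
        UnmapViewOfFile(view);
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```

If the same mapping succeeds on a copy of the file placed on a known-good volume, that would point at filesystem metadata on the original volume rather than at the file's contents.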
| Comment by Huy Nguyen [ 27/Mar/12 ] |
|
The repair completed and I am able to start the database back up. However... we've lost most of our data in CrawlMeta & CrawlQueue. The file size on disk shrunk from 160 GB to 22 GB for CrawlMeta and from 275 GB to 8 GB for CrawlQueue. Fortunately, this data is NOT mission critical to us; that's why we're running a single instance on Windows 2008. These are our internal crawlers & we'll just re-crawl most of this again in a few days. I'm just reporting this as it might help someone else who is running a similar setup in production. I'm sorry I was unable to save the original log files as we were short on space. Our crawler is back online & running now. I'm not sure if you're able to do anything with this report. |
| Comment by Eliot Horowitz (Inactive) [ 27/Mar/12 ] |
|
Those are warnings, so keep running for now. |
| Comment by Huy Nguyen [ 27/Mar/12 ] |
|
I am currently running a repair on the database but it is throwing errors continuously during the repair; I'm not sure if I should stop it. It currently looks like this: https://gist.github.com/2220609 |