[SERVER-1459] FlushViewOfFile failed 33 (and othe numbers) & Client disconnections - Details - Windows Created: 22/Jul/10 Updated: 12/Jul/16 Resolved: 24/Mar/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 1.4.4, 1.5.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vadim | Assignee: | Tad Marshall |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Windows Srv 2008 R2, HP DL160G6: 2xXeon E5550, 24 Gb RAM, 4x500 Gb SATA HDD |
||
| Attachments: |
|
||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||
| Operating System: | Windows | ||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Description |
|
I'm testing (writing alot of data) MongoDB (1.4.4 and 1.5.5) on Win 2008 R2 and have this issue: The error is (on the client side): Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. The MongoDB is still working and i can connect to it again. But, with the growth of written data these disconnects happens more frequently. And finally? i've pressed Ctrl+C and find at the log these errors: FlushFileBuffers failed 6 file: /data/db/fortest.16 As you see, client library is NoRM, Mongo starts with this command: mongod -v --cpu --syncdelay 1 -logpath e:\logs\Mongod.txt Clients stop write with last "insert fortest.posts" in the log and get the error in a minute. |
| Comments |
| Comment by Tad Marshall [ 24/Mar/12 ] | ||||||||||||
|
This comment is copied from Below text copied from I am going to close this ticket as "fixed". Most of the information in this report is rather old and refers to versions that will not be getting any further updates: 1.4.2 and 1.6.0. If we see issues related to FlushViewOfFile in version 2.1.1 or later, it's time for a new ticket. The failure of FlushViewOfFile is completely reproducible. Our "core test suite" hits it in every run in Windows. You can reproduce it yourself if you want to: build "test.exe" from the MongoDB GitHub repository, and run "test pdfile --dbpath C:\mongoTest" after creating a C:\mongoTest (or anything you like) folder. On my machine, I get the "FlushViewOfFile error 33" in seconds, every time. We have never seen crashes or data corruption that could be reliably connected with this error. The meaning of the error is that we have asked Windows to sync our memory-mapped file (a portion of a database) to disk NOW and Windows has said "not right now, I'm busy". It doesn't mean that something bad has happened and it doesn't mean that the data will not be written to disk: we just can't force it to happen at that instant. We got some help from a Microsoft engineer who talked to the "NTFS owners" and told us that if we retry repeatedly, we will eventually get a successful response from this API. That's what the new code change does. We limit "repeatedly" to one million retries or 60 seconds, whichever comes first, and this gets us past the failure we see in our test runs every day. On my home machine running Windows 7 x64, the 32-bit version succeeded after 6 retries taking 4 milliseconds. On my work machine running Windows 7 x64 under VirtualBox under Ubuntu 11.10, the 64-bit version succeeded after 24 retries taking 6400 milliseconds. If you read the code (see https://github.com/mongodb/mongo/commit/c0ed1fde6a726909ae99b928a711e3c38837a4ae in the check-in above) you will see that we report every outcome of FlushViewOfFile other than "worked the first time". This means that (for now) every situation that previously reported "FlushViewOfFile failed 33" will still report something about FlushViewOfFile. If this code change does what we think it will do, the new report will say "FlushViewOfFile for <filename> succeeded after <number> attempts taking <number> ms". After we accumulate some experience with this new code, we may (and probably will) limit the reporting to only cases where the count or time exceeds some arbitrary limits, but today's code will report any count other than 1. I hope this isn't just too-long-didn't-read, but if you see issues in 2.1.1 or later, please post a new Jira ticket and let us know if there is more to do. Thanks! | ||||||||||||
| Comment by Tad Marshall [ 05/Dec/11 ] | ||||||||||||
|
I did some more research and single-stepped through some of the NtFlushVirtualMemory() routine that FlushViewOfFile() translates into. NtFlushVirtualMemory() eventually calls MiFlushSectionInternal(), and this is the routine that returns the STATUS_FILE_LOCK_CONFLICT (0xC0000054) error that gets translated into DOS error 33 (ERROR_LOCK_VIOLATION) that we see in the return from FlushViewOfFile(). The funny thing is that NtFlushVirtualMemory() has code to detect this specific error code, and it retries 5 times before returning the error to us. Any other error is returned without retrying ... so Microsoft knows something about this failure, or the code thinks it does. More research is needed. | ||||||||||||
| Comment by Tad Marshall [ 29/Nov/11 ] | ||||||||||||
|
In following up on Dwight's comment from Mar 08 2011 05:10:21 PM UTC (in which he provides a simple and reproducible way to reproduce the FlushViewOfFile() failure with ERROR_LOCK_VIOLATION), this bug shows some interesting behavior. 1) It happens exactly once in each run of 'test --dbpath \data\db\unittest -v --debug --dur --bigfiles' and in the same test every time. It doesn't look random at all. 2) The failing test is called ExtentAllocOrder and the point at which we fail is from within dropDatabase() while syncing the journal with getDur().syncDataAndTruncateJournal(). A call to flushAll() loops through all open memory mapped files, but it is always unittest_ex.3 that triggers the failure. 3) On the theory that this was a race condition in which Windows is busy doing something and we hit an internal Windows lock, I tried adding a Sleep(1) call just before the FlushViewOfFile() call to wait for 1 millisecond before the flush to let Windows catch up. It didn't change anything. I increased the timeout to 10 ms. Fail again. 100 ms, 1000 ms and then 10000 ms changed nothing. In other words, giving Windows 10 seconds to "catch up" changes nothing. This doesn't look like a race condition, and the lock we hit is probably one we set. At least that's what it looks like right now. 4) Matthias suggested calling flushAll() before dropping the database to see what happened. The pre-call to flushAll() doesn't fail, we fail in the same place in the same way even after having called FlushViewOfFile() just an instant beforehand. So, it's not any pending data requiring flushing that is getting in the way, it's all activity that we trigger in dropping the database. I'm still digging, but I wanted to get some notes written down. | ||||||||||||
| Comment by Tad Marshall [ 28/Nov/11 ] | ||||||||||||
|
I think that this is "an OS issue" and may require working with Microsoft to get it fixed. From way back in Windows 2000 days, MS posted this Knowledge Base article: http://support.microsoft.com/kb/266749 . It says "When you create a file with the CreateFileMapping function, you may receive an ERROR_LOCK_VIOLATION response. This error occurs if the NTFS Transaction log becomes full while you are creating the file mapping. Memory Manager is not waiting long enough for NTFS to flush the transaction log and returns the error, causing the function not to succeed. To resolve this problem, obtain the latest service pack for Windows 2000." Right. Here's how the folks working on the OpenJDK addressed it in MappedByteBuffer.c:
I'll investigate how we might get some Microsoft attention to the problem. | ||||||||||||
| Comment by Dwight Merriman [ 08/Mar/11 ] | ||||||||||||
|
i am able to reproduce this with 1.9.x and: ./test --dur pdfile as the command line. i did some digging and could not resolve. a couple notes:
we should open a ticket with msft Tue Mar 08 12:12:57 [testsuite] FlushViewOfFile failed 33 file: /tmp/unittest/unittest_ex.6 | ||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 22/Jul/10 ] | ||||||||||||
|
Given we're getting ready to release 1.6 - not going to look into fixing 1.4.4 right now | ||||||||||||
| Comment by Vadim [ 22/Jul/10 ] | ||||||||||||
|
I'm very sorry. More tests show this problem ONLY in 1.4.4 (not in 1.5.5) |