[SERVER-1459] FlushViewOfFile failed 33 (and other numbers) & Client disconnections - Details - Windows Created: 22/Jul/10  Updated: 12/Jul/16  Resolved: 24/Mar/12

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 1.4.4, 1.5.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vadim Assignee: Tad Marshall
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows Srv 2008 R2, HP DL160G6: 2x Xeon E5550, 24 GB RAM, 4x 500 GB SATA HDD


Attachments: Zip Archive Mongo-log.zip    
Issue Links:
Duplicate
duplicates SERVER-1163 Writing 44K records a second and mong... Closed
Related
is related to SERVER-1163 Writing 44K records a second and mong... Closed
is related to SERVER-5819 Use FileAllocator on Windows Closed
Backwards Compatibility: Fully Compatible
Operating System: Windows
Participants:

 Description   

I'm testing MongoDB (1.4.4 and 1.5.5) on Win 2008 R2 by writing a lot of data, and have this issue:
During "new extent allocations" I see the following error: "FlushViewOfFile failed 33".
In almost all cases this is not a problem (apart from a delay of several seconds in writing), but sometimes it leads to client disconnects (all clients at the same time, with an identical error).

The error is (on the client side):

Unable to write data to the transport connection: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
at System.Net.Sockets.NetworkStream.Write(Byte[] buffer, Int32 offset, Int32 size)
at Norm.Connection.Write(Byte[] bytes, Int32 start, Int32 size)
at Norm.Protocol.Messages.InsertMessage`1.Execute()
at Norm.Collections.MongoCollection`1.Insert(IEnumerable`1 documentsToInsert)
...

MongoDB is still running and I can connect to it again. But as the amount of written data grows, these disconnects happen more frequently.
Writing data in 2 threads (1 doc has 7 attributes and is about 1 kB in size) fails:
1. First time - after 15 million docs written.
2. Second - after 5 million.
3. Third - after 2 million.
By the third time the database is about 50 GB.

Finally, I pressed Ctrl+C and found these errors in the log:

FlushFileBuffers failed 6 file: /data/db/fortest.16
FlushFileBuffers failed 6 file: /data/db/fortest.17
FlushFileBuffers failed 6 file: /data/db/fortest.18
FlushFileBuffers failed 6 file: /data/db/fortest.19
FlushFileBuffers failed 6 file: /data/db/fortest.20
FlushFileBuffers failed 6 file: /data/db/fortest.21
FlushFileBuffers failed 6 file: /data/db/fortest.22
FlushFileBuffers failed 6 file: /data/db/fortest.23
FlushFileBuffers failed 6 file: /data/db/fortest.24
FlushViewOfFile failed 487 file: /data/db/fortest.26
FlushFileBuffers failed 6 file: /data/db/fortest.26
FlushViewOfFile failed 487 file: /data/db/fortest.0
FlushFileBuffers failed 6 file: /data/db/fortest.0
FlushViewOfFile failed 487 file: /data/db/fortest.1
FlushFileBuffers failed 6 file: /data/db/fortest.1
FlushViewOfFile failed 487 file: /data/db/fortest.2
FlushFileBuffers failed 6 file: /data/db/fortest.2
FlushFileBuffers failed 6 file: /data/db/fortest.3
---------

As you can see, the client library is NoRM.
The Mongo version is 1.4.4 64-bit (1.5.5 shows the same behavior).

Mongo starts with this command: mongod -v --cpu --syncdelay 1 -logpath e:\logs\Mongod.txt
The log file is attached (I've cut off the middle part).

Clients stop writing at the last "insert fortest.posts" entry in the log and get the error about a minute later.



 Comments   
Comment by Tad Marshall [ 24/Mar/12 ]

This comment is copied from SERVER-1163. The specific versions referenced here (1.4.4 and 1.5.5) are different, but the time frame is the same and the FlushViewOfFile error 33 issues are the same. The extra ingredient in this bug report is a few different FlushViewOfFile error codes and some FlushFileBuffers failures as well. But no new customer information has been added since Jul 22 2010, and that report said the problems were not reproducible in version 1.5.5. Since the FlushViewOfFile failure has now been addressed, and we don't have anything to work with for the other errors, this ticket should be closed.

Below text copied from SERVER-1163:

I am going to close this ticket as "fixed". Most of the information in this report is rather old and refers to versions that will not be getting any further updates: 1.4.2 and 1.6.0. If we see issues related to FlushViewOfFile in version 2.1.1 or later, it's time for a new ticket.

The failure of FlushViewOfFile is completely reproducible. Our "core test suite" hits it in every run on Windows. You can reproduce it yourself if you want to: build "test.exe" from the MongoDB GitHub repository, and run "test pdfile --dbpath C:\mongoTest" after creating a C:\mongoTest (or anything you like) folder. On my machine, I get the "FlushViewOfFile error 33" in seconds, every time.

We have never seen crashes or data corruption that could be reliably connected with this error. The meaning of the error is that we have asked Windows to sync our memory-mapped file (a portion of a database) to disk NOW and Windows has said "not right now, I'm busy". It doesn't mean that something bad has happened and it doesn't mean that the data will not be written to disk: we just can't force it to happen at that instant.

We got some help from a Microsoft engineer who talked to the "NTFS owners" and told us that if we retry repeatedly, we will eventually get a successful response from this API.

That's what the new code change does. We limit "repeatedly" to one million retries or 60 seconds, whichever comes first, and this gets us past the failure we see in our test runs every day. On my home machine running Windows 7 x64, the 32-bit version succeeded after 6 retries taking 4 milliseconds. On my work machine running Windows 7 x64 under VirtualBox under Ubuntu 11.10, the 64-bit version succeeded after 24 retries taking 6400 milliseconds.

If you read the code (see https://github.com/mongodb/mongo/commit/c0ed1fde6a726909ae99b928a711e3c38837a4ae in the check-in above) you will see that we report every outcome of FlushViewOfFile other than "worked the first time". This means that (for now) every situation that previously reported "FlushViewOfFile failed 33" will still report something about FlushViewOfFile. If this code change does what we think it will do, the new report will say "FlushViewOfFile for <filename> succeeded after <number> attempts taking <number> ms". After we accumulate some experience with this new code, we may (and probably will) limit the reporting to only cases where the count or time exceeds some arbitrary limits, but today's code will report any count other than 1.
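
For readers who want the shape of that change without digging through the commit, here is a minimal sketch in C of a bounded retry loop like the one described: retry on ERROR_LOCK_VIOLATION until an attempt cap or a time limit is hit, and report any outcome other than first-try success. The helper name and the exact limits are illustrative assumptions, not the actual MongoDB code (see the commit linked above for that).

    #include <windows.h>
    #include <stdio.h>

    /* Hedged sketch of the bounded retry described above; the helper name
     * and the exact limits are illustrative, not the MongoDB source. */
    static BOOL flushViewWithRetry(LPCVOID base, SIZE_T len, const char* filename)
    {
        const int maxAttempts = 1000000;    /* "one million retries" ...       */
        const DWORD maxMillis = 60 * 1000;  /* ... "or 60 seconds", first wins */
        DWORD start = GetTickCount();
        DWORD err = ERROR_LOCK_VIOLATION;
        int attempt;

        for (attempt = 1; attempt <= maxAttempts; ++attempt) {
            if (FlushViewOfFile(base, len)) {
                if (attempt > 1)  /* report anything other than first-try success */
                    printf("FlushViewOfFile for %s succeeded after %d attempts "
                           "taking %lu ms\n", filename, attempt,
                           (unsigned long)(GetTickCount() - start));
                return TRUE;
            }
            err = GetLastError();
            if (err != ERROR_LOCK_VIOLATION ||
                GetTickCount() - start >= maxMillis)
                break;   /* a different error, or we ran out of time */
            Sleep(0);    /* yield so the memory manager can make progress */
        }
        printf("FlushViewOfFile failed %lu file: %s\n",
               (unsigned long)err, filename);
        return FALSE;
    }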

I hope this isn't just too-long-didn't-read, but if you see issues in 2.1.1 or later, please post a new Jira ticket and let us know if there is more to do. Thanks!

Comment by Tad Marshall [ 05/Dec/11 ]

I did some more research and single-stepped through some of the NtFlushVirtualMemory() routine that FlushViewOfFile() translates into. NtFlushVirtualMemory() eventually calls MiFlushSectionInternal(), and this is the routine that returns the STATUS_FILE_LOCK_CONFLICT (0xC0000054) error that gets translated into DOS error 33 (ERROR_LOCK_VIOLATION) that we see in the return from FlushViewOfFile(). The funny thing is that NtFlushVirtualMemory() has code to detect this specific error code, and it retries 5 times before returning the error to us. Any other error is returned without retrying ... so Microsoft knows something about this failure, or the code thinks it does. More research is needed.
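
Incidentally, the NTSTATUS-to-Win32 translation described here can be confirmed without a debugger via the documented RtlNtStatusToDosError() API. A small standalone check (assuming you link against ntdll.lib) looks like this:

    #include <windows.h>
    #include <winternl.h>
    #include <stdio.h>

    #pragma comment(lib, "ntdll.lib")  /* RtlNtStatusToDosError lives in ntdll */

    int main(void)
    {
        /* STATUS_FILE_LOCK_CONFLICT, the kernel status code mentioned above */
        NTSTATUS status = (NTSTATUS)0xC0000054L;

        /* Expect 33, i.e. ERROR_LOCK_VIOLATION */
        ULONG win32 = RtlNtStatusToDosError(status);
        printf("NTSTATUS 0x%08lX -> Win32 error %lu (ERROR_LOCK_VIOLATION = %d)\n",
               (unsigned long)(ULONG)status, win32, ERROR_LOCK_VIOLATION);
        return 0;
    }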

Comment by Tad Marshall [ 29/Nov/11 ]

In following up on Dwight's comment from Mar 08 2011 05:10:21 PM UTC (in which he provides a simple, reliable way to reproduce the FlushViewOfFile() failure with ERROR_LOCK_VIOLATION), this bug shows some interesting behavior.

1) It happens exactly once in each run of 'test --dbpath \data\db\unittest -v --debug --dur --bigfiles' and in the same test every time. It doesn't look random at all.

2) The failing test is called ExtentAllocOrder, and the failure occurs within dropDatabase() while syncing the journal with getDur().syncDataAndTruncateJournal(). A call to flushAll() loops through all open memory-mapped files, but it is always unittest_ex.3 that triggers the failure.

3) On the theory that this was a race condition in which Windows is busy doing something and we hit an internal Windows lock, I tried adding a Sleep(1) call just before the FlushViewOfFile() call, waiting 1 millisecond to let Windows catch up (a sketch of this experiment appears after this list). It didn't change anything. I increased the delay to 10 ms. Fail again. 100 ms, 1000 ms and then 10000 ms changed nothing. In other words, giving Windows 10 seconds to "catch up" changes nothing. This doesn't look like a race condition, and the lock we hit is probably one we set. At least that's what it looks like right now.

4) Matthias suggested calling flushAll() before dropping the database to see what happened. The pre-call to flushAll() doesn't fail; we fail in the same place in the same way even after having called FlushViewOfFile() just an instant beforehand. So it's not pending data requiring flushing that is getting in the way; it's activity that we trigger in dropping the database.
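
For reference, here is a minimal sketch in C of the delay experiment from (3). The function name is illustrative and 'view' stands in for the base address of the mapped file; this is not the actual test harness.

    #include <windows.h>
    #include <stdio.h>

    /* Sketch of the experiment in (3): delay, then flush, to test the
     * race-condition theory. 'view' is the mapped view's base address;
     * the function name is illustrative. */
    static BOOL flushAfterDelay(LPCVOID view, DWORD delayMs)
    {
        Sleep(delayMs);                /* tried 1, 10, 100, 1000, 10000 ms */
        if (FlushViewOfFile(view, 0))  /* 0 = flush the entire mapped view */
            return TRUE;
        printf("FlushViewOfFile failed %lu after a %lu ms delay\n",
               (unsigned long)GetLastError(), (unsigned long)delayMs);
        return FALSE;                  /* failed identically at every delay tried */
    }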

I'm still digging, but I wanted to get some notes written down.

Comment by Tad Marshall [ 28/Nov/11 ]

I think that this is "an OS issue" and may require working with Microsoft to get it fixed.

From way back in Windows 2000 days, MS posted this Knowledge Base article: http://support.microsoft.com/kb/266749 .

It says "When you create a file with the CreateFileMapping function, you may receive an ERROR_LOCK_VIOLATION response. This error occurs if the NTFS Transaction log becomes full while you are creating the file mapping. Memory Manager is not waiting long enough for NTFS to flush the transaction log and returns the error, causing the function not to succeed. To resolve this problem, obtain the latest service pack for Windows 2000." Right.

Here's how the folks working on the OpenJDK addressed it in MappedByteBuffer.c:

    /*
     * FlushViewOfFile can fail with ERROR_LOCK_VIOLATION if the memory
     * system is writing dirty pages to disk. As there is no way to
     * synchronize the flushing then we retry a limited number of times.
     *
     * (Declarations added here for context; in the original source, 'a' is
     * the base address of the mapped region and 'len' its length, both
     * supplied by the surrounding JNI function.)
     */
    int retry = 0;
    BOOL result;
    do {
        result = FlushViewOfFile(a, (DWORD)len);
        if ((result != 0) || (GetLastError() != ERROR_LOCK_VIOLATION))
            break;
        retry++;
    } while (retry < 3);

I'll investigate how we might get some Microsoft attention to the problem.

Comment by Dwight Merriman [ 08/Mar/11 ]

I am able to reproduce this with 1.9.x and:

./test --dur pdfile

as the command line.

I did some digging and could not resolve it. A couple of notes:

We should open a ticket with Microsoft.

Tue Mar 08 12:12:57 [testsuite] FlushViewOfFile failed 33 file: /tmp/unittest/unittest_ex.6

Comment by Eliot Horowitz (Inactive) [ 22/Jul/10 ]

Given that we're getting ready to release 1.6, we're not going to look into fixing 1.4.4 right now.

Comment by Vadim [ 22/Jul/10 ]

I'm very sorry. More tests show this problem ONLY in 1.4.4 (not in 1.5.5).
