[SERVER-9355] Mongodb crashed after - FlushViewOfFile for F:/data/xq.4 failed with error 1117 after 1 attempts taking 7076 ms Created: 15/Apr/13  Updated: 10/Dec/14  Resolved: 23/May/13

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: John Woakes Assignee: Unassigned
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Azure Worker Role - 3 replica set databases on 3 Azure instances


Attachments: Text File MongoCrash.log     Microsoft Word WADWindowsEventLogsTable.csv    
Issue Links:
Related
is related to SERVER-13681 MongoDB stalls during background flus... Closed
Operating System: Windows
Steps To Reproduce:

This appears to be a random event.

Participants:

 Description   

This happened in our production system in the middle of the night on a Sunday morning. I cannot see anything that might have triggered this. The database did recover once we restarted mongod.

See the attached log file for more details. This is the failure extracted from that file.

[DataFileSync] FlushViewOfFile for F:/data/xq.4 failed with error 1117 after 1 attempts taking 7076 ms
[DataFileSync]   Fatal Assertion 16387
[DataFileSync] mongod.exe    ...\src\mongo\util\stacktrace.cpp(161)                           mongo::printStackTrace+0x3e
[DataFileSync] mongod.exe    ...\src\mongo\util\assert_util.cpp(126)                          mongo::fassertFailed+0x43
[DataFileSync] mongod.exe    ...\src\mongo\util\mmap_win.cpp(375)                             mongo::WindowsFlushable::flush+0x425
[DataFileSync] mongod.exe    ...\src\mongo\util\mmap.cpp(183)                                 mongo::MongoFile::_flushAll+0x2b3
[DataFileSync] mongod.exe    ...\src\mongo\db\db.cpp(427)                                     mongo::DataFileSync::run+0x24c
[DataFileSync] mongod.exe    ...\src\mongo\util\background.cpp(64)                            mongo::BackgroundJob::jobBody+0x25c
[DataFileSync] mongod.exe    ...\src\third_party\boost\boost\bind\mem_fn_template.hpp(165)    boost::_mfi::mf1<void,mongo::BackgroundJob,boost::shared_ptr<mongo::BackgroundJob::JobStatus> >::operator()+0x47
[DataFileSync] mongod.exe    ...\src\third_party\boost\boost\thread\detail\thread.hpp(63)     boost::detail::thread_data<boost::_bi::bind_t<void,boost::_mfi::mf1<void,mongo::BackgroundJob,boost::shared_ptr<mongo::BackgroundJob::JobStatus> >,boost::_bi::list2<boost::_bi::value<mongo::BackgroundJob * __ptr64>,boost::_bi::value<boost::shared_ptr<mongo::BackgroundJob::JobStatus> > > > >::run+0x31
[DataFileSync] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180)  boost::`anonymous namespace'::thread_start_function+0x21
[DataFileSync] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314)      _callthreadstartex+0x17
[DataFileSync] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292)      _threadstartex+0x7f
[DataFileSync] kernel32.dll                                                                   BaseThreadInitThunk+0xd
[DataFileSync]
***aborting after fassert() failure
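
For context, the stack trace above corresponds to mongod's background data file flush: the DataFileSync thread flushes each memory-mapped data file with FlushViewOfFile and treats a persistent failure as fatal. The following is a minimal C++ sketch of that retry-then-abort pattern, assuming Win32 memory-mapped files; it is not the actual mongod source, and the names flushView and dataFileSyncPass are hypothetical.

// Hypothetical sketch (not the mongod source): flush a mapped region and
// abort the process if the flush cannot be completed, matching the
// "Fatal Assertion ... aborting after fassert() failure" seen in the log.
#include <windows.h>
#include <cstdio>
#include <cstdlib>

// Attempt FlushViewOfFile with a small number of retries.
// 'view' and 'len' describe a mapped region such as F:/data/xq.4.
bool flushView(void* view, size_t len, const char* filename, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
        if (FlushViewOfFile(view, len)) {
            return true;  // dirty pages handed to the OS for writeback
        }
        DWORD err = GetLastError();  // e.g. 1117 == ERROR_IO_DEVICE
        std::fprintf(stderr,
                     "FlushViewOfFile for %s failed with error %lu after %d attempts\n",
                     filename, err, attempt);
        Sleep(100);  // brief backoff before retrying
    }
    return false;
}

void dataFileSyncPass(void* view, size_t len, const char* filename) {
    if (!flushView(view, len, filename, 1)) {
        // A failed flush means durability can no longer be guaranteed,
        // so the server aborts rather than continue with unwritten data.
        std::fprintf(stderr, "***aborting after fassert() failure\n");
        std::abort();
    }
}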



 Comments   
Comment by John Woakes [ 23/May/13 ]

I am not sure what information to post that is not already here. The support number is 113041610370474 and the agent's name is Mike Wong.

I will update this ticket if I get more information.

Comment by Daniel Pasette (Inactive) [ 23/May/13 ]

Thanks for the update, John. If you're able to post information here that we can use to track this issue, it would be much appreciated. Closing this for now, as I don't think there's anything that can be done on the MongoDB side.

Comment by John Woakes [ 22/May/13 ]

Finally got this from Microsoft Azure Support...
-----------------------------------------------------
So the product group finally got back to me over the weekend. Apparently there is a platform bug on the Azure side: a lease request that executes right around the time the lease expires can clear the lease after it has been checked as valid, which results in drive unmounts. We can see in the metrics that the lease renewal occurs around 3 seconds too late in our scenario. They are working on a fix and expect it to be rolled out to production environments in an upcoming release milestone. I am trying to get more exact dates now.
-----------------------------------------------------

Comment by John Woakes [ 24/Apr/13 ]

We are still having Mongo crash or get into an unstable condition nearly daily. I have opened a ticket with Azure and they are trying to get to the root of this. I have been working with them on it. Hopefully we will get an answer soon.

Comment by Stennie Steneker (Inactive) [ 24/Apr/13 ]

Hi John,

Do you have any update on this issue?

Thanks,
Stephen

Comment by John Woakes [ 18/Apr/13 ]

This is the latest from MS

So I installed DebugDiag, which is set to write a dump file if mongod.exe crashes. DebugDiag is on all three instances of MongoWorker. Hopefully we'll get a dump to analyze soon. I will check on it again later today and tomorrow morning.

Comment by John Woakes [ 16/Apr/13 ]

Thanks Dan, I have opened a ticket with Azure support. I will let you know what happens.

Comment by Daniel Pasette (Inactive) [ 16/Apr/13 ]

In the EventLogs you've posted, I see the following event for the ProviderName "WaDrivePrt" at timestamp 2013-04-14T07:15:41.7262866Z:

'/mongoddblob0.vhd' failed to renew lease the specified XDisk.

This is about 90 seconds after the crash occurred while flushing to disk. This appears to be an issue with the disk subsystem in Azure. Have you tried raising this with Azure support, using the same details you've included in this ticket?

Comment by John Woakes [ 15/Apr/13 ]

This is the Windows Event Logs from the period.

Comment by Tad Marshall [ 15/Apr/13 ]

Error 1117 is ERROR_IO_DEVICE: The request could not be performed because of an I/O device error.

Can you see if there is anything in the Windows or Azure event logs corresponding to this error?
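
As a quick way to confirm how Windows describes that code, here is a minimal sketch (an assumption about how one might check it, not part of this ticket) that asks the system for the message text of error 1117 via the Win32 FormatMessageA API:

// Translate a Win32 error code such as 1117 into its system message,
// confirming that 1117 == ERROR_IO_DEVICE.
#include <windows.h>
#include <cstdio>

int main() {
    DWORD code = 1117;  // the error reported by FlushViewOfFile in the log
    char msg[512] = {0};
    FormatMessageA(FORMAT_MESSAGE_FROM_SYSTEM | FORMAT_MESSAGE_IGNORE_INSERTS,
                   nullptr, code, 0, msg, sizeof(msg), nullptr);
    // Expected output: "The request could not be performed because of an
    // I/O device error."
    std::printf("Error %lu: %s", code, msg);
    return 0;
}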
