[SERVER-6477] jstests/dur/closeall.js -- WindowsFlushable::flush triggers fassert after FlushViewOfFile fails with code 487 (ERROR_INVALID_ADDRESS) Created: 17/Jul/12  Updated: 11/Jul/16  Resolved: 06/Aug/12

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 2.2.0-rc1

Type: Bug Priority: Major - P3
Reporter: Aaron Staple Assignee: Tad Marshall
Resolution: Done Votes: 0
Labels: buildbot
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: Windows
Participants:

 Description   

This occurs in the context of a test that repeatedly closes databases.

<http://buildlogs.mongodb.org/build/500477d7d2a60f1426000cf5/test/500495eed2a60f6d76000a43/>

There is a known issue where the test in question might trigger a double exception (see SERVER-5101, SERVER-5724), but there's no specific evidence from the logs that double exceptions occurred in this run.



 Comments   
Comment by auto [ 06/Aug/12 ]

Author:

{u'date': u'2012-08-06T12:16:37-07:00', u'email': u'tad@10gen.com', u'name': u'Tad Marshall'}

Message: SERVER-6477 don't unlock between selecting and flushing file

In MongoFile::_flushAll(), don't pick a file to flush in a
LockMongoFilesShared and then release the lock before flushing.
Opens a race condition where another thread can close the file
before we hit the flush call.
Branch: master
https://github.com/mongodb/mongo/commit/e4887227c71885769f141c5d47161c1604c9e97e

Comment by Tad Marshall [ 06/Aug/12 ]

In all the cases I looked at, the failing call to WindowsFlushable::flush is always from line 185 in src/mongo/util/mmap.cpp. At this point in the code, we have selected a file to flush while locked, and then we unlock before calling the flush routine. This unlocking allows another thread to close the database before we get to call flush(). In fact, we are probably holding a pointer to freed memory at that instant, so a variety of segfaults or memory corruptions are possible, but in the normal case the memory is intact and we are just trying to flush a file to disk when the file is no longer open.

The difference between Linux and Windows is that the viewForFlushing() routine that sets up the address to use for flushing is called inside the lock in Windows and is called after the lock is released in Linux. This means that a thread switch triggered by releasing the lock will be much more likely to throw the race condition in Windows than in Linux, but there is an opportunity for a race in either OS.

Comment by Tad Marshall [ 23/Jul/12 ]

Happened again:

http://buildbot.mongodb.org/builders/Windows%2064-bit%202008%2B/builds/504
http://buildbot.mongodb.org/builders/Windows%2064-bit%202008%2B/builds/504/steps/test_9/logs/stdio
http://buildlogs.mongodb.org/build/500d1ef6d2a60f6d0d00079a/test/500d3b4fd2a60f3ab4000d1e/

 m30001| Mon Jul 23 11:55:24 [conn2] command admin.$cmd command: { closeAllDatabases: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:206481 reslen:93 207ms
 m30001| Mon Jul 23 11:55:24 [conn3] Assertion: 15927:can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error
 m30001| Mon Jul 23 11:55:24 [conn3] opening db:  local
 m30001| mongod.exe    ...\src\mongo\util\assert_util.cpp(154)                          mongo::msgasserted+0xe3
 m30001| mongod.exe    ...\src\mongo\db\database.cpp(434)                               mongo::DatabaseHolder::getOrCreate+0x360
 m30001| mongod.exe    ...\src\mongo\db\client.cpp(305)                                 mongo::Client::Context::_finishInit+0xb0
 m30001| mongod.exe    ...\src\mongo\db\client.cpp(212)                                 mongo::Client::Context::Context+0xea
 m30001| mongod.exe    ...\src\mongo\db\client.cpp(244)                                 mongo::Client::ReadContext::ReadContext+0x324
 m30001| mongod.exe    ...\src\mongo\db\instance.cpp(672)                               mongo::receivedGetMore+0x33c
 m30001| mongod.exe    ...\src\mongo\db\instance.cpp(395)                               mongo::assembleResponse+0x310
 m30001| mongod.exe    ...\src\mongo\db\db.cpp(193)                                     mongo::MyMessageHandler::process+0xf5
 m30001| mongod.exe    ...\src\mongo\util\net\message_server_port.cpp(87)               mongo::pms::threadRun+0x59a
 m30001| mongod.exe    ...\src\mongo\util\stacktrace.cpp(172)                           mongo::printStackTrace+0x101
 m30001| mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180)  boost::`anonymous namespace'::thread_start_function+0x21
 m30001| mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292)      _threadstartex+0x7f
 m30001| kernel32.dll                                                                   BaseThreadInitThunk+0xd
 m30001| Mon Jul 23 11:55:24 [conn3] { $err: "can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error", code: 15927 }
 m30001| mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314)      _callthreadstartex+0x17
 m30001| Mon Jul 23 11:55:24 [conn3] getmore local.oplog.$main cursorid:112960654558991 ntoreturn:0 keyUpdates:0 exception: can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error code:15927 locks(micros) W:34 r:419254 nreturned:0 reslen:20 626ms
 m30001| Mon Jul 23 11:55:24 [conn2] DatabaseHolder::closeAll path:/data/db/closeall
 m30001| Mon Jul 23 11:55:24 [conn2] command admin.$cmd command: { closeAllDatabases: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:206284 reslen:93 206ms
 m30001| Mon Jul 23 11:55:24 [conn2] DatabaseHolder::closeAll path:/data/db/closeall
 m30001| Mon Jul 23 11:55:24 [conn2] command admin.$cmd command: { closeAllDatabases: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:205661 reslen:93 206ms
 m30002| Mon Jul 23 11:55:24 [replslave] repl: sleep 2 sec before next pass
2012-07-23 07:55:26 EDT	
 m30001| Mon Jul 23 11:55:24 [conn2] DatabaseHolder::closeAll path:/data/db/closeall
 m30001| Mon Jul 23 11:55:25 [conn2] DurParanoid map check 161ms for 224MB
 m30001| Mon Jul 23 11:55:25 [conn2] DatabaseHolder::closeAll path:/data/db/closeall
 m30001| Mon Jul 23 11:55:25 [DataFileSync] FlushViewOfFile for /data/db/closeall/closealltest.0 failed with error 487 after 1 attempts taking 6 ms
 m30001| Mon Jul 23 11:55:25 [DataFileSync]   Fatal Assertion 16387
 m30001| Mon Jul 23 11:55:25 [conn2] command admin.$cmd command: { closeAllDatabases: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:204245 reslen:93 205ms
 m30001| Mon Jul 23 11:55:25 [conn2] DatabaseHolder::closeAll path:/data/db/closeall
 m30001| Mon Jul 23 11:55:25 [conn2] command admin.$cmd command: { closeAllDatabases: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:204005 reslen:93 204ms
 m30001| Mon Jul 23 11:55:25 [conn2] DatabaseHolder::closeAll path:/data/db/closeall
 m30001| Mon Jul 23 11:55:25 [conn2] command admin.$cmd command: { closeAllDatabases: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:207305 reslen:93 208ms
 m30001| Mon Jul 23 11:55:25 [conn2] DatabaseHolder::closeAll path:/data/db/closeall
 m30001| Mon Jul 23 11:55:25 [conn2] command admin.$cmd command: { closeAllDatabases: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:211791 reslen:93 212ms
 m30001| Mon Jul 23 11:55:26 [conn2] DatabaseHolder::closeAll path:/data/db/closeall
 m30001| Mon Jul 23 11:55:26 [conn2] command admin.$cmd command: { closeAllDatabases: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:205611 reslen:93 206ms
 m30001| Mon Jul 23 11:55:26 [conn2] DatabaseHolder::closeAll path:/data/db/closeall
 m30001| Mon Jul 23 11:55:26 [conn2] command admin.$cmd command: { closeAllDatabases: 1.0 } ntoreturn:1 keyUpdates:0 locks(micros) W:213280 reslen:93 213ms
 m30001| mongod.exe    ...\src\mongo\util\assert_util.cpp(126)                          mongo::fassertFailed+0x43
 m30001| mongod.exe    ...\src\mongo\util\mmap_win.cpp(385)                             mongo::WindowsFlushable::flush+0x40d
 m30001| mongod.exe    ...\src\mongo\util\mmap.cpp(185)                                 mongo::MongoFile::_flushAll+0x2bd
 m30001| mongod.exe    ...\src\mongo\db\db.cpp(427)                                     mongo::DataFileSync::run+0x24c
 m30001| mongod.exe    ...\src\mongo\util\background.cpp(64)                            mongo::BackgroundJob::jobBody+0x25c
 m30001| mongod.exe    ...\src\mongo\util\stacktrace.cpp(172)                           mongo::printStackTrace+0x101
 m30001| mongod.exe    ...\src\third_party\boost\boost\bind\mem_fn_template.hpp(165)    boost::_mfi::mf1<void,mongo::BackgroundJob,boost::shared_ptr<mongo::BackgroundJob::JobStatus> >::operator()+0x47
 m30001| mongod.exe    ...\src\third_party\boost\boost\thread\detail\thread.hpp(63)     boost::detail::thread_data<boost::_bi::bind_t<void,boost::_mfi::mf1<void,mongo::BackgroundJob,boost::shared_ptr<mongo::BackgroundJob::JobStatus> >,boost::_bi::list2<boost::_bi::value<mongo::BackgroundJob * __ptr64>,boost::_bi::value<boost::shared_ptr<mongo::BackgroundJob::JobStatus> > > > >::run+0x31
 m30001| mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180)  boost::`anonymous namespace'::thread_start_function+0x21
 m30001| mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292)      _threadstartex+0x7f
 m30001| mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314)      _callthreadstartex+0x17
 m30001| Mon Jul 23 11:55:26 [DataFileSync] 
 m30001| 
 m30001| ***aborting after fassert() failure

Comment by Tad Marshall [ 23/Jul/12 ]

The code already tests for _view being zero, so that is not what is happening.

Comment by Tad Marshall [ 20/Jul/12 ]

It seems like we're trying to flush the local database, but we didn't manage to open it due to "can't open database in a read lock. if db was just closed, consider retrying the query. might otherwise indicate an internal error". DOS error 487 is ERROR_INVALID_ADDRESS ... we should print the address to check, but I suspect it will be zero, meaning that we never mapped it in the first place. The old code (prior to the addition of the fassert) would have just logged this and moved on.

Comment by Aaron Staple [ 20/Jul/12 ]

Happened to see it again:

<http://buildlogs.mongodb.org/build/5008ccf2d2a60f13f90009c6/test/5008ea98d2a60f5ddc000f39/>

Comment by Aaron Staple [ 17/Jul/12 ]

This could potentially have been caused by the memory corruption in SERVER-6481. Let's see if it happens again.

Here's another run of the same test reporting memory corruption.
<http://buildlogs.mongodb.org/build/500492a4d2a60f6a260005c2/test/5004afe5d2a60f4e42000cbb/>

Generated at Thu Feb 08 03:11:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.