[SERVER-7068] Windows access violation in NamespaceDetails::__stdAlloc() during replication Created: 18/Sep/12  Updated: 15/Feb/13  Resolved: 06/Nov/12

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tad Marshall Assignee: Tad Marshall
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows 64
Server 2008 R2 Ent.
7 GB RAM
Windows Azure?


Attachments: Zip Archive log.zip    
Issue Links:
Duplicate
duplicates SERVER-7238 power of 2 allocation breaks with doc... Closed
Related
related to SERVER-7077 Failure to journal list changes in Na... Closed
Operating System: ALL
Participants:

 Description   

We have at least one and maybe two reports of an access violation in v2.2.0 on Windows following the log message "~~ Assertion - cur out of range in _alloc()", which comes from NamespaceDetails::__stdAlloc() (line 335 in src/mongo/db/namespace_details.cpp). One report includes a stack trace.

The value following "_alloc()" is supposed to be a DiskLoc found in deletedList bucket 18. Other samples of this assertion refer to buckets 5 and 9, all with chain:1.
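
For orientation, the sketch below illustrates the kind of check involved: each deletedList bucket is a chain of DiskLocs, and the assertion fires when a link in the chain points outside any valid data file. This is a hypothetical, self-contained illustration; the struct and function names are not the actual mongod source.

    #include <cstdio>

    // Illustrative stand-in for mongod's DiskLoc: a data file number plus
    // a byte offset within that file.
    struct DiskLocSketch {
        int a;    // data file number
        int ofs;  // byte offset within the file
    };

    // Hypothetical sanity check in the spirit of the one that logs
    // "cur out of range": a chained DiskLoc must name a plausible file
    // and a non-negative offset, or the free list is corrupt.
    bool inRange(const DiskLocSketch& cur, int numFiles) {
        return cur.a >= 0 && cur.a < numFiles && cur.ofs >= 0;
    }

    int main() {
        // a:63560124 from the log below is far beyond any plausible
        // data file number, so the check fails.
        DiskLocSketch cur = { 63560124, 0x3c9d6b4 };
        if (!inRange(cur, /*numFiles=*/16))
            std::printf("~~ cur out of range: a:%d ofs:%x\n",
                        cur.a, static_cast<unsigned>(cur.ofs));
        return 0;
    }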

Tue Sep 18 15:43:52 [repl writer worker 2]  xxx.xxx ~~ Assertion - cur out of range in _alloc() 3c9d9bc:3c9d6b4 a:63560124 b:18 chain:1
                                           mongod.exe    ...\src\mongo\db\namespace_details.cpp(337)                      mongo::NamespaceDetails::__stdAlloc+0x344
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    ...\src\mongo\db\pdfile.cpp(1466)                                mongo::DataFileMgr::insert+0x693
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    ...\src\mongo\db\pdfile.cpp(1217)                                mongo::DataFileMgr::insertWithObjMod+0x48
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    ...\src\mongo\db\ops\update.cpp(458)                             mongo::_updateObjects+0x200b
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    ...\src\mongo\db\ops\update.cpp(487)                             mongo::updateObjects+0x288
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    ...\src\mongo\db\oplog.cpp(779)                                  mongo::applyOperation_inlock+0x4e9
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    ...\src\mongo\db\repl\rs_sync.cpp(84)                            mongo::replset::SyncTail::syncApply+0x245
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    ...\src\mongo\db\repl\rs_sync.cpp(116)                           mongo::replset::multiSyncApply+0x50
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    ...\src\mongo\util\concurrency\thread_pool.cpp(66)               mongo::threadpool::Worker::loop+0x94
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180)  boost::`anonymous namespace'::thread_start_function+0x21
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314)      _callthreadstartex+0x17
Tue Sep 18 15:43:52 [repl writer worker 2] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292)      _threadstartex+0x7f
Tue Sep 18 15:43:52 [repl writer worker 2] kernel32.dll                                                                   BaseThreadInitThunk+0xd
Tue Sep 18 15:43:52 [repl writer worker 2] *** unhandled exception (access violation) at 0x000000013F44FE39, terminating
Tue Sep 18 15:43:52 [repl writer worker 2] *** access violation was a write to 0x000000491AC9D90C
Tue Sep 18 15:43:52 [repl writer worker 2] *** stack trace for unhandled exception:
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\namespace_details.cpp(339)                      mongo::NamespaceDetails::__stdAlloc+0x359
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\pdfile.cpp(1466)                                mongo::DataFileMgr::insert+0x693
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\pdfile.cpp(1217)                                mongo::DataFileMgr::insertWithObjMod+0x48
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\ops\update.cpp(458)                             mongo::_updateObjects+0x200b
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\ops\update.cpp(487)                             mongo::updateObjects+0x288
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\oplog.cpp(779)                                  mongo::applyOperation_inlock+0x4e9
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\repl\rs_sync.cpp(84)                            mongo::replset::SyncTail::syncApply+0x245
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\repl\rs_sync.cpp(116)                           mongo::replset::multiSyncApply+0x50
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\util\concurrency\thread_pool.cpp(66)               mongo::threadpool::Worker::loop+0x94
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180)  boost::`anonymous namespace'::thread_start_function+0x21
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314)      _callthreadstartex+0x17
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292)      _threadstartex+0x7f
Tue Sep 18 15:43:53 [repl writer worker 2] kernel32.dll                                                                   BaseThreadInitThunk+0xd
Tue Sep 18 15:43:53 [repl writer worker 2] writing minidump diagnostic file mongo.dmp
Tue Sep 18 15:43:53 dbexit: unhandled exception
Tue Sep 18 15:43:53 [repl writer worker 2] shutdown: going to close listening sockets...
Tue Sep 18 15:43:53 [repl writer worker 2] closing listening socket: 524
Tue Sep 18 15:43:53 [repl writer worker 2] shutdown: going to flush diaglog...
Tue Sep 18 15:43:53 [repl writer worker 2] shutdown: going to close sockets...
Tue Sep 18 15:43:53 [repl writer worker 2] shutdown: waiting for fs preallocator...
Tue Sep 18 15:43:53 [repl writer worker 2] shutdown: lock for final commit...
Tue Sep 18 15:43:53 [repl writer worker 2]  xxx.xxx Assertion failure threadState() == 0 src\mongo\db\d_concurrency.cpp 145
Tue Sep 18 15:43:53 [rsBackgroundSync] Socket recv() errno:10053 An established connection was aborted by the software in your host machine. 10.28.212.72:27017
Tue Sep 18 15:43:53 [rsBackgroundSync] SocketException: remote: 10.28.212.72:27017 error: 9001 socket exception [1] server [10.28.212.72:27017] 
Tue Sep 18 15:43:53 [rsBackgroundSync] replSet db exception in producer: 10278 dbclient error communicating with server: 10.28.212.72:27017
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\util\stacktrace.cpp(161)       mongo::printStackTrace+0x3e
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\util\assert_util.cpp(109)      mongo::verifyFailed+0xdc
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\d_concurrency.cpp(145)      mongo::WrapperForQLock::lock_R_try+0x60
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\d_concurrency.cpp(479)      mongo::Lock::GlobalRead::GlobalRead+0x95
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\d_concurrency.cpp(796)      mongo::readlocktry::readlocktry+0x47
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\instance.cpp(977)           mongo::shutdownServer+0x1b4
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\instance.cpp(1060)          mongo::dbexit+0x264
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\db.cpp(1393)                mongo::exceptionFilter+0x1e4
Tue Sep 18 15:43:53 [repl writer worker 2] kernel32.dll                                               UnhandledExceptionFilter+0x160
Tue Sep 18 15:43:53 [repl writer worker 2] ntdll.dll                                                  RtlUserThreadStart+0xae
Tue Sep 18 15:43:53 [repl writer worker 2] ntdll.dll                                                  _C_specific_handler+0x14c
Tue Sep 18 15:43:53 [repl writer worker 2] ntdll.dll                                                  _chkstk+0x9d
Tue Sep 18 15:43:53 [repl writer worker 2] ntdll.dll                                                  RtlDestroyMemoryBlockLookaside+0x36b
Tue Sep 18 15:43:53 [repl writer worker 2] ntdll.dll                                                  KiUserExceptionDispatcher+0x2e
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\namespace_details.cpp(339)  mongo::NamespaceDetails::__stdAlloc+0x359
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\pdfile.cpp(1466)            mongo::DataFileMgr::insert+0x693
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\pdfile.cpp(1217)            mongo::DataFileMgr::insertWithObjMod+0x48
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\ops\update.cpp(458)         mongo::_updateObjects+0x200b
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\ops\update.cpp(487)         mongo::updateObjects+0x288
Tue Sep 18 15:43:53 [repl writer worker 2] mongod.exe    ...\src\mongo\db\oplog.cpp(779)              mongo::applyOperation_inlock+0x4e9
Tue Sep 18 15:43:53 shutdown failed with exception
Tue Sep 18 15:43:53 dbexit: really exiting now



 Comments   
Comment by Tad Marshall [ 06/Nov/12 ]

This issue appears to be caused by SERVER-7238, so I'm marking it as a duplicate of that ticket. The access violation itself will be addressed by SERVER-7077, but the underlying corruption that triggers it is tracked in SERVER-7238.

Comment by galam yum [ 15/Oct/12 ]

Hi,

Sorry for the delayed response. When I first noticed this issue I thought it might be caused by TTL, which is a useful feature for me. After reading all of the comments, I realized the problem might be more serious than I thought, so I cannot switch from 2.0.7 to 2.2.0 immediately.

I use MongoDB on Windows 7 and it works fine; I also ran a few tests against 2.2.0 and the results were still fine.

As you said, journaling cannot protect against this error, so I will have to wait a while.

Is there any workaround besides replication?

Regards

Comment by Tad Marshall [ 12/Oct/12 ]

galam, are you seeing this issue or something like it? If so, can you please provide information on what you are seeing?

The access violation is caused by some incorrect "error handling" code that tries to patch the bad information that led to the "Assertion - cur out of range" error.

The "Assertion - cur out of range" error is due to corruption of the database file. In at least one case, this was matched by an NTFS error logged by Windows. This may look something like this:

Log Name:      System
Source:        Ntfs
Date:          <some date>
Event ID:      57
Task Category: (2)
Level:         Warning
Keywords:      Classic
User:          N/A
Computer:      <some computer name>
Description:
The system failed to flush data to the transaction log. Corruption may occur.

Single-server durability requires that there not be errors in the disk subsystem (even with journaling), but we have not ruled out the possibility that there are other issues.

If you have any information to add here, please do so.

The incorrect error handling will be replaced with a Fatal Assertion (so, no access violation, but the server will still abort) in ticket SERVER-7077.
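
To illustrate the planned change in behavior (a rough sketch only, not the mongod implementation): patching a corrupt link keeps the server running on bad data, while a fatal assertion stops it immediately and avoids the later access violation.

    #include <cstdio>
    #include <cstdlib>

    // Hypothetical fatal-assertion helper in the spirit of a fassert:
    // log and abort rather than continue on corrupt data.
    void fatalAssert(bool cond, const char* msg) {
        if (!cond) {
            std::fprintf(stderr, "Fatal assertion: %s\n", msg);
            std::abort();  // immediate, clean stop; no access violation later
        }
    }

    int main() {
        bool linkInRange = false;  // what a corrupt deletedList looks like here
        // Old behavior (sketch): patch the bad link and keep going, which in
        // this ticket led to a write through a bogus pointer.
        // New behavior (sketch of SERVER-7077): stop the server instead.
        fatalAssert(linkInRange, "cur out of range in _alloc()");
        return 0;
    }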

Comment by galam yum [ 12/Oct/12 ]

Hi all,

Any progress? Since I run MongoDB on Windows, this issue has been my main concern recently.

Comment by Ben Becker [ 22/Sep/12 ]

Hi Gavin,

The restore process would not generate new ids. I've tried to reproduce with a TTL collection on v2.2, but no luck so far. Regarding the lushstories.messages.sent collection, it does seem to be the only namespace with a bad freelist entry, so dropping that collection and running mongorestore from a backup may help. A few questions to help us find the root cause:

  1. Did you receive any error messages from mongodump during this process, or any errors in the log while mongodump was running?
  2. Could you post all MapReduce jobs that are run against the lushstories.stories collection?
  3. Do you still have the original data files? Could you issue an fsync+lock command, create an archive of the data files (e.g. tar -cjf archive.tar.bz2 /path/to/dbfiles/*), unlock the database, and send the files to us? I will e-mail you the scp server information now.
  4. Could you check your system event logs for any messages related to the disk or memory management subsystems?

Just to note, the first failure to allocate in the logs was at Tue Sep 18 00:44:41, so the corruption in this case likely started just before this.

Thanks Gavin,
Ben

Comment by Gavin Sansom [ 22/Sep/12 ]

It also seems that we had data missing after the restore. I had lots of complaints that stories, messages, and even member accounts had disappeared. I had to roll back to the original data files, so we are back to square one.

Would it help to drop the entire messages.sent collection and then recover just that collection from a restore?

Comment by Gavin Sansom [ 22/Sep/12 ]

The backup and restore has not gone well. Although the database seems much better, I have lost all consistency around ids. I have collections and foreign keys that no longer match. A restore would not generate new ids, would it? I restored into a completely blank database.

Argh!

Comment by Tad Marshall [ 21/Sep/12 ]

Also, you should use the --repair option on the mongodump command, documented in the link Ben posted above.

Comment by Tad Marshall [ 21/Sep/12 ]

Yes, please post full logs, including everything before and after the failed repair.

The log snippet you posted above seems to suggest that the same corruption in the deleted record list was happening in the target database of your repair, i.e. the new database that was created to hold the clone of the original. This doesn't seem to make sense, but it appears to be what the log is saying, unless I'm reading it wrong.

Is this node part of a replica set, or is it a single node?

Comment by Gavin Sansom [ 21/Sep/12 ]

Hi Ben
I shall schedule some downtime and perform the dump/restore process.

This all started as soon as I enabled a TTL index on this collection; that is when the restarts began to occur.

I'm happy to give you the full logs if that's going to help resolve this.

Cheers................gav

Comment by Ben Becker [ 21/Sep/12 ]

Hi Gavin,

Probably the best way to get back up and running is to use mongodump and mongorestore to copy your database to a new dbpath (or server). I'd like to understand a bit more about what happened in your case. Would it be possible to attach the full logs from this instance and the original database files (all files, if possible)? Please let me know if you prefer not to attach this data to this ticket, or if the files are too large to attach. I'd be happy to provide a secure SCP server for the data file transfer.

If the mongodump/mongorestore process does not solve the issue, let us know and we can formulate other solutions. Note that it seems unlikely that downgrading to v2.0 will fix the current state of the database.

Best Regards,
Ben

Comment by Gavin Sansom [ 21/Sep/12 ]

I'm still getting mongod restarts every hour or so, and getting a few complaints from users about errors. I thought that removing the TTL index on messages.sent would help, but it didn't seem to.
Is there a workaround? Is it worth going back a version?

Comment by Ben Becker [ 20/Sep/12 ]

There appears to be another case of a non-durable write in pdfile.cpp L647, in the getEmptyLoc() function. Note that GETOFS() returns a reference to the DiskLoc's ofs member:

    void getEmptyLoc(const char *ns, const DiskLoc extentLoc, int extentLength, bool capped, /*out*/DiskLoc& emptyLoc, /*out*/int& delRecLength) { 
        emptyLoc = extentLoc;
        emptyLoc.inc( Extent::HeaderSize() );
        delRecLength = extentLength - Extent::HeaderSize();
        if( delRecLength >= 32*1024 && str::contains(ns, '$') && !capped ) { 
            // probably an index. so skip forward to keep its records page aligned 
            int& ofs = emptyLoc.GETOFS();
            int newOfs = (ofs + 0xfff) & ~0xfff; 
            delRecLength -= (newOfs-ofs);
            dassert( delRecLength > 0 );
            ofs = newOfs;
        }
    }

Strike that; this function is only called where emptyLoc is on the stack, not mmap'd.
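
For context, a hedged sketch of the durability point above: a write into a memory-mapped data file must be declared to the journal first, whereas a stack-local value needs no declaration. The JournalSketch type below is illustrative, not the mongod durability API.

    #include <cstdio>

    // Illustrative only -- not the mongod durability API. A journal must
    // see write intent for data that lives in an mmap'd file so the change
    // can be replayed after a crash.
    struct JournalSketch {
        int& writingInt(int& mmappedInt) {
            // a real journal would record the address and length here
            return mmappedInt;
        }
    };

    int main() {
        JournalSketch journal;
        int mappedOfs = 0;                       // pretend this is mmap'd
        journal.writingInt(mappedOfs) = 0x1000;  // journaled write path

        int stackOfs = 0;    // plain stack value, as in getEmptyLoc()
        stackOfs = 0x1000;   // no write intent needed -- hence the retraction
        std::printf("%x %x\n", static_cast<unsigned>(mappedOfs),
                    static_cast<unsigned>(stackOfs));
        return 0;
    }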

Comment by Gavin Sansom [ 19/Sep/12 ]

I think this originates from the repair:

Wed Sep 19 12:50:55 [conn8884] end connection 172.16.2.131:49433 (3477 connections now open)
Wed Sep 19 12:50:56 [conn2612] lushstories ~~ Assertion - cur out of range in _alloc() 54426766:554c3671 a:1413637990 b:18 chain:1
mongod.exe ...\src\mongo\db\namespace_details.cpp(337) mongo::NamespaceDetails::__stdAlloc+0x344
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\pdfile.cpp(1466) mongo::DataFileMgr::insert+0x693
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\pdfile.cpp(1217) mongo::DataFileMgr::insertWithObjMod+0x48
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\cloner.cpp(165) mongo::Cloner::Fun::operator()+0x714
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\client\dbclient.cpp(826) mongo::DBClientBase::query+0x102
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\cloner.cpp(217) mongo::Cloner::copy+0x429
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\cloner.cpp(447) mongo::Cloner::go+0x1891
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\cloner.cpp(298) mongo::Cloner::go+0x138
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\pdfile.cpp(1796) mongo::repairDatabase+0x6d8
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\dbcommands.cpp(426) mongo::CmdRepairDatabase::run+0x23e
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\dbcommands.cpp(1859) mongo::_execCommand+0x69
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\dbcommands.cpp(2017) mongo::execCommand+0xc3d
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\dbcommands.cpp(2079) mongo::_runCommands+0x3d7
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\ops\query.cpp(43) mongo::runCommands+0x46
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\ops\query.cpp(919) mongo::runQuery+0x414
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\instance.cpp(244) mongo::receivedQuery+0x16d
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\instance.cpp(390) mongo::assembleResponse+0x2ed
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\db.cpp(192) mongo::MyMessageHandler::process+0xf5
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\util\net\message_server_port.cpp(86) mongo::pms::threadRun+0x59a
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180) boost::`anonymous namespace'::thread_start_function+0x21
Wed Sep 19 12:50:56 [conn2612] lushstories Assertion failure bmr->extentOfs() < bestmatch.getOfs() src\mongo\db\namespace_details.cpp 388
Wed Sep 19 12:50:56 [initandlisten] connection accepted from 172.16.2.131:49434 #8885 (3478 connections now open)
Wed Sep 19 12:50:56 [conn8885] end connection 172.16.2.131:49434 (3477 connections now open)
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\util\stacktrace.cpp(161) mongo::printStackTrace+0x3e
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\util\assert_util.cpp(109) mongo::verifyFailed+0xdc
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\namespace_details.cpp(388) mongo::NamespaceDetails::__stdAlloc+0x51e
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\namespace_details.cpp(279) mongo::NamespaceDetails::alloc+0x53
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\pdfile.cpp(1292) mongo::allocateSpaceForANewRecord+0x48
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\pdfile.cpp(1496) mongo::DataFileMgr::insert+0x839
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\pdfile.cpp(1217) mongo::DataFileMgr::insertWithObjMod+0x48
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\cloner.cpp(165) mongo::Cloner::Fun::operator()+0x714
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\client\dbclient.cpp(826) mongo::DBClientBase::query+0x102
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\cloner.cpp(217) mongo::Cloner::copy+0x429
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\cloner.cpp(447) mongo::Cloner::go+0x1891
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\cloner.cpp(298) mongo::Cloner::go+0x138
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\pdfile.cpp(1796) mongo::repairDatabase+0x6d8
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\dbcommands.cpp(426) mongo::CmdRepairDatabase::run+0x23e
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\dbcommands.cpp(1859) mongo::_execCommand+0x69
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\dbcommands.cpp(2017) mongo::execCommand+0xc3d
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\dbcommands.cpp(2079) mongo::_runCommands+0x3d7
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\ops\query.cpp(43) mongo::runCommands+0x46
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\ops\query.cpp(919) mongo::runQuery+0x414
Wed Sep 19 12:50:56 [conn2612] mongod.exe ...\src\mongo\db\instance.cpp(244) mongo::receivedQuery+0x16d
Wed Sep 19 12:50:56 [conn2612] killcursors keyUpdates:0 15ms

I'm seeing hundreds of these:

Wed Sep 19 12:50:52 [conn2612] warning: ClientCursor::yield can't unlock b/c of recursive lock ns: lushstories.messages.sent top: { opid: 144619, active: true, secs_running: 2042, op: "query", ns: "lushstories", query: { repairDatabase: 1.0 }, client: "127.0.0.1:49253", desc: "conn2612", connectionId: 2612, locks: { ^: "W" }, waitingForLock: false, msg: "index: (3/3) btree-middle", numYields: 0, lockStats: { timeLockedMicros: {}, timeAcquiringMicros: { R: 0, W: 1546 } } }

The site seems quite unstable at the moment; I'm seeing lots of these:

Wed Sep 19 13:17:13 [conn1186] SocketException handling request, closing client connection: 9001 socket exception [2] server [172.16.2.131:52465]
Wed Sep 19 13:17:14 [conn186] 68800/133379 51%

Comment by Tad Marshall [ 19/Sep/12 ]

There should be information in the server log that could indicate what the problem is.

This is line 388 in src/mongo/db/namespace_details.cpp:

    verify(bmr->extentOfs() < bestmatch.getOfs());

This is checking to make sure that the deleted record it is trying to use lies within the disk extent that this record claims to be in. Can you see if the log has additional information?
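
As a concrete illustration of what that check enforces (hypothetical values, not from the mongod source): the extent containing a record must begin before the record's own offset.

    #include <cassert>

    // Sketch: a deleted record's header claims the offset where its
    // containing extent begins. For a sane record that extent must start
    // before the record itself, i.e. extentOfs < recordOfs.
    struct DeletedRecordSketch {
        int extentOfs;  // claimed start of the containing extent
    };

    int main() {
        DeletedRecordSketch bmr = { 0x1000 };  // extent begins at 0x1000
        int bestmatchOfs = 0x1200;             // record lies inside it: OK
        assert(bmr.extentOfs < bestmatchOfs);  // mirrors the verify() above

        // With a corrupt header the claimed extent start can exceed the
        // record's own offset, and the verify() fails, as seen in the
        // repairDatabase error in this ticket.
        return 0;
    }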

Comment by Gavin Sansom [ 19/Sep/12 ]

Thanks for the info. I tried running repair; after some time it came back with:

> db.repairDatabase();
{
    "errmsg" : "exception: assertion src\\mongo\\db\\namespace_details.cpp:388",
    "code" : 0,
    "ok" : 0
}

Comment by Tad Marshall [ 19/Sep/12 ]

Thanks Gavin, this helps a lot.

As the output from validate says, this collection has a bad link in its chain of deleted records. The "bad deleted loc" listed under "errors" in the validate output shows the exact bad DiskLoc (disk location) that is causing the assertion. You have over four million deleted records, but although their total size is about 3.5 GB, none of them was large enough to satisfy the request for space for the insert, which is why the code was looking in the list for the largest deleted records (bucket 18). Since this list contains a bad record, you will not be able to insert large documents into this collection.
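
Roughly how the buckets work (a hedged sketch; the exact size classes below are assumptions, not the real table): size classes double from small records up to a final catch-all bucket for the largest ones, which is the b:18 in the assertion message.

    #include <cstdio>

    // Hedged sketch of bucketed free lists; the sizes here are assumptions.
    // Deleted records are grouped by size class so an allocation only
    // scans plausible candidates.
    int bucketFor(int recordSize) {
        int bucket = 0;
        int size = 32;                    // assumed smallest size class
        while (bucket < 18 && size < recordSize) {
            size <<= 1;                   // size classes double
            ++bucket;
        }
        return bucket;                    // 18 = catch-all, the b:18 in the log
    }

    int main() {
        std::printf("1 KB record  -> bucket %d\n", bucketFor(1024));
        std::printf("10 MB record -> bucket %d\n", bucketFor(10 * 1024 * 1024));
        return 0;
    }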

Your best bet is to either run repair on the database or, if this node is part of a working replica set, re-sync this node from one of the others that is up-to-date. Repair requires that you have enough free disk space to recreate the entire database, so you should have 10 GB free (storageSize from your db.messages.sent.stats() output above) to be safe.

We'll keep looking into the cause of this issue and thank you for all the answers!

Comment by Gavin Sansom [ 19/Sep/12 ]

1) What is your mongod.exe server running on? Physical hardware, Amazon Web Services, Windows Azure or some other virtualized environment?

It is running on a dedicated server: Intel Xeon X3440 quad core, 2.53 GHz, 8 MB cache, 2.5 GT/s DMI, HyperThreading, 16 GB RAM, RAID-10.

2) How large is the database containing the collection and how large is the collection? Can you post db.stats() for the database and db.<collection>.stats() for the collection (using the mongo shell), either here or in the https://jira.mongodb.org/browse/SERVER-7068 Jira ticket?

> db.messages.sent.stats()
{
    "ns" : "lushstories.messages.sent",
    "count" : 488694,
    "size" : 7198970204,
    "avgObjSize" : 14731.038654045271,
    "storageSize" : 10132606976,
    "numExtents" : 35,
    "nindexes" : 3,
    "lastExtentSize" : 1868324864,
    "paddingFactor" : 1.0000000000165334,
    "systemFlags" : 1,
    "userFlags" : 1,
    "totalIndexSize" : 68907328,
    "indexSizes" : {
        "_id_" : 22770160,
        "mi_1_dc_-1" : 33055568,
        "dc_1" : 13081600
    },
    "ok" : 1
}

3) Could you post db.<collection>.validate() as well? If your collection is not too large, it would be even better to see db.<collection>.validate(true) output, although this will take a long time on a large collection. The output from validate includes some information about the contents of the deletedList that could be informative.

> db.messages.sent.validate()
{
    "ns" : "lushstories.messages.sent",
    "firstExtent" : "3:7ba1000 ns:lushstories.messages.sent",
    "lastExtent" : "10:2000 ns:lushstories.messages.sent",
    "extentCount" : 35,
    "datasize" : 7198973244,
    "nrecords" : 488699,
    "lastExtentSize" : 1868324864,
    "padding" : 1.0000000000165334,
    "firstExtentDetails" : {
        "loc" : "3:7ba1000",
        "xnext" : "3:7bc3000",
        "xprev" : "null",
        "nsdiag" : "lushstories.messages.sent",
        "size" : 8192,
        "firstRecord" : "null",
        "lastRecord" : "null"
    },
    "deletedCount" : 4093688,
    "deletedSize" : 3531009648,
    "nIndexes" : 3,
    "keysPerIndex" : {
        "lushstories.messages.sent.$_id_" : 488699,
        "lushstories.messages.sent.$mi_1_dc_-1" : 488699,
        "lushstories.messages.sent.$dc_1" : 488699
    },
    "valid" : false,
    "errors" : [
        "bad deleted loc: 73472f62:3258564b bucket:18 k:1"
    ],
    "warning" : "Some checks omitted for speed. use {full:true} option to do more thorough scan.",
    "advice" : "ns corrupt, requires repair",
    "ok" : 1
}

4) The stack trace shows that you were inserting a document; do you know how large that document might have been? The "b:18" bit in the assertion says that the list we were looking at is for deleted records that are 8 MB or larger, but the code will try larger block sizes if there are no free smaller ones. I'm trying to figure out what code path might have brought us here and knowing the record size might be a clue.

It's hard to say. This collection is basically a sent-items folder for a member's messages. Most will be very small (under 1 KB), but people attaching images have them embedded into the message, so those can get quite large. Sorry, not much help.

It's happening quite frequently, so if you need any more information on this, let me know.

Comment by Tad Marshall [ 19/Sep/12 ]

This trace is from another report: see https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/Hl_4qwnER3g

I'm running MongoDB 2.2 on Windows Server 2008 R2. I've recently enabled TTL collections and now the server is crashing quite consistently. I have included the trace from the log for a recent failure.

This is not a capped collection; it's a TTL collection (lushstories.messages.sent).

There is just one server.

Tue Sep 18 21:04:31 [conn82839]  lushstories.messages.sent ~~ Assertion - cur out of range in _alloc() 73472f62:3258564b a:1934045026 b:18 chain:1
                                mongod.exe    ...\src\mongo\db\namespace_details.cpp(337)                      mongo::NamespaceDetails::__stdAlloc+0x344
Tue Sep 18 21:04:31 [conn82839] mongod.exe    ...\src\mongo\db\pdfile.cpp(1466)                                mongo::DataFileMgr::insert+0x693
Tue Sep 18 21:04:31 [conn82839] mongod.exe    ...\src\mongo\db\pdfile.cpp(1217)                                mongo::DataFileMgr::insertWithObjMod+0x48
Tue Sep 18 21:04:31 [conn82839] mongod.exe    ...\src\mongo\db\instance.cpp(761)                               mongo::checkAndInsert+0xb7
Tue Sep 18 21:04:31 [conn82839] mongod.exe    ...\src\mongo\db\instance.cpp(821)                               mongo::receivedInsert+0x790
Tue Sep 18 21:04:31 [conn82839] mongod.exe    ...\src\mongo\db\instance.cpp(434)                               mongo::assembleResponse+0x5ff
Tue Sep 18 21:04:31 [conn82839] mongod.exe    ...\src\mongo\db\db.cpp(192)                                     mongo::MyMessageHandler::process+0xf5
Tue Sep 18 21:04:31 [conn82839] mongod.exe    ...\src\mongo\util\net\message_server_port.cpp(86)               mongo::pms::threadRun+0x59a
Tue Sep 18 21:04:31 [conn82839] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180)  boost::`anonymous namespace'::thread_start_function+0x21
Tue Sep 18 21:04:31 [conn82839] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314)      _callthreadstartex+0x17
Tue Sep 18 21:04:31 [conn82839] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292)      _threadstartex+0x7f
Tue Sep 18 21:04:31 [conn82839] kernel32.dll                                                                   BaseThreadInitThunk+0xd
Tue Sep 18 21:04:31 [conn82839] *** unhandled exception (access violation) at 0x000000013F53FE39, terminating
Tue Sep 18 21:04:31 [conn82839] *** access violation was a write to 0x000000590936B610
Tue Sep 18 21:04:31 [conn82839] *** stack trace for unhandled exception:
Tue Sep 18 21:04:32 [conn82839] mongod.exe    ...\src\mongo\db\namespace_details.cpp(339)                      mongo::NamespaceDetails::__stdAlloc+0x359
Tue Sep 18 21:04:32 [conn82839] mongod.exe    ...\src\mongo\db\pdfile.cpp(1466)                                mongo::DataFileMgr::insert+0x693
Tue Sep 18 21:04:32 [conn82839] mongod.exe    ...\src\mongo\db\pdfile.cpp(1217)                                mongo::DataFileMgr::insertWithObjMod+0x48
Tue Sep 18 21:04:32 [conn82839] mongod.exe    ...\src\mongo\db\instance.cpp(761)                               mongo::checkAndInsert+0xb7
Tue Sep 18 21:04:32 [conn82839] mongod.exe    ...\src\mongo\db\instance.cpp(821)                               mongo::receivedInsert+0x790
Tue Sep 18 21:04:32 [conn82839] mongod.exe    ...\src\mongo\db\instance.cpp(434)                               mongo::assembleResponse+0x5ff
Tue Sep 18 21:04:32 [conn82839] mongod.exe    ...\src\mongo\db\db.cpp(192)                                     mongo::MyMessageHandler::process+0xf5
Tue Sep 18 21:04:32 [conn82839] mongod.exe    ...\src\mongo\util\net\message_server_port.cpp(86)               mongo::pms::threadRun+0x59a
Tue Sep 18 21:04:32 [conn82839] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180)  boost::`anonymous namespace'::thread_start_function+0x21
Tue Sep 18 21:04:32 [conn82839] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314)      _callthreadstartex+0x17
Tue Sep 18 21:04:32 [conn82839] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292)      _threadstartex+0x7f
Tue Sep 18 21:04:32 [conn82839] kernel32.dll                                                                   BaseThreadInitThunk+0xd
Tue Sep 18 21:04:32 [conn82839] writing minidump diagnostic file mongo.dmp
Tue Sep 18 21:04:32 [initandlisten] connection accepted from 172.16.2.131:60651 #85207 (229 connections now open)
Tue Sep 18 21:04:32 dbexit: unhandled exception
Tue Sep 18 21:04:32 [conn82839] shutdown: going to close listening sockets...
Tue Sep 18 21:04:32 [conn82839] closing listening socket: 404
Tue Sep 18 21:04:32 [conn82839] closing listening socket: 412
Tue Sep 18 21:04:32 [conn82839] shutdown: going to flush diaglog...
Tue Sep 18 21:04:32 [conn82839] shutdown: going to close sockets...
Tue Sep 18 21:04:32 [conn82839] shutdown: waiting for fs preallocator...
Tue Sep 18 21:04:32 [conn82839] shutdown: lock for final commit...

The value being displayed for "cur", "73472f62:3258564b", could be the ASCII/UTF-8 text "b/GsKVX2" (same bit pattern).
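
That reading checks out: interpreting the two 32-bit values as little-endian bytes yields exactly that string. A self-contained demonstration:

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main() {
        // The corrupt DiskLoc from the log: file number and offset words.
        uint32_t vals[2] = { 0x73472f62, 0x3258564b };
        char text[9] = { 0 };
        std::memcpy(text, vals, 8);  // reinterpret as raw little-endian bytes
        std::printf("%s\n", text);   // prints "b/GsKVX2" on x86 (little-endian)
        return 0;
    }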
