[SERVER-9500] Operations hang waitingForLock for hours with no yields Created: 29/Apr/13  Updated: 17/Mar/14  Resolved: 17/Mar/14

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.4.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: guy pitelko Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

MongoDB hosted on an Azure-based Windows Server 2012 instance; queries use the C# driver, map-reduce is run via a remote mongo.exe


Attachments: PNG File MMS 1.PNG     PNG File MMS 2.PNG     File MR_anon.js     Text File currentOp anon.txt     Text File serverStatus 11.52.txt    
Operating System: Windows
Steps To Reproduce:

We can't reproduce this outside of our production environment, but it seems to happen each time the long MR operation is executed.

Participants:

 Description   

While running a long mapReduce (on a collection of ~2M records, 8 GB on disk), mongod enters a catatonic state: the map-reduce operation stops using local resources (no CPU/disk activity) but never completes. CPU/disk usage was high when the MR started.

db.currentOp() shows 129 ops running, all waiting for a global write lock, never completing and never yielding. Other queries still work, but these ops are "stuck" and will never complete until the mongod service is restarted. The op with the longest run time is the map-reduce op; the rest are unrelated operations against the DB from other applications.
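
For reference, a minimal mongo shell sketch of the kind of filter used to isolate these waiting ops (field names as they appear in the 2.4 db.currentOp() output; illustrative only, not the exact script we ran):

// Print only the in-progress ops that are blocked waiting for a lock,
// with the fields relevant to this report (runtime and yield count).
db.currentOp().inprog
    .filter(function (op) { return op.waitingForLock; })
    .forEach(function (op) {
        print(op.opid + "  " + op.op + "  " + op.ns +
              "  secs_running=" + op.secs_running +
              "  numYields=" + op.numYields);
    });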

Notes:

  • I waited 20 minutes, ran db.currentOp() again and diffed the two outputs: no field changed except 'secs_running'. Notably, 'numYields' does not change.
  • The server log shows nothing in this state (only MMS connections). There are no errors around the time the MR appears to have stopped working; the last entry relating to the MR is: Mon Apr 29 08:14:36.921 [conn11315] M/R: (1/3) Emit Progress: 52700/87827 60%

Attached are:

  • db.currentOp() output (anonymized with ***).
  • db.serverStatus() output
  • Screenshot of MMS
  • mapReduce code (somewhat obfuscated)


 Comments   
Comment by Michael Grundy [ 17/Mar/14 ]

Duplicate of SERVER-3531

Comment by Michael Grundy [ 17/Mar/14 ]

Hi Guy -

It looks like this ticket was lost in the shuffle. My apologies for that. This looks like the problem reported in SERVER-3531: M/R failing to yield while not finding any matching documents. It was fixed in release 2.4.5. Here is the dump analysis:

Ran windbg against the minidump files:

.\x64\windbg.exe -y Z:\mg\debugsyms\mongodb-win32-x86_64-2.4.1\bin\ -logo Z:\mg\Downloads\Procdump\mongod_130506_151818.dmp.log -z Z:\mg\Downloads\Procdump\mongod_130506_151818.dmp -c ".symopt- 100;!peb;!analyze -v;!runaway; ~* kP; q" 

From the !runaway command we see that thread 13 is at the top of the list:

 User Mode Time
  Thread       Time
  13:1294      0 days 0:11:25.765

Investigating thread 13's stack trace, we can see it is in the problematic while loop from SERVER-3531:

  13  Id: 15c8.1294 Suspend: 0 Teb: 000007f6`757b6000 Unfrozen
Child-SP          RetAddr           Call Site
00000000`05c0d1f8 000007fc`710712d2 ntdll!NtWaitForMultipleObjects+0xa
00000000`05c0d200 000007fc`71521292 KERNELBASE!WaitForMultipleObjectsEx+0xe5
00000000`05c0d4e0 000007f6`763c290a kernel32!WaitForMultipleObjects+0x12
00000000`05c0d520 000007f6`763c6d59 mongod!boost::modified_shared_mutex::timed_lock(
			class boost::posix_time::ptime * wait_until = 0x00000000`05c0d638)+0xfa [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\util\concurrency\shared_mutex_win.hpp @ 333]
00000000`05c0d560 000007f6`763c7010 mongod!mongo::Lock::DBWrite::lockOther(
			class mongo::StringData * db = 0x00000000`05c0d740)+0x319 [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\d_concurrency.cpp @ 540]
00000000`05c0d610 000007f6`763a5e59 mongod!mongo::Lock::DBWrite::lockDB(
			class std::basic_string<char,std::char_traits<char>,std::allocator<char> > * ns = 0x00000000`05c0d790)+0x1e0 [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\d_concurrency.cpp @ 576]
00000000`05c0d690 000007f6`763a67e4 mongod!mongo::mr::State::reduceInMemory(void)+0x1c9 [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\commands\mr.cpp @ 946]
00000000`05c0d860 000007f6`763a7d4e mongod!mongo::mr::State::checkSize(void)+0x374 [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\commands\mr.cpp @ 1036]
00000000`05c0d910 000007f6`763e4cbe mongod!mongo::mr::MapReduceCommand::run(
			class std::basic_string<char,std::char_traits<char>,std::allocator<char> > * dbname = 0x49249249`24924925, 
			class mongo::BSONObj * cmd = 0x00000000`2e8cbbf0, 
			int __formal = 0n-1, 
			class std::basic_string<char,std::char_traits<char>,std::allocator<char> > * errmsg = 0x00000000`05c0e898, 
			class mongo::BSONObjBuilder * result = 0x00000000`05c0f0c0, 
			bool fromRepl = false)+0x105e [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\commands\mr.cpp @ 1226]
00000000`05c0e480 000007f6`763e5616 mongod!mongo::_execCommand(
			class mongo::Command * c = 0x000002e1`9bf1331e, 
			class std::basic_string<char,std::char_traits<char>,std::allocator<char> > * dbname = 0x00000000`00d11a90, 
			class mongo::BSONObj * cmdObj = 0x00000000`05c0e740, 
			int queryOptions = 0n13703824, 
			class std::basic_string<char,std::char_traits<char>,std::allocator<char> > * errmsg = 0x00000000`05c0e898, 
			class mongo::BSONObjBuilder * result = 0x00000000`05c0f0c0, 
			bool fromRepl = false)+0x5e [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\dbcommands.cpp @ 1898]
00000000`05c0e640 000007f6`763e5d8d mongod!mongo::Command::execCommand(
			class mongo::Command * c = 0x00000000`05c0ec30, 
			class mongo::Client * client = 0x00000000`00d1a3a0, 
			int queryOptions = 0n96530000, 
			char * cmdns = 0x00000000`2b011030 "--- memory read error at address 0x00000000`2b011030 ---", 
			class mongo::BSONObj * cmdObj = 0x00000000`05c0eb78, 
			class mongo::BSONObjBuilder * result = 0x00000000`05c0f0c0, 
			bool fromRepl = false)+0x936 [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\dbcommands.cpp @ 2007]
00000000`05c0eb30 000007f6`764fcc56 mongod!mongo::_runCommands(
			char * ns = 0x00000000`05c0f800 "51.186", 
			class mongo::BSONObj * _cmdobj = 0xffffffff`ffffffff, 
			class mongo::_BufBuilder<mongo::TrivialAllocator> * b = 0x00000000`00000204, 
			class mongo::BSONObjBuilder * anObjBuilder = 0x000007f6`76b55aab, 
			bool fromRepl = false, 
			int queryOptions = 0n0)+0x3fd [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\dbcommands.cpp @ 2099]
00000000`05c0ed40 000007f6`76501b7f mongod!mongo::runCommands(
			char * ns = 0x00000000`00000000 "", 
			class mongo::BSONObj * jsobj = 0x00000000`00000000, 
			class mongo::CurOp * curop = 0x00000000`098c1ce0, 
			class mongo::_BufBuilder<mongo::TrivialAllocator> * b = 0x00000000`05c0ef70, 
			class mongo::BSONObjBuilder * anObjBuilder = 0x00000000`05c0f0c0, 
			bool fromRepl = false, 
			int queryOptions = 0n0)+0x46 [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\ops\query.cpp @ 46]
00000000`05c0edd0 000007f6`764a1e9d mongod!mongo::runQuery(
			class mongo::Message * m = 0x00000000`00000000, 
			class mongo::QueryMessage * q = 0x000007fc`73d67a22, 
			class mongo::CurOp * curop = 0x00000000`00000000, 
			class mongo::Message * result = 0x00000000`114d5da0)+0x43f [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\ops\query.cpp @ 952]
00000000`05c0f390 000007f6`764a2c70 mongod!mongo::receivedQuery(
			class mongo::Client * c = 0x00000000`00000000, 
			struct mongo::DbResponse * dbresponse = 0x00000000`00000000, 
			class mongo::Message * m = 0x00000000`00000000)+0x35d [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\instance.cpp @ 262]
00000000`05c0f870 000007f6`76b1fffa mongod!mongo::assembleResponse(
			class mongo::Message * m = 0x00000000`02c29e20, 
			struct mongo::DbResponse * dbresponse = 0x00000000`026d1630, 
			struct mongo::HostAndPort * remote = 0x00000000`02c29e20)+0x2f0 [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\instance.cpp @ 401]
00000000`05c0f9e0 000007f6`7672462a mongod!mongo::MyMessageHandler::process(
			class mongo::Message * m = 0x00000000`00000000, 
			class mongo::AbstractMessagingPort * port = 0x00000000`02c29e20, 
			struct mongo::LastError * le = 0x00000000`00000000)+0xfa [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\db.cpp @ 194]
00000000`05c0fb30 000007f6`76777a31 mongod!mongo::PortMessageServer::handleIncomingMsg(
			void * arg = 0x00000000`00000000)+0x56a [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\util\net\message_server_port.cpp @ 207]
00000000`05c0fd40 000007f6`76b5f4cb mongod!boost::`anonymous namespace'::thread_start_function(
			void * param = 0x00000000`00000000)+0x21 [d:\slave\windows_64bit_v2.4.1\mongo\src\third_party\boost\libs\thread\src\win32\thread.cpp @ 180]
00000000`05c0fd80 000007f6`76b5f55f mongod!_callthreadstartex(void)+0x17 [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c @ 314]
00000000`05c0fdb0 000007fc`7152167e mongod!_threadstartex(
			void * ptd = 0x00000000`4b5c1300)+0x7f [f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c @ 292]
00000000`05c0fde0 000007fc`73d83501 kernel32!BaseThreadInitThunk+0x1a
00000000`05c0fe10 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
 

Specifically, this frame:

00000000`05c0d910 000007f6`763e4cbe mongod!mongo::mr::MapReduceCommand::run(
			class std::basic_string<char,std::char_traits<char>,std::allocator<char> > * dbname = 0x49249249`24924925, 
			class mongo::BSONObj * cmd = 0x00000000`2e8cbbf0, 
			int __formal = 0n-1, 
			class std::basic_string<char,std::char_traits<char>,std::allocator<char> > * errmsg = 0x00000000`05c0e898, 
			class mongo::BSONObjBuilder * result = 0x00000000`05c0f0c0, 
			bool fromRepl = false)+0x105e [d:\slave\windows_64bit_v2.4.1\mongo\src\mongo\db\commands\mr.cpp @ 1226]

Comment by guy pitelko [ 06/May/13 ]

Hi Michael,

It's not the issue you mentioned - we have an app that performs large queries every hour, which loads data from disk.
Also, the server hasn't been restarted yet, and the same "hung" stats remain (cursors/queues).
We are using an Azure VM (2 cores, 3.5 GB).
There isn't really a difference between prod and dev; we always work with the same DB, but the map-reduce that causes this runs on large real-life data. When it is run on smaller samples of the data, the problem doesn't appear.

Minidumps sent separately.

Comment by Michael Grundy [ 03/May/13 ]

Hi Guy -

A couple of questions: are you running on an Azure VM or PaaS? What are the differences between your prod and dev systems? The page faults in that MMS screenshot have some pretty large spikes; I'm wondering if this might be related to SERVER-9458.

Ultimately, we are going to need more information. In addition to what I've requested above, please capture some minidumps of the system in this state. If you don't already have the procdump.exe utility installed, you can find it at http://technet.microsoft.com/en-us/sysinternals/dd996900.aspx . I'd like to try to get miniplus dumps:

procdump -mp -s 5 -n 3 <mongod pid>

but straight minidumps might be a more reasonable size:

procdump -s 5 -n 3 <mongod pid>

If the dumps are too large to attach to the ticket, I'll provide instructions on where to upload larger files. The straight minidumps should be small enough.

Thanks!
Mike

Comment by guy pitelko [ 03/May/13 ]

No comment from the dev team at all, except marking the issue as "debugging with submitter"?

Comment by guy pitelko [ 30/Apr/13 ]

Any help on this issue?

Comment by guy pitelko [ 29/Apr/13 ]

More info:
It seems that after multiple queries against the DB (from other apps), the MR op has budged a little - the 'numYields' param moved from 566 to 575 - but now there is again no CPU/disk usage and the ops don't seem to be moving.
It's as if the locks are starving each other, and adding pressure on the system makes them move a bit.

The server is still in this state, so let me know if there are any other stats I can get.
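
If it helps, here is a rough shell sketch of how the op can be polled while the server stays in this state (the opid below is a placeholder; illustrative only):

// Poll a specific op once a minute and print its runtime and yield count.
// Replace 12345 with the real opid of the map-reduce op from db.currentOp().
var mrOpId = 12345;
while (true) {
    var op = db.currentOp().inprog.filter(function (o) { return o.opid === mrOpId; })[0];
    if (!op) { print("op " + mrOpId + " is no longer running"); break; }
    print(new Date() + "  secs_running=" + op.secs_running + "  numYields=" + op.numYields);
    sleep(60 * 1000);
}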
