[SERVER-9500] Operations hang waitingForLock for hours with no yields Created: 29/Apr/13 Updated: 17/Mar/14 Resolved: 17/Mar/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | guy pitelko | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | MongoDB hosted on an Azure-based Windows Server 2012 machine; queries via the C# driver; map-reduce run from a remote mongo.exe |
| Attachments: |
|
| Operating System: | Windows |
| Steps To Reproduce: | Can't reproduce outside of our production env, but this seems to happen each time the long MR operation is executed. |
| Participants: |
| Description |
|
While running a long mapReduce (on a collection of ~2M records, 8 GB on disk), mongod enters a catatonic state: the map-reduce operation stops using local resources (no CPU or disk activity, although both were heavily used when the MR started) but never completes. db.currentOp() shows 129 operations in progress, all waiting for the global write lock, never completing and never yielding (see the currentOp sketch after this description). Other queries work, but these operations are stuck and will not complete until the mongod service is restarted. The first op, with the longest run time, is the map op; the rest are unrelated operations against the DB from other apps. Notes:
Attached are:
|
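The description references db.currentOp(); for readers following along, here is a minimal mongo-shell sketch (not part of the original report) that isolates the operations stuck behind the lock. It assumes the 2.4-era currentOp() field names (waitingForLock, secs_running, opid):

```javascript
// Sketch only: list in-progress ops that are blocked waiting for a lock,
// longest-running first. Field names assume MongoDB 2.4's currentOp() output.
var blocked = db.currentOp(true).inprog.filter(function (op) {
    return op.waitingForLock;
});
blocked.sort(function (a, b) {
    return (b.secs_running || 0) - (a.secs_running || 0);
});
blocked.forEach(function (op) {
    print(op.opid + "\t" + op.op + "\t" + (op.secs_running || 0) + "s\t" + (op.ns || ""));
});
```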
| Comments |
| Comment by Michael Grundy [ 17/Mar/14 ] |
|
Duplicate of |
| Comment by Michael Grundy [ 17/Mar/14 ] |
|
Hi Guy - It looks like this ticket was lost in the shuffle; my apologies for that. This looks like the problem reported in another ticket. I ran windbg against the minidump files:
From the !runaway command we see that thread 13 is at the top of the list:
Investigating thread 13's stack trace, we can see it is in the problematic while loop from that ticket.
Specifically, this frame:
|
| Comment by guy pitelko [ 06/May/13 ] |
|
Hi Michael, it's not the above issue - we have an app that performs large queries every hour, which loads data from disk. Minidumps sent separately. |
| Comment by Michael Grundy [ 03/May/13 ] |
|
Hi Guy - A couple of questions. Are you running on an Azure VM or PaaS? What are the differences between your prod and dev systems? The page faults in that MMS screenshot have some pretty large spikes, and I'm wondering if this might be related to a previously reported issue. Ultimately, we are going to need more information. In addition to what I've requested above, please capture some minidumps of the system in this state. If you don't already have the procdump.exe utility installed, you can find it at http://technet.microsoft.com/en-us/sysinternals/dd996900.aspx . I'd like to try to get miniplus dumps (procdump -mp -s 5 -n 3 <mongod pid>), but straight minidumps might be a more reasonable size (procdump -s 5 -n 3 <mongod pid>). If the dumps are too large to attach to the ticket, I'll provide instructions on where to upload larger files; the straight minidumps should be small enough. Thanks! |
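As an aside (not part of the original exchange), the global-lock queue described in this ticket can also be snapshotted from the shell. A minimal sketch, assuming MongoDB 2.4's serverStatus() layout (globalLock.currentQueue / globalLock.activeClients):

```javascript
// Sketch only: report how many operations are queued on the global lock.
// Field names assume the 2.4-era serverStatus() document.
var s = db.serverStatus();
print("queued readers: " + s.globalLock.currentQueue.readers);
print("queued writers: " + s.globalLock.currentQueue.writers);
print("active readers: " + s.globalLock.activeClients.readers);
print("active writers: " + s.globalLock.activeClients.writers);
```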
| Comment by guy pitelko [ 03/May/13 ] |
|
No comment from the dev team at all, except marking the issue as "debugging with submitter"? |
| Comment by guy pitelko [ 30/Apr/13 ] |
|
Any help on this issue? |
| Comment by guy pitelko [ 29/Apr/13 ] |
|
More info: The server is still in this state, so let me know if there are any other stats I can get. |
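For completeness, a hedged sketch of the kind of additional stats that could be captured from a remote mongo.exe shell while the server is still in this state (output redirected to a file and attached to the ticket); the section names assume the 2.4-era serverStatus() document, and the file names are only examples:

```javascript
// Sketch only: one-shot diagnostic snapshot to run against the wedged server.
// Redirect the shell's output to a file, e.g.:
//   mongo.exe host:27017/admin snapshot.js > snapshot.txt
printjson(db.currentOp(true));                    // every in-progress / idle op
printjson(db.serverStatus().globalLock);          // lock queue and active clients
printjson(db.serverStatus().mem);                 // resident / virtual / mapped memory
printjson(db.serverStatus().backgroundFlushing);  // mmap flush timings
printjson(db.serverStatus().connections);         // connection counts
```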