[SERVER-7488] Database lockup during map/reduce job Created: 26/Oct/12 Updated: 10/Dec/14 Resolved: 10/Apr/13
| Status: | Closed |
| Project: | Core Server |
| Component/s: | MapReduce |
| Affects Version/s: | 2.2.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Arun Bhalla | Assignee: | Tad Marshall |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Linux mpac-fe-dev.private.atlassian.com 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux |
| Attachments: |
| Operating System: | Linux |
| Participants: |
| Description |
We saw this morning that our staging server running MongoDB 2.2.0 was locked up. Several DB requests were queued up and would not complete. A long-running map/reduce job had just begun and had run for less than a minute before updates ceased. mongod backtrace via gdb:
See the attachment for the db.currentOp() output.
| Comments |
| Comment by Arun Bhalla [ 09/Jan/13 ] |
Here's an update to the issue. We rolled back to MongoDB 2.0.7 and continued to encounter the problem. We narrowed it down to running the same (complex and long-running) map-reduce job on two different connections from two different applications. The problem even occurred by chance in production (then on MongoDB 2.0.7): the single application instance started the map-reduce job, the instance was restarted, and the new instance happened to start the map-reduce job while the first run was still in progress, resulting in a lockup.

We then began to see 5-7 minute database freezes at the end of this same map-reduce job, with the following message in the MongoDB log: warning: ClientCursor::yield can't unlock b/c of recursive lock. We started seeing this not long after having to rebuild the database (via a full replica sync). I took a closer look and saw that this map-reduce job queries another collection (plugins) in the reduce function, which I know to be a no-no. We do have other (much smaller) map-reduce jobs that update the plugins collection that may have been running concurrently.

I broke the map-reduce job into two map-reduce jobs, each one mapping over a single collection and reducing into a third collection. We continued to see these freezes and the same message regardless of the ordering of the map-reduce jobs (whether mapping over the smaller collection first or last). From a web search, this warning appears to be typical of map/reduce finalization, but I didn't see any reports of database unavailability of this length. If this new unavailability issue is unknown, I can raise a separate issue for it and try to help diagnose.

We're currently working on dropping this map/reduce job and replacing it with incremental updates. This job is the only one that has been causing us trouble, likely due to the relative scale – 4,000,000 documents processed in over an hour vs. ~1,000 documents processed in a few minutes – but we intend to remove other map/reduce jobs from our application as well.
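For illustration only – not the application's actual code, and with made-up connection, collection, and field names – an incremental-update replacement along those lines might look roughly like this with Casbah:

```scala
import com.mongodb.casbah.Imports._

object IncrementalDownloadCounts {
  // Hypothetical connection, database and collection names, used only for this sketch.
  val db        = MongoConnection("localhost")("marketplace")
  val summaries = db("downloadSummaries")

  // Instead of recomputing totals over ~4,000,000 documents in a nightly
  // map/reduce pass, bump the per-plugin counter as each download is recorded.
  def recordDownload(pluginKey: String): Unit = {
    summaries.update(
      MongoDBObject("_id" -> pluginKey), // one summary document per plugin
      $inc("count" -> 1),                // atomic counter increment
      upsert = true,                     // create the document the first time the plugin is seen
      multi = false
    )
  }
}
```

Each call touches a single document with a short-lived write, so there is no long finalization phase comparable to the nightly map/reduce job.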
| Comment by Arun Bhalla [ 29/Oct/12 ] |
I've attached a sample of the downloads collection. It's really the last bit of the collection, but it's fairly representative. The downloads collection has 3,841,942 documents.
| Comment by Arun Bhalla [ 29/Oct/12 ] |
The issue recurred on Friday night as expected, and I was able to get thread backtraces via gdb. I also collected db.currentOp() output for the second occurrence.
| Comment by Tad Marshall [ 27/Oct/12 ] |
Hi Arun,

Thanks for the information. As you said, this should not lock up the server; there may be a bug in the interplay of JavaScript and database locks, and our best bet for fixing it is to reproduce it in-house. I should also have asked for a sample of the documents that are being processed. Is there anything else we would need to know to reproduce the bug? Let us know if it locks up again, thanks!

Tad
| Comment by Arun Bhalla [ 26/Oct/12 ] |
The full log is now attached. So far this has been a one-time event. If it reproduces – and I expect it will over the weekend – I'll get backtraces from all the threads. I did not let the map-reduce job(s) finish – it had been stuck for over 12 hours, so it was obviously wedged. Normally it completes in less than an hour. Here is a Scala snippet for the map/reduce job. This job has run nightly against MongoDB 2.0.x for months with no issue.
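The original snippet is not reproduced in this export. Purely as a hypothetical sketch of a nightly Casbah map/reduce invocation of this general shape (collection names, field names, the output mode, and connection details are all assumptions, not the actual job):

```scala
import com.mongodb.casbah.Imports._

object NightlyDownloadsMapReduce {
  // Hypothetical connection, database and collection names, used only for this sketch.
  val db = MongoConnection("localhost")("marketplace")

  // The map and reduce functions are JavaScript strings evaluated by the server;
  // the emitted key/value shapes below are illustrative, not the real job's.
  val mapJs =
    """function() { emit(this.pluginKey, { count: 1 }); }"""

  val reduceJs =
    """function(key, values) {
      |  var total = 0;
      |  values.forEach(function(v) { total += v.count; });
      |  return { count: total };
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    // Run mapReduce as a raw database command; results are merged into a
    // summary collection rather than returned inline.
    val cmd = MongoDBObject(
      "mapreduce" -> "downloads",
      "map"       -> mapJs,
      "reduce"    -> reduceJs,
      "out"       -> MongoDBObject("merge" -> "downloadSummaries")
    )
    val result = db.command(cmd)
    if (!result.ok())
      throw new RuntimeException("map/reduce failed: " + result)
  }
}
```

The command blocks its connection until the server finishes the job, which is why a scheduler firing it nightly only needs to issue it once per run.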
| Comment by Arun Bhalla [ 26/Oct/12 ] |
This issue might be related to a recent change on our staging server. We have two instances of the same application running, and both start their scheduled map/reduce jobs at roughly the same time (which explains the similar conn41 and conn76 activity). However, I wouldn't expect two instances of the same long-running job (usually ~30 minutes) to lock up the database.
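One common way to keep two application instances from starting the same scheduled job concurrently is a lock document keyed on a fixed _id. A minimal sketch, with hypothetical connection, collection, and job names:

```scala
import com.mongodb.casbah.Imports._

object JobLock {
  // Hypothetical connection, database and collection names, used only for this sketch.
  val locks = MongoConnection("localhost")("marketplace")("jobLocks")
  locks.setWriteConcern(WriteConcern.Safe) // make write errors (e.g. duplicate key) raise exceptions

  // Inserting a document with a fixed _id acts as a mutex: the second instance's
  // insert fails on the unique _id index, so only one instance runs the job.
  def tryAcquire(jobName: String): Boolean =
    try {
      locks.insert(MongoDBObject("_id" -> jobName, "startedAt" -> new java.util.Date))
      true
    } catch {
      case _: com.mongodb.MongoException => false // duplicate _id: another instance holds the lock
    }

  def release(jobName: String): Unit = {
    locks.remove(MongoDBObject("_id" -> jobName))
  }
}
```

A caller would wrap the map/reduce job in tryAcquire/release (with release in a finally block); stale locks left by a crashed instance would still need a timeout or manual cleanup.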
| Comment by Tad Marshall [ 26/Oct/12 ] |
Hi Arun,

Can you post the map-reduce code that triggered this? Was this a one-time event, or is it reproducible? Was this map-reduce job working properly with an earlier version of MongoDB, and if so, which version were you using? Did you let the map-reduce job finish, and did the server return to normal when it finished?

It looks like you have two map-reduce jobs running on the same data at the same time. Is this intentional? Is this the way you always run them? From the timestamps, it looks like neither map-reduce job made any progress for four hours, although from the pace it was moving in the interval between 22:01:32 and 22:01:39 we would expect it to hit 100% in about 65 minutes.

Can you post a full log?

Tad
| Comment by Arun Bhalla [ 26/Oct/12 ] |
| Comment by Brendan W. McAdams [ 26/Oct/12 ] |
This has nothing to do with the database driver. I have no input; please assign to the kernel team.
| Comment by Arun Bhalla [ 26/Oct/12 ] |
We are currently using Casbah 2.1.5-1 to access MongoDB programmatically.