[SERVER-7488] Database lockup during map/reduce job Created: 26/Oct/12  Updated: 10/Dec/14  Resolved: 10/Apr/13

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Arun Bhalla Assignee: Tad Marshall
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux mpac-fe-dev.private.atlassian.com 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64 x86_64 x86_64 GNU/Linux
16GB RAM


Attachments: File currentOp-2.json     File currentOp.json     File downloads.json     File gdb-2.out     Text File mongodb.log    
Operating System: Linux
Participants:

 Description   

This morning we found that our staging server running MongoDB 2.2.0 was locked up: several DB requests were queued and would not complete. A long-running map/reduce job had just begun and made progress for less than a minute before its updates ceased.

mongod backtrace via gdb:

#0  0x0000003f1bae0d03 in select () from /lib64/libc.so.6
#1  0x00000000006a4a5a in mongo::Listener::initAndListen() ()
#2  0x0000000000559154 in mongo::listen(int) ()
#3  0x000000000055c86a in mongo::_initAndListen(int) ()
#4  0x000000000055d2dd in mongo::initAndListen(int) ()
#5  0x0000000000563dbe in ?? ()
#6  0x0000000000565399 in main ()

See the attached currentOp.json for the db.currentOp() output.



 Comments   
Comment by Arun Bhalla [ 09/Jan/13 ]

Here's an update to the issue.

We rolled back to MongoDB 2.0.7 and continued to encounter the issue. We narrowed the problem down to running the same (complex and long-running) map-reduce job on two different connections from two different application instances. The problem even occurred by chance in production (at the time on MongoDB 2.0.7): the single application instance started the map-reduce job, that instance was restarted, and the new application instance happened to start the map-reduce job while the first one was still running, resulting in a lockup.

Then we began to see 5-7 minute database freezes at the end of this same map-reduce job, with the following message in the MongoDB log: warning: ClientCursor::yield can't unlock b/c of recursive lock. We started seeing this not long after having to rebuild the database (via a full replica sync). On closer inspection I saw that this map-reduce job queries another collection (plugins) inside the reduce function, which I know to be a no-no. We also have other (much smaller) map-reduce jobs that update the plugins collection, and they may have been running concurrently.

I broke the map-reduce job into two jobs, each one mapping over a single collection and reducing into a third collection. We continued to see the freezes and the same warning regardless of the ordering of the jobs (whether the smaller collection was mapped first or last). From a web search, this warning appears to be typical of map/reduce finalization, but I didn't find any reports of database unavailability of this length. If this new unavailability issue is unknown, I can raise a separate issue for it and try to help diagnose.
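
One way to avoid the cross-collection query would be to drop the db.plugins lookup from the reduce function entirely and attach the plugin document in a separate pass once the job has written its output. A minimal mongo shell sketch, assuming a hypothetical output collection named downloadStats keyed by pluginId (the real output collection name isn't shown in this ticket):

  // Instead of db.plugins.find() inside reduce() (which runs while the
  // map/reduce job holds its lock), join the plugin documents in afterwards.
  db.downloadStats.find().forEach(function (stat) {
    var plugin = db.plugins.findOne({ _id: stat._id });  // _id is the emitted pluginId
    if (plugin !== null) {
      db.downloadStats.update({ _id: stat._id }, { $set: { "value.plugin": plugin } });
    }
  });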

We're currently working on dropping this map/reduce job and replacing it with incremental updates. This job is the only one that has been causing us trouble, likely because of its relative scale (4,000,000 documents processed in over an hour vs. ~1,000 documents processed in a few minutes), but we intend to remove the other map/reduce jobs from our application as well.
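
A rough mongo shell sketch of the incremental shape, assuming a hypothetical lastRun cut-off and a hypothetical downloadStats output collection, and with map and reduce as shell function objects rather than the Scala-formatted strings shown in the 26/Oct/12 comment below:

  // Map only the documents that arrived since the last run and fold the
  // results into the existing output with out: { reduce: ... }.  This relies
  // on the reduce function being able to re-reduce previously reduced values.
  var lastRun = ISODate("2013-01-08T00:00:00Z");  // placeholder cut-off
  db.downloads.mapReduce(map, reduce, {
    query: { time: { $gt: lastRun } },
    out: { reduce: "downloadStats" }
  });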

Comment by Arun Bhalla [ 29/Oct/12 ]

I've attached a sample of the downloads collection. It's really the last bit of the collection, but it's fairly representative.

The downloads collection has 3841942 documents.

Comment by Arun Bhalla [ 29/Oct/12 ]

The issue repeated on Friday night as expected, and I was able to get thread backtraces via gdb.

I also collected db.currentOp() output for the second occurrence.

Comment by Tad Marshall [ 27/Oct/12 ]

Hi Arun,

Thanks for the information. As you said, this should not lock up the server; there may be a bug in the interplay of JavaScript and database locks, and our best bet for fixing it is to reproduce it in-house.

I should also have asked for a sample of the documents that are being processed. Anything else we would need to know to reproduce the bug?

Let us know if it locks up again, thanks!

Tad

Comment by Arun Bhalla [ 26/Oct/12 ]

The full log is now attached. So far this has been a one-time event. If it reproduces (I expect it will over the weekend), I'll get backtraces from all the threads. I did not let the map-reduce job(s) finish; it had been stuck for over 12 hours, so it was obviously wedged. Normally it completes in less than an hour.

Here is the Scala snippet for the map/reduce job. This job has run nightly against MongoDB 2.0.x for months with no issue.

  val map = """
    function () {
      var setToMidnight = function(date) {
        date.setUTCHours(0);
        date.setUTCMinutes(0);
        date.setUTCSeconds(0);
        date.setUTCMilliseconds(0);
      };
      var pluginId = this.pluginId;
      var downloadDay = new Date(this.time.getTime());
      setToMidnight(downloadDay);
      var weekStart = new Date(downloadDay.getTime());
      // this uses Monday as the start of the week, since that is what most of the world 
      // (and more importantly, Joda Time) considers to be the start of the week.
      weekStart.setUTCDate(weekStart.getUTCDate() - weekStart.getUTCDay() + 1);
      var monthStart = new Date(downloadDay.getTime());
      monthStart.setUTCDate(1);
      // if this download is recent enough, count it:
      var recentCount = 0;
      if (this.time.getTime() > %d) {
        recentCount = 1;
      }
      var weekCountObj = {startDate: weekStart, totalCount: 1};
      var monthCountObj = {startDate: monthStart, totalCount: 1};
      var emitObj = {pluginId: pluginId, totalCount: 1, recentCount: recentCount, weekCounts: [weekCountObj], monthCounts: [monthCountObj], versions: {}};
      var versionObj = {totalCount: 1, recentCount: recentCount, weekCounts: [weekCountObj], monthCounts: [monthCountObj]};
      emitObj.versions[this.buildNumber.toNumber()] = versionObj;
      emit(pluginId, emitObj);
    }""".format((DateTime.now minus (14 days)).getMillis)
  val reduce = """
    function (key, values) {
      var setToMidnight = function(date) {
        date.setUTCHours(0);
        date.setUTCMinutes(0);
        date.setUTCSeconds(0);
        date.setUTCMilliseconds(0);
        return date;
      };
      var result = {totalCount: 0, recentCount: 0, weekCounts: [], monthCounts: [], versions: {}};
      values.forEach(function(value) {
        if (!result.pluginId && value.pluginId) {
          result.pluginId = value.pluginId;
          var pluginResult = db.plugins.find({'_id': value.pluginId});
          pluginResult.forEach(function (plugin) {
            result.plugin = plugin;
          });
        }
        result.totalCount += value.totalCount;
        result.recentCount += value.recentCount;
        // returns the first value in the array that satisfies the predicate, or the default value 
        // if none satisfy (in this case, the default value is also added to the array in a terrible 
        // display of mutability).
        function findCount(array, startDate, valueIfNotFound) {
          var predicate = function (v) {
            return setToMidnight(v.startDate).getTime() === setToMidnight(startDate).getTime();
          }
          var filtered = array.filter(predicate)
          if (filtered.length > 0) {
            return filtered[0];
          } else {
            array.push(valueIfNotFound);
            return valueIfNotFound;
          }
        }
        value.weekCounts.forEach(function (tally) {
          var currentCount = findCount(result.weekCounts, tally.startDate, {startDate: tally.startDate, totalCount: 0});
          currentCount.totalCount += tally.totalCount;
        });
        value.monthCounts.forEach(function (tally) {
          var currentCount = findCount(result.monthCounts, tally.startDate, {startDate: tally.startDate, totalCount: 0});
          currentCount.totalCount += tally.totalCount;
        });
        for (var version in value.versions) {
          if (!result.versions[version]) {
            result.versions[version] = {weekCounts: [], monthCounts: [], totalCount: 0, recentCount: 0};
          }
          value.versions[version].weekCounts.forEach(function (tally) {
            var currentCount = findCount(result.versions[version].weekCounts, tally.startDate, {startDate: tally.startDate, totalCount: 0});
            currentCount.totalCount += tally.totalCount;
          });
          value.versions[version].monthCounts.forEach(function (tally) {
            var currentCount = findCount(result.versions[version].monthCounts, tally.startDate, {startDate: tally.startDate, totalCount: 0});
            currentCount.totalCount += tally.totalCount;
          });
          result.versions[version].totalCount += value.versions[version].totalCount;
          result.versions[version].recentCount += value.versions[version].recentCount;
        }
      });
      return result;
    }"""

Comment by Arun Bhalla [ 26/Oct/12 ]

This issue might be related to a recent change on our staging server. We have two instances of the same application running, which are both starting scheduled map/reduce jobs at roughly the same time (explaining the similar conn41 and conn76 activity). However, I wouldn't think that two instances of the same long-running job (usually ~30 minutes) should lock up the database.
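
If the overlap itself turns out to be the trigger, a cheap application-level guard would be to claim a lock document before kicking the job off. A minimal mongo shell sketch, using a hypothetical jobLocks collection:

  // Atomically claim the job: the unique _id means only one instance's insert
  // succeeds, and the other sees a duplicate-key error via getLastError().
  db.jobLocks.insert({ _id: "downloadsMapReduce", startedAt: new Date() });
  if (db.getLastError() === null) {
    // we own the lock: run the map/reduce job here, then release the lock
    db.jobLocks.remove({ _id: "downloadsMapReduce" });
  } else {
    print("another instance is already running this job");
  }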

Comment by Tad Marshall [ 26/Oct/12 ]

Hi Arun,

Can you post the map-reduce code that triggered this?

Was this a one-time event, or is it reproducible?

Was this map-reduce job working properly with an earlier version of MongoDB? Which version were you using?

Did you let the map-reduce job finish and did the server return to normal when it finished?

It looks like you have two map-reduce jobs running on the same data at the same time. Is this intentional? Is this the way you always do it?

From the timestamps, it looks like neither map-reduce job made any progress for four hours, although, judging from the pace between 22:01:32 and 22:01:39, we would expect it to hit 100% in about 65 minutes.
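
(A quick back-of-the-envelope check of that figure, using the two progress lines from the log excerpt below:)

  // 100700 docs at 22:01:32 -> 107600 docs at 22:01:39, i.e. 6900 docs in 7 s
  var docsPerSec = (107600 - 100700) / 7;   // ~986 docs/sec
  var totalSecs  = 3841935 / docsPerSec;    // ~3900 s for all 3,841,935 docs
  print(totalSecs / 60);                    // ~65 minutes to reach 100%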

Can you post a full log?

Tad

Comment by Arun Bhalla [ 26/Oct/12 ]

@400000005089fcba011bddec Thu Oct 25 22:00:00 [conn76] CMD: drop pac.tmp.mr.downloads_1394_inc
@400000005089fcba01312ef4 Thu Oct 25 22:00:00 [conn76] build index pac.tmp.mr.downloads_1394_inc { 0: 1 }
@400000005089fcba01508a4c Thu Oct 25 22:00:00 [conn76] build index done.  scanned 0 total records. 0.001 secs
@400000005089fcba016473f4 Thu Oct 25 22:00:00 [conn76] CMD: drop pac.tmp.mr.downloads_1394
[...]
@400000005089fcba02e72fdc Thu Oct 25 22:00:00 [conn41] CMD: drop pac.tmp.mr.downloads_1395_inc
@400000005089fcba02f8eaec Thu Oct 25 22:00:00 [conn41] build index pac.tmp.mr.downloads_1395_inc { 0: 1 }
@400000005089fcba03142f64 Thu Oct 25 22:00:00 [conn41] build index done.  scanned 0 total records. 0.001 secs
@400000005089fcba031e744c Thu Oct 25 22:00:00 [conn41] CMD: drop pac.tmp.mr.downloads_1395
[...]
@400000005089fd160af69e24 Thu Oct 25 22:01:32 [conn41]          100700/3841935  2%
@400000005089fd162bc5391c Thu Oct 25 22:01:32 [conn76]          100700/3841935  2%
@400000005089fd19145669cc Thu Oct 25 22:01:35 [conn41]          103500/3841935  2%
@400000005089fd1a03f661a4 Thu Oct 25 22:01:36 [conn76]          103500/3841935  2%
@400000005089fd1d09b67534 Thu Oct 25 22:01:39 [conn41]          107600/3841935  2%
@400000005089fd1d2d4754dc Thu Oct 25 22:01:39 [conn76]          107600/3841935  2%
@40000000508a365939296a4c Fri Oct 26 02:05:51 [initandlisten] connection accepted from 127.0.0.1:51321 #103 (103 connections now open)
@40000000508a365a08762534 Fri Oct 26 02:05:52 [initandlisten] connection accepted from 127.0.0.1:51322 #104 (104 connections now open)
@40000000508acfbc27a77b4c Fri Oct 26 13:00:18 [initandlisten] connection accepted from 127.0.0.1:55333 #105 (105 connections now open)
@40000000508acfbd1f4d1d6c Fri Oct 26 13:00:19 [conn105] end connection 127.0.0.1:55333 (104 connections now open)
@40000000508ad36c1eedcefc Fri Oct 26 13:16:02 [conn103] end connection 127.0.0.1:51321 (103 connections now open)

Comment by Brendan W. McAdams [ 26/Oct/12 ]

This has nothing to do with their database driver. I have no input here; please assign to the kernel team.

Comment by Arun Bhalla [ 26/Oct/12 ]

We are currently using Casbah 2.1.5-1 to access MongoDB programmatically.
