Details

    • Type: Bug Bug
    • Status: Closed Closed
    • Priority: Major - P3 Major - P3
    • Resolution: Fixed
    • Affects Version/s: 2.0.2
    • Fix Version/s: 2.1.2
    • Labels:
      None
    • Environment:
      Distributor ID: Ubuntu
      Description: Ubuntu 10.04.3 LTS
      Release: 10.04
      Codename: lucid
      Linux data2 2.6.32-33-generic #72-Ubuntu SMP Fri Jul 29 21:07:13 UTC 2011 x86_64 GNU/Linux
    • Backport:
      Cannot
    • Operating System:
      Linux
    • Bug Type:
      Performance
    • # Replies:
      22
    • Last comment by Customer:
      true

      Description

      mongodb hang this and last weekend when doing mapreduce at the very
      early stage. Following is the related log (i think), the full log is
      at https://gist.github.com/2395867

      The db is fully locked without yield and any collection related operations hang. Deadlock maybe?

      The mapreduce code should be fine. After i restart the mongodb and
      rerun the mapreduce, it finished in less than 10 minutes.

      Following is the log and code. Unfortunately, the mongodb log was lost when i restart mongodb.

      mongostat --port 27018 5
      connected to: 127.0.0.1:27018
      insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn        set repl       time 
           0      0      0      0       0       1       0    75g   151g   9.1g      0        0          0       5|3     5|2   113b   424b    16 rs_default    M   08:58:12 
           0      0      0      0       0       1       0    75g   151g   9.1g      0        0          0       5|3     5|2   163b   480b    16 rs_default    M   08:58:17 
      
      PRIMARY> db.currentOp()
      {
             "inprog" : [
                     {
                             "opid" : 15710450,
                             "active" : true,
                             "lockType" : "read",
                             "waitingForLock" : true,
                             "secs_running" : 17965,
                             "op" : "getmore",
                             "ns" : "local.oplog.rs",
                             "query" : {
      
                             },
                             "client" : "192.168.0.21:55171",
                             "desc" : "conn",
                             "threadId" : "0x7f1e8d1f3700",
                             "connectionId" : 5,
                             "numYields" : 0
                     },
                     {
                             "opid" : 15710430,
                             "active" : true,
                             "lockType" : "read",
                             "waitingForLock" : false,
                             "secs_running" : 17966,
                             "op" : "query",
                             "ns" : "app_wowsearch.events.day",
                             "query" : {
                                     "$msg" : "query not recording (too large)"
                             },
                             "client" : "127.0.0.1:33900",
                             "desc" : "conn",
                             "threadId" : "0x7f1e8daf5700",
                             "connectionId" : 19963,
                             "msg" : "m/r: (1/3) emit phase 0/835941 0%",
                             "progress" : {
                                     "done" : 0,
                                     "total" : 835941
                             },
                             "numYields" : 3
                     },
      
      
      db.serverStatus()
      {
             "host" : "data2:27018",
             "version" : "2.0.2",
             "process" : "mongod",
             "uptime" : 602434,
             "uptimeEstimate" : 594271,
             "localTime" : ISODate("2012-04-16T00:59:47.391Z"),
             "globalLock" : {
                     "totalTime" : 602434261059,
                     "lockTime" : 56141056582,
                     "ratio" : 0.09319034492379538,
                     "currentQueue" : {
                             "total" : 8,
                             "readers" : 5,
                             "writers" : 3
                     },
                     "activeClients" : {
                             "total" : 7,
                             "readers" : 5,
                             "writers" : 2
                     }
             },
      

      Here is the map reduce code, a few utility functions are not provided. If you want them, I can provide too.

      function rollup_cycle_users(cycle, firstday) { //{{{
              var m = function() { //{{{
                      var l = this.l || {};
      
                      var mk = {id: this.id, l: {}};
                      var keys = ['s', 'rt', 'rc', 'rf', 'va'];
                      keys.forEach(function(k) {
                              mk.l[k] = l[k] || null;
                      });
      
                      var mv = extend(copy_obj(this, ['model', 'ch', 'city', 'province', 'avn', 'resolution']),
                                      {
                                              pvs: 0,
                                              downloads: 0,
                                              searchs: 0,
                                              video_pls: 0,
                                              days: {},
                                      });
                      var e2dayname = {
                              search: 'sedays',   
                              download: 'dldays', 
                              pv: 'pvdays',       
                              video_play: 'vpdays',
                      };
                      switch(this.e){
                              case 'download' : mv.downloads += this.c || 1; break;
                              case 'pv' : mv.pvs += this.c || 1; break;
                              case 'search' : mv.searchs += this.c || 1; break;
                              case 'video_play' : mv.video_pls += this.c || 1; break;
                      }
                      var day_name = e2dayname[this.e];
                      if (day_name) {
                              var d = 1 + (this.ts.getTime() - base_ts.getTime()) / (24 * 3600000); // start from 1
                              mv.days[day_name] = {};
                              mv.days[day_name][d] = true;
                              emit(mk, mv);       
                      }
              }; //}}}
      
              var r = function(key, emits) { //{{{
                      var total0 = {downloads: 0, pvs: 0, video_pls:0, searchs:0, days: {}};
                      if (emits.length == 0) {    
                              return total0;      
                      }
                      var total = extend(emits[0], total0);
      
                      var tdays = total.days;     
                      emits.forEach(function(emit) {
                              for (day_name in emit.days) {
                                      tdays[day_name] = extend(tdays[day_name] || {}, emit.days[day_name]);
                                      tdays[day_name] = tdays[day_name] || {};
                                      for (d in emit.days[day_name]) {
                                              tdays[day_name][d] = true;
                                      }
                              }
                              total.downloads += emit.downloads;
                              total.pvs += emit.pvs;
                              total.video_pls += emit.video_pls;
                              total.searchs += emit.searchs;
                      });
                      return total;
              }; //}}}
      
              var f = function(key, total) { //{{{
                      function count(days) {      
                              days = days || [];  
                              return days.length; 
                      }
                      function max_continuous_count(days) { //{{{
                              days = days || [];  
                              if (!days.length) return 0;
      
                              var max_ccount = 0; 
                              var ccount = 0;     
                              days = days.concat([-1]); // guard
                              for (i in days) {   
                                      if (days[i] - days[i-1] != 1 || i == 0) {
                                              max_ccount = Math.max(max_ccount, ccount);
                                              ccount = 1;
                                      }
                                      else {      
                                              ccount += 1;
                                      }
                              }
      
                              return max_ccount;  
                      } //}}}
                      function max_delta(days) {  
                              deltas = delta(days);
                              if (deltas.length == 0) return 0;
                              return Math.max.apply(null, deltas);
                      }
                      function min_delta(days) {  
                              deltas = delta(days);
                              if (deltas.length == 0) return 0;
                              return Math.min.apply(null, deltas);
                      }
      
      
                      for (d in total.days) {     
                              total.days[d] = keys(total.days[d]).sort(function(a, b) {return a-b;});
                      }
                      var days2al = {pvdays: 'pv', dldays: 'dl', vpdays: 'vp', sedays: 'se'}; //detail in map function 访问,下载,播放,搜索天数
      
                      var pvdays = total.days.pvdays || [];
                      var dldays = total.days.dldays || [];
                      aldays = {};
                      for (cdi in total.days) {   
                              cd = total.days[cdi];
                              if (!cd) continue;  
                              aldays[days2al[cdi]] = {
                                      d: count(cd),
                                      contd: max_continuous_count(cd),
                                      mindd: min_delta(cd),
                                      maxdd: max_delta(cd),
                              };
                      }
      
                      var min_dl_day = -1;
                      if (dldays.length) {
                              min_dl_day = Math.min(dldays);
                      }
                      dlpvdays = filter(pvdays, function(d) {
                              return d > min_dl_day;
                      });
                      dlpvd = count(dlpvdays);    
      
                      total = extend(total, {dlpvd: dlpvd});
                      total = extend(total, aldays);
      
                      return total;
              } //}}}
      
              var cycle2delta = {
                      week: '7d',
                      month: '1m',
              };
      
              var colname = get_scale_col_name(cycle, firstday);
              var mr_colname = 'mr.' + colname;   
              var ts0 = get_china_day(firstday);  
              var ts1 = date_add(ts0, cycle2delta[cycle]);
              var query = {ts: {$gte: ts0, $lt: ts1}}; // [ts0, ts1)
              db.events.day.mapReduce(
                              m, r, {
                                      verbose: true,
                                      out: mr_colname,
                                      finalize: f,
                                      query: query,
                                      scope: {    
                                              base_ts: ts0,
                                              copy_obj: copy_obj,
                                              filter: filter,
                                              extend: extend,
                                              delta: delta,
                                              keys: keys,
                                      },
                                      jsMode: true,
                              }
                              );
      
              db[mr_colname].find().forEach(function(doc) {
                      var ndoc = extend(doc._id, doc.value);
                      db[colname].insert(ndoc);   
              });
      
      } //}}}
      

        Activity

        Hide
        Brett Kiefer
        added a comment -

        Thanks, then that stopgap might make sense for us, since no single query should be generating a lot of objects for GC. If this is indeed our problem, then we've seen up to 90-second garbage collection cycles. Read and write locks aside, that does seem like a long time for the JS engine to be tied up, so I like your solution 3 above, though it sounds like 1 or 2 would be a good idea, too. Thanks!

        Show
        Brett Kiefer
        added a comment - Thanks, then that stopgap might make sense for us, since no single query should be generating a lot of objects for GC. If this is indeed our problem, then we've seen up to 90-second garbage collection cycles. Read and write locks aside, that does seem like a long time for the JS engine to be tied up, so I like your solution 3 above, though it sounds like 1 or 2 would be a good idea, too. Thanks!
        Hide
        Ping Yin
        added a comment -

        > This is because the group command and map/reduce both grabs the js global interpreter lock. So only one of them can be running at a time.
        @Randolph Tan The global lock is ok for me. But why they lock each other forever? Can't the dead lock problem be avoided? i.e. group run first (map reduce waiting for lock), and then the map reduce run

        Show
        Ping Yin
        added a comment - > This is because the group command and map/reduce both grabs the js global interpreter lock. So only one of them can be running at a time. @Randolph Tan The global lock is ok for me. But why they lock each other forever? Can't the dead lock problem be avoided? i.e. group run first (map reduce waiting for lock), and then the map reduce run
        Hide
        auto
        added a comment -

        Author:

        {u'date': u'2012-06-12T19:24:37-07:00', u'email': u'antoine@10gen.com', u'name': u'agirbal'}

        Message: SERVER-5627: mongodb hangs when doing mapreduce
        Branch: master
        https://github.com/mongodb/mongo/commit/bbc3ee201c0cd0a01d15c486ddfba0fb353f46e9

        Show
        auto
        added a comment - Author: {u'date': u'2012-06-12T19:24:37-07:00', u'email': u'antoine@10gen.com', u'name': u'agirbal'} Message: SERVER-5627 : mongodb hangs when doing mapreduce Branch: master https://github.com/mongodb/mongo/commit/bbc3ee201c0cd0a01d15c486ddfba0fb353f46e9
        Hide
        Antoine Girbal
        added a comment -

        for now the following was done to mitigate the issue:

        • yield lock every 100 instead of 1000 when processing map or reduce
        • get rid of JS context every 10 use instead of 100
        Show
        Antoine Girbal
        added a comment - for now the following was done to mitigate the issue: yield lock every 100 instead of 1000 when processing map or reduce get rid of JS context every 10 use instead of 100
        Hide
        Antoine Girbal
        added a comment -

        Ideally the behavior should be:

        • yield based on time elapsed instead of number of items (mr.cpp line 1101).
          This way it would help the db if MR is in a tough spot and every item item processed takes a while.
          Problem is that the yielding code cannot easily be replaced with yieldSometimes() since it tries to do operations while outside lock (calls checkSize which usually do reduces).
        • throw away JS context based on its size / number of objects.
          Could not find an easy way to get stats on the context, typically if it is small there is no reason to throw it away.
          Otherwise could try to use the overall heap size.

        Based on above 2 items, may need another ticket scheduled for later version.

        Show
        Antoine Girbal
        added a comment - Ideally the behavior should be: yield based on time elapsed instead of number of items (mr.cpp line 1101). This way it would help the db if MR is in a tough spot and every item item processed takes a while. Problem is that the yielding code cannot easily be replaced with yieldSometimes() since it tries to do operations while outside lock (calls checkSize which usually do reduces). throw away JS context based on its size / number of objects. Could not find an easy way to get stats on the context, typically if it is small there is no reason to throw it away. Otherwise could try to use the overall heap size. Based on above 2 items, may need another ticket scheduled for later version.

          People

          • Votes:
            4 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:
              Days since reply:
              1 year, 45 weeks, 1 day ago
              Date of 1st Reply: