Core Server / SERVER-2918

Server shuts down once a day, without being able to reproduce the crash


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 1.6.5
    • Component/s: Stability
    • Labels: None
    • Operating System: ALL

    Description

      We've been having problems with MongoDB since updating to 1.6.5 (about 1 month ago). It is a single-server configuration: we have 7 GB of data + indexes, mongod usually stays at around 7-8 GB of RAM, and the machine has 15 GB of RAM in total plus 4 GB of swap.

      The database is crashing every day, following two different patterns:
      1) MongoDB fills the whole memory of the server (all 15 GB plus the entire 4 GB of swap) very fast, going from 7-8 GB to 15 + 4 GB in about a minute; after that, nothing seems to be able to shut the server down (we tried db.shutdownServer(), kill -15 and kill -9) - it is blocked, it accepts new connections but never returns anything; the only shutdown procedure that works is shutting the whole machine down. After restarting the machine, the lock file is still present so mongod won't start, but after removing the lock file the server starts and is fine again (for a day or two).
      2) The second pattern does not occur every day, more like once every 5 days - the database just shuts down, freeing all the memory. Because I observed this one as it happened, I was able to isolate the connection that seems to have caused the shutdown. Here is what the log looks like for this connection (a mongo shell reconstruction of the map-reduce command follows the log):

      Fri Apr 8 10:37:14 [conn58463] run command jbm.$cmd { mapreduce: "mongocvs", map: CodeWScope( function() { var x; var folders = this.employers[96012].folders; for(x in folders) { emit(x,1); } }, {}), reduce: CodeWScope( function(k, vals) { var sum = 0; for (var i in vals) { sum += vals[i]; } return sum; }, {}), query: { empids: 96012, $or: [ { folders: 815493 }, { folders: 814868 }, { folders: 813880 }, { folders: 811648 }, { folders: 811647 }, { folders: 811644 }, { folders: 811643 }, { folders: 811642 }, { folders: 811640 }, { folders: 811639 }, { folders: 811638 }, { folders: 811637 }, { folders: 806672 }, { folders: 806671 }, { folders: 806670 }, { folders: 806583 }, { folders: 806493 }, { folders: 805036 }, { folders: 805035 }, { folders: 801933 }, { folders: 801932 }, { folders: 801929 }, { folders: 801648 }, { folders: 800608 }, { folders: 798096 }, { folders: 797344 }, { folders: 797306 }, { folders: 796627 }, { folders: 796626 }, { folders: 796625 } ] }, out: "mr_count_folder_cvs" }
      Fri Apr 8 10:37:14 [conn58463] mr ns: jbm.mongocvs
      Fri Apr 8 10:37:14 [conn58463] run command jbm.$cmd { count: "system.namespaces", query: { name: "jbm.mongocvs" } }
      Fri Apr 8 10:37:14 [conn58463] query jbm.$cmd ntoreturn:1 command: { count: "system.namespaces", query: { name: "jbm.mongocvs" } } reslen:64 0ms
      Fri Apr 8 10:37:14 [conn58463] query jbm.system.js reslen:36 nreturned:0 0ms
      Fri Apr 8 10:37:14 [conn58463] run command jbm.$cmd { drop: "tmp.mr.mapreduce_1302248234_47440" }
      Fri Apr 8 10:37:14 [conn58463] CMD: drop jbm.tmp.mr.mapreduce_1302248234_47440
      Fri Apr 8 10:37:14 [conn58463] query jbm.$cmd ntoreturn:1 command: { drop: "tmp.mr.mapreduce_1302248234_47440" } reslen:78 0ms
      Fri Apr 8 10:37:14 [conn58463] run command jbm.$cmd { drop: "tmp.mr.mapreduce_1302248234_47440_inc" }
      Fri Apr 8 10:37:14 [conn58463] CMD: drop jbm.tmp.mr.mapreduce_1302248234_47440_inc
      Fri Apr 8 10:37:14 [conn58463] query jbm.$cmd ntoreturn:1 command: { drop: "tmp.mr.mapreduce_1302248234_47440_inc" } reslen:78 0ms
      Fri Apr 8 10:37:14 [conn58463] create collection jbm.tmp.mr.mapreduce_1302248234_47440_inc { autoIndexId: 0 }
      Fri Apr 8 10:55:32 [conn58463] mr failed, removing collection
      Fri Apr 8 10:55:32 [conn58463] run command jbm.$cmd { drop: "tmp.mr.mapreduce_1302248234_47440" }
      Fri Apr 8 10:57:01 [conn58463] CMD: drop jbm.tmp.mr.mapreduce_1302248234_47440
      Fri Apr 8 10:57:01 [conn58463] query jbm.$cmd ntoreturn:1 command: { drop: "tmp.mr.mapreduce_1302248234_47440" } reslen:78 88937ms
      Fri Apr 8 10:57:01 [conn58463] run command jbm.$cmd { drop: "tmp.mr.mapreduce_1302248234_47440_inc" }
      Fri Apr 8 10:57:03 [conn58463] CMD: drop jbm.tmp.mr.mapreduce_1302248234_47440_inc
      Fri Apr 8 10:57:03 [conn58463] dropCollection: jbm.tmp.mr.mapreduce_1302248234_47440_inc
      Fri Apr 8 10:57:03 [conn58463] dropIndexes done
      Fri Apr 8 10:57:03 [conn58463] query jbm.$cmd ntoreturn:1 command: { drop: "tmp.mr.mapreduce_1302248234_47440_inc" } reslen:105 2710ms
      Fri Apr 8 10:57:04 [conn58463] Uncaught std::exception: St9bad_alloc, terminating
      Fri Apr 8 10:57:04 [conn58463] shutdown: going to close listening sockets...
      Fri Apr 8 10:57:04 [conn58463] closing listening socket: 5
      Fri Apr 8 10:57:04 [conn58463] closing listening socket: 6
      Fri Apr 8 10:57:04 [conn58463] closing listening socket: 7
      Fri Apr 8 10:57:04 [conn58463] closing listening socket: 8
      Fri Apr 8 10:57:04 [conn58463] shutdown: going to flush oplog...
      Fri Apr 8 10:57:04 [conn58463] shutdown: going to close sockets...
      Fri Apr 8 10:57:04 [conn58463] shutdown: waiting for fs preallocator...
      Fri Apr 8 10:57:04 [conn58463] shutdown: closing all files...
      Fri Apr 8 10:57:04 [conn58463] shutdown: removing fs lock...
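
      For reference, the command at the start of the log corresponds roughly to the mongo shell call below. This is a sketch reconstructed from the log entry, not the application's original client code; the collection name, map/reduce functions, query fields and output collection name are taken verbatim from the log, and the folder list is truncated here because it appears in full above.

        // Sketch: map-reduce call reconstructed from the logged command (mongo shell).
        var mapFn = function() {
            var x;
            var folders = this.employers[96012].folders;
            for (x in folders) {
                emit(x, 1);
            }
        };
        var reduceFn = function(k, vals) {
            var sum = 0;
            for (var i in vals) {
                sum += vals[i];
            }
            return sum;
        };
        db.mongocvs.mapReduce(mapFn, reduceFn, {
            query: { empids: 96012, $or: [ { folders: 815493 }, { folders: 814868 } /* ... remaining folder ids exactly as in the log ... */ ] },
            out: "mr_count_folder_cvs"
        });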

      Our immediate plan is to upgrade to version 1.8 and add another machine, turning the system into a replica set - that is what we are working on right now. But I still want to investigate this problem and make sure it won't be present after the upgrade.
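
      For context on the replica set plan: converting a standalone 1.8 mongod into a one-member replica set that a second machine can later join usually comes down to the shell steps sketched below. This is a minimal sketch; the set name "rs0" and the second host name are hypothetical placeholders, not values from this issue.

        // Minimal sketch of a standalone-to-replica-set conversion (MongoDB 1.8 mongo shell).
        // Assumes mongod was restarted with --replSet rs0; "rs0" and the host below are placeholders.
        rs.initiate();                      // turn this server into a one-member replica set
        rs.add("db2.example.com:27017");    // add the planned second machine once it is running
        rs.status();                        // verify that both members come up and one becomes PRIMARY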

      Notes:

      • unfortunately, the shutdown described above is not reproducible - I tried the same sequence of commands and it didn't crash (see the sketch after these notes for watching the operation while re-running it)
      • as you can see from the log, we use map-reduce too
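
      As a hedged aside on isolating the connection while trying to reproduce: a long-running map-reduce should show up in db.currentOp() while it runs, and its opid can be passed to db.killOp(). Both helpers exist in the 1.6.x shell, but the exact layout of the currentOp output varies by version and a map-reduce on 1.6.x may not honour the kill promptly, so the filter below is illustrative only.

        // Sketch: from a second mongo shell, watch for the map-reduce while re-running it.
        db.currentOp().inprog.forEach(function(op) {
            // field names are illustrative; adjust to the currentOp output of your server version
            if (op.query && op.query.mapreduce === "mongocvs") {
                print("map-reduce in progress, opid " + op.opid);
                // db.killOp(op.opid);   // uncomment to attempt to abort the operation
            }
        });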

          People

            Assignee: Unassigned
            Reporter: Catalin Trandafir (catalint)
