[SERVER-2918] Server shuts down once a day, without being able to reproduce the crash Created: 08/Apr/11  Updated: 12/Jul/16  Resolved: 08/Apr/11

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 1.6.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Catalin Trandafir Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

We've been having problems with MongoDB since upgrading to 1.6.5 (about a month ago). It is a single-server configuration: we have 7 GB of data plus indexes, and mongod usually stays at around 7-8 GB of RAM; the machine has 15 GB of RAM in total and 4 GB of swap.
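
The sizes above can be confirmed from the shell; a minimal sketch, assuming the jbm database seen in the logs below:

// Rough data + index footprint, in GB, for the jbm database (name taken from the logs below).
db = db.getSiblingDB("jbm");
var s = db.stats();
print("data:  " + (s.dataSize / 1024 / 1024 / 1024).toFixed(2) + " GB");
print("index: " + (s.indexSize / 1024 / 1024 / 1024).toFixed(2) + " GB");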

The database crashes every day, following one of two patterns:
1) mongod fills all of the server's memory (the full 15 GB plus the 4 GB of swap) very fast - it goes from 7-8 GB to 15 + 4 in about a minute. After that, nothing seems able to shut the server down (we tried db.shutdownServer(), kill -15, and kill -9); it is blocked - it accepts new connections but never returns anything - and the only shutdown procedure that works is powering off the whole machine. After the restart the lock file is still present, so mongod won't start, but once we remove the lock file the server starts and is fine again (for a day or two).
2) The second pattern does not occur every day - more like once every 5 days: the database simply shuts down, freeing all the memory. I have just observed this happen, so I was able to isolate the connection that seems to have caused the shutdown. Here is the log for that connection:

Fri Apr 8 10:37:14 [conn58463] run command jbm.$cmd { mapreduce: "mongocvs", map: CodeWScope( function() { var x; var folders = this.employers[96012].folders; for(x in folders) { emit(x,1); } }, {}), reduce: CodeWScope( function(k, vals) { var sum = 0; for (var i in vals) { sum += vals[i]; } return sum; }, {}), query: { empids: 96012, $or: [ { folders: 815493 }, { folders: 814868 }, { folders: 813880 }, { folders: 811648 }, { folders: 811647 }, { folders: 811644 }, { folders: 811643 }, { folders: 811642 }, { folders: 811640 }, { folders: 811639 }, { folders: 811638 }, { folders: 811637 }, { folders: 806672 }, { folders: 806671 }, { folders: 806670 }, { folders: 806583 }, { folders: 806493 }, { folders: 805036 }, { folders: 805035 }, { folders: 801933 }, { folders: 801932 }, { folders: 801929 }, { folders: 801648 }, { folders: 800608 }, { folders: 798096 }, { folders: 797344 }, { folders: 797306 }, { folders: 796627 }, { folders: 796626 }, { folders: 796625 } ] }, out: "mr_count_folder_cvs" }
Fri Apr 8 10:37:14 [conn58463] mr ns: jbm.mongocvs
Fri Apr 8 10:37:14 [conn58463] run command jbm.$cmd { count: "system.namespaces", query: { name: "jbm.mongocvs" } }
Fri Apr 8 10:37:14 [conn58463] query jbm.$cmd ntoreturn:1 command: { count: "system.namespaces", query: { name: "jbm.mongocvs" } } reslen:64 0ms
Fri Apr 8 10:37:14 [conn58463] query jbm.system.js reslen:36 nreturned:0 0ms
Fri Apr 8 10:37:14 [conn58463] run command jbm.$cmd { drop: "tmp.mr.mapreduce_1302248234_47440" }
Fri Apr 8 10:37:14 [conn58463] CMD: drop jbm.tmp.mr.mapreduce_1302248234_47440
Fri Apr 8 10:37:14 [conn58463] query jbm.$cmd ntoreturn:1 command: { drop: "tmp.mr.mapreduce_1302248234_47440" } reslen:78 0ms
Fri Apr 8 10:37:14 [conn58463] run command jbm.$cmd { drop: "tmp.mr.mapreduce_1302248234_47440_inc" }
Fri Apr 8 10:37:14 [conn58463] CMD: drop jbm.tmp.mr.mapreduce_1302248234_47440_inc
Fri Apr 8 10:37:14 [conn58463] query jbm.$cmd ntoreturn:1 command: { drop: "tmp.mr.mapreduce_1302248234_47440_inc" } reslen:78 0ms
Fri Apr 8 10:37:14 [conn58463] create collection jbm.tmp.mr.mapreduce_1302248234_47440_inc { autoIndexId: 0 }
Fri Apr 8 10:55:32 [conn58463] mr failed, removing collection
Fri Apr 8 10:55:32 [conn58463] run command jbm.$cmd { drop: "tmp.mr.mapreduce_1302248234_47440" }
Fri Apr 8 10:57:01 [conn58463] CMD: drop jbm.tmp.mr.mapreduce_1302248234_47440
Fri Apr 8 10:57:01 [conn58463] query jbm.$cmd ntoreturn:1 command: { drop: "tmp.mr.mapreduce_1302248234_47440" } reslen:78 88937ms
Fri Apr 8 10:57:01 [conn58463] run command jbm.$cmd { drop: "tmp.mr.mapreduce_1302248234_47440_inc" }
Fri Apr 8 10:57:03 [conn58463] CMD: drop jbm.tmp.mr.mapreduce_1302248234_47440_inc
Fri Apr 8 10:57:03 [conn58463] dropCollection: jbm.tmp.mr.mapreduce_1302248234_47440_inc
Fri Apr 8 10:57:03 [conn58463] dropIndexes done
Fri Apr 8 10:57:03 [conn58463] query jbm.$cmd ntoreturn:1 command: { drop: "tmp.mr.mapreduce_1302248234_47440_inc" } reslen:105 2710ms
Fri Apr 8 10:57:04 [conn58463] Uncaught std::exception: St9bad_alloc, terminating
Fri Apr 8 10:57:04 [conn58463] shutdown: going to close listening sockets...
Fri Apr 8 10:57:04 [conn58463] closing listening socket: 5
Fri Apr 8 10:57:04 [conn58463] closing listening socket: 6
Fri Apr 8 10:57:04 [conn58463] closing listening socket: 7
Fri Apr 8 10:57:04 [conn58463] closing listening socket: 8
Fri Apr 8 10:57:04 [conn58463] shutdown: going to flush oplog...
Fri Apr 8 10:57:04 [conn58463] shutdown: going to close sockets...
Fri Apr 8 10:57:04 [conn58463] shutdown: waiting for fs preallocator...
Fri Apr 8 10:57:04 [conn58463] shutdown: closing all files...
Fri Apr 8 10:57:04 [conn58463] shutdown: removing fs lock...
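
A minimal sketch of how the memory ramp-up of pattern 1 can be watched from a separate shell, before mongod stops responding (serverStatus() reports these values in MB; the polling loop is an illustration, not part of the original setup):

// Poll mongod memory once a second from a separate mongo shell; values are in MB.
while (true) {
    var mem = db.serverStatus().mem;
    print(new Date() + "  resident: " + mem.resident + " MB, virtual: " + mem.virtual + " MB, mapped: " + mem.mapped + " MB");
    sleep(1000);  // mongo shell helper; sleeps for the given number of milliseconds
}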

Our immediate plan is to upgrade to 1.8 and add another machine, turning the system into a replica set - that is what we are working on right now. But I still want to investigate this problem and make sure it won't be present after the upgrade.
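
A rough sketch of that replica-set conversion (hostnames and the arbiter layout are placeholders, not the final topology):

// Run once from a mongo shell on the member that should become primary.
rs.initiate({
    _id: "rs0",
    members: [
        { _id: 0, host: "db1.example.com:27017" },
        { _id: 1, host: "db2.example.com:27017" },
        { _id: 2, host: "arb1.example.com:27017", arbiterOnly: true }  // arbiter votes but holds no data
    ]
});
rs.status();  // verify that the members come up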

Notes:

  • Unfortunately, the shutdown described above is not reproducible - I tried the same sequence of commands and the server did not crash.
  • As the log shows, we use map-reduce as well; a sketch of the map function follows below.
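
For reference, the map function from the log, written out with a defensive guard (the guard is added here for illustration only; the production code relies on the query already filtering on empids: 96012):

m = function() {
    // Skip documents where the employer entry or its folders are absent.
    var emp = this.employers && this.employers[96012];
    if (!emp || !emp.folders) return;
    for (var x in emp.folders) {
        emit(x, 1);  // one count per folder key
    }
};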


 Comments   
Comment by Eliot Horowitz (Inactive) [ 08/Apr/11 ]

There were a couple of memory leaks in JS that could cause issues with SpiderMonkey, which is why it's happening in 1.6.5.

Comment by Catalin Trandafir [ 08/Apr/11 ]

Thank you for the loop idea - at least I was able to reproduce the crash. I still don't understand why it is crashing on 1.6.5, but being able to reproduce it makes it a lot less frustrating.

Comment by Catalin Trandafir [ 08/Apr/11 ]

I tried to run it in a loop:

m = function() {
    var x;
    var folders = this.employers[96012].folders;
    for (x in folders) {
        emit(x, 1);
    }
};

r = function(k, vals) {
    var sum = 0;
    for (var i in vals) {
        sum += vals[i];
    }
    return sum;
};

for (i = 1; i < 1000; i++) {
    res = db.runCommand({
        mapreduce: "mongocvs",
        map: m,
        reduce: r,
        query: { empids: 96012, $or: [
            { folders: 815493 }, { folders: 814868 }, { folders: 813880 },
            { folders: 811648 }, { folders: 811647 }, { folders: 811644 },
            { folders: 811643 }, { folders: 811642 }, { folders: 811640 },
            { folders: 811639 }, { folders: 811638 }, { folders: 811637 },
            { folders: 806672 }, { folders: 806671 }, { folders: 806670 },
            { folders: 806583 }, { folders: 806493 }, { folders: 805036 },
            { folders: 805035 }, { folders: 801933 }, { folders: 801932 },
            { folders: 801929 }, { folders: 801648 }, { folders: 800608 },
            { folders: 798096 }, { folders: 797344 }, { folders: 797306 },
            { folders: 796627 }, { folders: 796626 }, { folders: 796625 }
        ] },
        out: "mr_count_folder_cvs"
    });
}

I ran this in a test environment on 1.6.5 - it crashed the server (I tried it twice). I then ran it on another server with the same data, on 1.8 in a replica-set configuration (only the primary started, plus 2 arbiters) - on 1.8 the server does not crash.
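
An instrumented variant of the loop can show how memory grows between passes (a sketch; the $or list is elided here - use the full query from the loop above):

// Report server memory after each map-reduce pass; m and r are the functions defined above.
for (var i = 1; i < 1000; i++) {
    var res = db.runCommand({
        mapreduce: "mongocvs",
        map: m,
        reduce: r,
        query: { empids: 96012 },  // full $or list elided for brevity
        out: "mr_count_folder_cvs"
    });
    if (res.ok != 1) {
        print("map-reduce failed at iteration " + i + ": " + tojson(res));
        break;
    }
    var mem = db.serverStatus().mem;
    print("iteration " + i + ": resident " + mem.resident + " MB, virtual " + mem.virtual + " MB");
}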

Comment by Eliot Horowitz (Inactive) [ 08/Apr/11 ]

Looks like a memory leak, possibly in map/reduce. If you run that same command in a loop, does it crash?
