[SERVER-3131] JS Error: out of memory leading to segfaults Created: 23/May/11  Updated: 12/Jul/16  Resolved: 29/Jun/11

Status: Closed
Project: Core Server
Component/s: JavaScript
Affects Version/s: 1.8.1
Fix Version/s: 1.9.1

Type: Bug Priority: Major - P3
Reporter: Paul Harvey Assignee: Antoine Girbal
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04.2 LTS, 4GB RAM VMWare instance with one CPU core (Intel(R) Xeon(R) CPU X5680 @ 3.33GHz)

Three member replica set, running as follows:
mongod --pidfilepath /var/lib/mongodb/seta.pid --config /etc/mongodb/seta.conf --replSet seta --dbpath /var/lib/mongodb/seta -vvvvv --port 27017 --logpath /var/log/mongodb/seta.log

/etc/mongodb/seta.conf:
logappend=true
noprealloc = true
smallfiles = true
directoryperdb = true

We were crashing with 1.8.1 from the 10gen deb repo, and we just got another segfault today running the following build (mongod --version output):
db version v1.8.2-rc1, pdfile version 4.5
Mon May 23 17:18:45 git version: da537eae3ec18611424dabb63a202d4940059be4

The replica set members are running only MongoDB, nothing else.

This is the same setup which was discussed at http://groups.google.com/group/mongodb-user/browse_thread/thread/35efda30f3aeff35


Attachments: File seta.log.022b     File seta.log.022c    
Operating System: Linux

 Description   

Full details are at http://foswiki.org/Tasks/Item10672?section=mongodb-user2#Crash_5 (and even more details at http://foswiki.org/Tasks/Item10672).

We tested a new version of our application (Foswiki), which uses MongoDB 1.8.1 as a query cache/accelerator, for two weeks before putting it into production. There were no unexplained instabilities. Now that it is in production, the site runs for a couple of days at a time, but at least twice a week we get a segfault in whichever mongod happens to be the primary.

It's not a sudden thing: we see spurious "JS Error: out of memory" and "Assertion: 10432:JS_NewObject failed for global" warnings in the log for an hour or two before the mongod process segfaults.

The problem is extremely hard to reproduce; we haven't been able to trigger it ourselves in our test environment using artificial load (our production site is public on the Internet).

seta.log.022c is a snippet leading up to the latest segfault.

seta.log.022b is a snippet leading up to and including the first few minutes of problems after running fine for a couple of days.

Both were captured at -vvvvv verbosity, which generates ~10GB/day of log files (extremely burdensome), so I've filtered them with grep -v ^checking



 Comments   
Comment by Jason R. Coombs [ 09/Jul/11 ]

We started getting these "out of memory" errors today. We just moved to a new datacenter, so we have a few extra variables to contend with. We're also running MongoDB 1.8.1 (same as in the original datacenter). We did not encounter the segfault, even after several hours of the OOM errors, but we did restart the process to restore service. The OOM errors started after running for about 8 hours. We're using the same applications, though we're running more nodes. We're under moderate load. We have turned on journaling for our master DB (which we had not done in our old DC).

Should this ticket be marked as resolved when all that was done was to allow for an increase in JS memory? Is that the prescribed fix (wait until you get these errors, then bump up your JS memory)? Does the resolution indicate that the leak is caused by a client's JS code itself?

Is there any value in upgrading to 1.8.2 with respect to this issue? Should we consider limiting the number of application nodes connecting? Is there any reason to think that journaling would have any impact?

For now, we're watching for the OOM errors, expecting they'll crop up again in a few hours.

Comment by Antoine Girbal [ 29/Jun/11 ]

Moving this issue to version 1.9.1, since the increase in JS memory is available in the 1.9 line.

Comment by Antoine Girbal [ 25/May/11 ]

OK, let us know how it goes.
On our side, we'll try to find the leak.

Comment by Paul Harvey [ 25/May/11 ]

We are now running a build from b6f07e2b6db67ce4ef5812af4b912e3851f388f0 with the limit changed to 128MB (resulting in 14fa0ca45328097b38d4d9dcf39081302079ecc6). For now I'm going to keep the cron job that restarts everything for a while, and I'll come back in a week or two with my findings.

Still, I feel that we need extra debug info to really get to the bottom of this leak.

Comment by Paul Harvey [ 24/May/11 ]

Thank you! I'll give it a shot tonight. As I mentioned on SERVER-3012, it would be great if there were some extra debug info we could get out of the -vvvvv[...] debug logs to aid us in tracking down a JS leak. I don't know how the JS memory management is done, but perhaps GC events or heap-size changes could be useful... if those things exist
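
For reference, the SpiderMonkey JSAPI does expose a GC notification hook (JS_SetGCCallback), so GC-event logging of roughly this kind should be feasible. The following is only a hypothetical sketch, not code from mongod; it assumes the bundled SpiderMonkey provides JS_SetGCCallback and the JSGCStatus values, and the function name is made up:

#include <cstdio>
#include "jsapi.h"

// Hypothetical GC logger -- not part of the MongoDB source tree.
static JSBool logGCEvents(JSContext* cx, JSGCStatus status) {
    if (status == JSGC_BEGIN)
        fprintf(stderr, "spidermonkey: GC begin (cx=%p)\n", (void*)cx);
    else if (status == JSGC_END)
        fprintf(stderr, "spidermonkey: GC end (cx=%p)\n", (void*)cx);
    return JS_TRUE; // let the GC proceed normally
}

// Registered once per context, e.g. right after JS_NewContext(...):
//     JS_SetGCCallback(cx, logGCEvents);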

Comment by Antoine Girbal [ 24/May/11 ]

The fact that these errors appear after the app has been running for some time, and then don't go away, points to a memory leak.
Make sure that you don't use any global JS variable that may grow over time, or set variables with random names.
If not, the memory leak may be in our wrapper or in SpiderMonkey (SM) itself.

Following the test in SERVER-3012, I increased the SpiderMonkey memory limit to 64MB in trunk.
This should buy you 8x the time before the error occurs.
If you want to extend it further, you can increase it in the source code and recompile.

In engine_spidermonkey.cpp look for:
_runtime = JS_NewRuntime(64L * 1024L * 1024L);
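
For reference, the 128MB build Paul mentions above would presumably amount to nothing more than changing that constant, along these lines (sketch only; the surrounding code is omitted, not quoted from the source):

// engine_spidermonkey.cpp
// was: _runtime = JS_NewRuntime(64L * 1024L * 1024L);   // 64MB SpiderMonkey runtime
_runtime = JS_NewRuntime(128L * 1024L * 1024L);          // 128MB, as in Paul's local build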

Comment by Paul Harvey [ 24/May/11 ]

SERVER-3012 at least has a reproducible test case. Perhaps fixing that bug will make this one disappear too... wishful thinking?

Comment by Paul Harvey [ 23/May/11 ]

I should clarify that we have three replica set members, each on its own separate VM (4GB RAM). They are in the same DC.

Also, although the 022b log is 270KiB, it covers only about 40 seconds of elapsed time, not "several minutes" as I said in the initial description. -vvvvv is extreme logging!
