[SERVER-3927] killing clients of a loaded Mongo 1.8.3 causing seg fault
Created: 22/Sep/11  Updated: 29/Feb/12  Resolved: 30/Dec/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.8.1, 1.8.3
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Brett Kiefer
Assignee: Spencer Brody (Inactive)
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

FreeBSD trellisfc1.hq.fogcreek.com 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Thu Feb 17 02:41:51 UTC 2011 root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC amd64


Operating System: ALL

 Description   

This is the easiest way I have found to reproduce the issue we are seeing in production:

Repro:

  • With a default ports build of MongoDB 1.8.3 on FreeBSD 8.2 amd64, create a database with about 5 GB of data.
  • Load it up with continuous mega-inefficient text-search regex queries from 12 or so node.js clients
  • Let it run for a few minutes
  • Start killing and restarting your clients en masse
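
The kill-and-restart step above can be scripted. What follows is a minimal sketch, not the harness actually used for this ticket: `churn_clients` and its parameters are hypothetical names, and the client command is whatever process issues the regex queries against the server. It starts a batch of clients, lets them run, kills them all at once so that their sockets drop without a clean close, and repeats.

```python
import subprocess
import time


def churn_clients(client_cmd, count=12, run_seconds=180.0, cycles=5):
    """Start `count` client processes, let them run, kill them all, repeat.

    `client_cmd` is an argv list for one client process (an assumption:
    the real clients in this ticket were node.js scripts).
    Returns the total number of client processes started and reaped.
    """
    reaped = 0
    for _ in range(cycles):
        # Start the whole batch of clients for this cycle.
        procs = [subprocess.Popen(client_cmd) for _ in range(count)]
        # Give the clients time to load the server with queries.
        time.sleep(run_seconds)
        # Kill en masse; on POSIX this sends SIGKILL, so the client
        # cannot shut its connection down cleanly.
        for p in procs:
            p.kill()
        # Reap every child before starting the next batch.
        for p in procs:
            p.wait()
            reaped += 1
    return reaped
```

A call such as `churn_clients(["node", "regex_client.js"])` would approximate the steps above, where `regex_client.js` is a placeholder for the real query script.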

Expected:
Database stays up

Observed:
Seg fault, DB shuts down.

Thu Sep 22 11:29:29 [conn109] SocketException in connThread, closing client connection
Thu Sep 22 11:29:30 Invalid access at address: 0x5e8067

Thu Sep 22 11:29:30 Got signal: 11 (Segmentation fault: 11).

Thu Sep 22 11:29:30 Backtrace:

Thu Sep 22 11:29:30 dbexit:

We've seen this three times in production, and I have reproduced it on a test server. I'll try to get a build that will give me a backtrace.



 Comments   
Comment by Spencer Brody (Inactive) [ 30/Dec/11 ]

I'm going to resolve this ticket due to lack of activity. If this is still a problem please re-open the ticket.

Comment by Spencer Brody (Inactive) [ 28/Nov/11 ]

Hey Tim,
Any update on this? Are you still having this problem?

Comment by Eliot Horowitz (Inactive) [ 30/Sep/11 ]

Does the process crash every time you do a map/reduce or eval, or just sometimes?

Access would be great.

Comment by Tim Stewart [ 28/Sep/11 ]

Eliot, could you rephrase your question about the JS crash? I don't understand what you're asking.

I'm sure we could arrange for you to have access to the box. Let me know if this is something you'd like to do.

Comment by Eliot Horowitz (Inactive) [ 27/Sep/11 ]

Does all JS crash or just sometimes?
Can we get access to the box?

Comment by Tim Stewart [ 26/Sep/11 ]

Oh, and /usr/local/lib/libjs.so above is Spidermonkey 1.7.0 from FreeBSD ports.

Comment by Tim Stewart [ 26/Sep/11 ]

Hello, I work with Brett K.

I did some digging to figure out why we had no backtrace. It appears that FreeBSD's libexecinfo port is returning 0 frames for the backtrace. So, no output above.

I reproduced the error within GDB and got the following output (look for "info frame", "info threads", and "bt"):

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 8020095c0 (LWP 100153)]
0x000000080149b031 in JS_DHashTableFinish () from /usr/local/lib/libjs.so
(gdb) 
(gdb) 
(gdb) info frame
Stack level 0, frame at 0x7ffffdff0a70:
 rip = 0x80149b031 in JS_DHashTableFinish; saved rip 0x8014b6787
 called by frame at 0x7ffffdff0c00
 Arglist at 0x7ffffdff0a38, args: 
 Locals at 0x7ffffdff0a38, Previous frame's sp is 0x7ffffdff0a70
 Saved registers:
  rbx at 0x7ffffdff0a40, rbp at 0x7ffffdff0a48, r12 at 0x7ffffdff0a50, r13 at 0x7ffffdff0a58, r14 at 0x7ffffdff0a60,
  rip at 0x7ffffdff0a68
(gdb) info threads
  56 Thread 802009b00 (LWP 100194)  0x00000000005d6a27 in mongo::Matcher::matchesDotted (this=0x901139ac0, 
    fieldName=0x900b3d005 "_id", toMatch=@0x9010e9e00, obj=@0x7ffffe5f2490, compareOp=12, em=@0x9010e9e00, isArr=false, 
    details=0x8ffee08b8) at db/matcher.cpp:612
* 49 Thread 8020095c0 (LWP 100153)  0x000000080149b031 in JS_DHashTableFinish () from /usr/local/lib/libjs.so
  10 Thread 80200a200 (LWP 100084)  0x0000000801b4ef7c in recvfrom () from /lib/libc.so.7
  9 Thread 80200a3c0 (LWP 100082)  0x0000000801b4ef7c in recvfrom () from /lib/libc.so.7
  8 Thread 80200a580 (LWP 100078)  0x0000000801b9ff6c in select () from /lib/libc.so.7
  7 Thread 80200a740 (LWP 100076)  0x0000000801b851ac in nanosleep () from /lib/libc.so.7
  6 Thread 80200a900 (LWP 100075)  0x0000000801b851ac in nanosleep () from /lib/libc.so.7
  5 Thread 80200aac0 (LWP 100073)  0x00000008019a43cc in __error () from /lib/libthr.so.3
  4 Thread 80200ac80 (LWP 100069)  0x0000000801b851ac in nanosleep () from /lib/libc.so.7
  3 Thread 80200ae40 (LWP 100067)  0x0000000801b0015c in sigwait () from /lib/libc.so.7
  2 Thread 8020041c0 (LWP 100106)  0x0000000801b9ff6c in select () from /lib/libc.so.7
(gdb) bt
#0  0x000000080149b031 in JS_DHashTableFinish () from /usr/local/lib/libjs.so
#1  0x00000008014b6787 in js_GC () from /usr/local/lib/libjs.so
#2  0x000000080149365c in js_DestroyContext () from /usr/local/lib/libjs.so
#3  0x000000000061a5ff in ~SMScope (this=0x8028d1400) at scripting/engine_spidermonkey.cpp:1179
#4  0x00000000005fb866 in mongo::ScriptEngine::threadDone (this=Variable "this" is not available.
) at scripting/engine.cpp:294
#5  0x00000000008ba111 in mongo::connThread (inPort=0x802824300) at db/db.cpp:329
#6  0x0000000800e83187 in thread_proxy () from /usr/local/lib/libboost_thread.so
#7  0x000000080199a4f1 in pthread_getprio () from /lib/libthr.so.3
#8  0x00007ffffddf1000 in ?? ()
Error accessing memory address 0x7ffffdff1000: Bad address.

It appears the crash itself is in JS_DHashTableFinish inside of libjs.so.

Does this ring any bells?

I can provide more output from GDB if necessary.

Comment by Brett Kiefer [ 26/Sep/11 ]

Okay, we rebuilt in devel mode, but we're still not getting stack traces. Same as before, blank backtraces. Any idea how we can get a stack trace?

Thu Sep 22 11:56:55 Backtrace:

Thu Sep 22 11:56:55 dbexit:
Thu Sep 22 11:56:55 [conn17] shutdown: going to close listening sockets...
Thu Sep 22 11:56:55 [conn17] closing listening socket: 4
Thu Sep 22 11:56:55 [conn17] closing listening socket: 5
Thu Sep 22 11:56:55 [conn17] closing listening socket: 6
Thu Sep 22 11:56:55 [conn17] closing listening socket: 7
Thu Sep 22 11:56:55 [conn17] removing socket file: /tmp/mongodb-27017.sock
Thu Sep 22 11:56:55 [conn17] removing socket file: /tmp/mongodb-28017.sock
Thu Sep 22 11:56:55 [conn17] shutdown: going to flush diaglog...
Thu Sep 22 11:56:55 [conn17] shutdown: going to close sockets...
Thu Sep 22 11:56:55 [conn17] shutdown: waiting for fs preallocator...
Thu Sep 22 11:56:55 [conn17] shutdown: closing all files...
Thu Sep 22 11:56:55 Invalid access at address: 0x86bdb2246

Thu Sep 22 11:56:55 Got signal: 11 (Segmentation fault: 11).

Thu Sep 22 11:56:55 Backtrace:

Thu Sep 22 11:56:55 dbexit: ; exiting immediately
Thu Sep 22 11:56:55 Invalid access at address: 0x86bdb2246

Thu Sep 22 11:56:55 Got signal: 11 (Segmentation fault: 11).

Thu Sep 22 11:56:55 Backtrace:

Thu Sep 22 11:56:55 Invalid access at address: 0x86bdb2246

Thu Sep 22 11:56:55 Got signal: 11 (Segmentation fault: 11).

Thu Sep 22 11:56:55 Backtrace:

Thu Sep 22 11:56:55 Invalid access at address: 0x86bdb2246

Thu Sep 22 11:56:55 Got signal: 11 (Segmentation fault: 11).

Thu Sep 22 11:56:55 Backtrace:

Thu Sep 22 11:56:55 closeAllFiles() finished
Thu Sep 22 11:56:55 [conn17] shutdown: removing fs lock...
Thu Sep 22 11:56:55 dbexit: really exiting now
Thu Sep 22 11:56:55 Invalid access at address: 0x86bdb2246

Thu Sep 22 11:56:55 Got signal: 11 (Segmentation fault: 11).

Thu Sep 22 11:56:55 Backtrace:

Thu Sep 22 11:56:55 ERROR: Client::~Client _context should be null but is not; client:conn
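
To check a long log for this pattern quickly, a small scanner can count the signal reports and the "Backtrace:" headers that have no frames after them. This is only a sketch under stated assumptions: `summarize_crash_log` is a hypothetical helper, the timestamp pattern is inferred from the lines quoted in this ticket, and any non-blank line without a timestamp prefix after "Backtrace:" is assumed to be a frame.

```python
import re

# Timestamp prefix as seen in this ticket's log lines, e.g. "Thu Sep 22 11:56:55 ".
TS = re.compile(r"^[A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} ")


def summarize_crash_log(text):
    """Count signal reports, backtrace headers, and backtraces with no frames."""
    signals = 0
    backtraces = 0
    empty_backtraces = 0
    lines = text.splitlines()
    for i, line in enumerate(lines):
        body = TS.sub("", line, count=1)
        if body.startswith("Got signal:"):
            signals += 1
        elif body.strip() == "Backtrace:":
            backtraces += 1
            frames = 0
            for following in lines[i + 1:]:
                if not following.strip():
                    continue  # blank lines separate log entries
                if TS.match(following):
                    break  # next timestamped entry: the backtrace is over
                frames += 1  # assumption: untimestamped lines are frames
            if frames == 0:
                empty_backtraces += 1
    return {
        "signals": signals,
        "backtraces": backtraces,
        "empty_backtraces": empty_backtraces,
    }
```

Reading the mongod log file into a string and passing it to `summarize_crash_log` shows whether every crash report lacks frames, which is what an unwinder that returns nothing would produce.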

Comment by Brett Kiefer [ 26/Sep/11 ]

We can try to repro the issue on Linux, but I don't know that it would tell us anything new about the problem - if it did seg fault, we'd be in the same place, and if it didn't, we'd have to figure that memory allocation differences on Linux are saving it.

Or are you saying that right now it is generally foolish to run MongoDB on FreeBSD in production? We had another Trello outage this morning because MongoDB was not responding to queries, even though the service was up - we're getting the specifics right now. Restarting Mongo fixed it, but if MongoDB on FreeBSD just isn't stable yet, we will certainly think about moving.

Comment by Eliot Horowitz (Inactive) [ 23/Sep/11 ]

We've seen various odd things with FreeBSD to date, and it is not a platform we fully test on yet.
Any chance of trying on Linux?
