[SERVER-20741] Primary crash after N hours of running as primary Created: 02/Oct/15  Updated: 08/Jan/24  Resolved: 12/Oct/15

Status: Closed
Project: Core Server
Component/s: JavaScript
Affects Version/s: 3.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Julien Durillon Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File crash.log     File full-crash.log     File mapReduce-crash.log     Text File mongod-ldd.txt     File mongodb-build-server.log     File primary-crash.log    
Operating System: ALL
Participants:

 Description   

I manage a sharded cluster for my company. That cluster is used by clients as a free cluster: they provision a db and can use it (with some limitations) with their applications.

I moved from 2.6 to 3.0.6 a week ago (on Thursday 2015-09-24), and ever since I have seen this strange behavior: after being elected primary, a node lasts a few hours (between 2 and 5) and then crashes.
The crash is a segmentation fault.

We have systemd restarting the node automatically. In the meantime, a new node is elected primary, runs for a few more hours, then crashes; another one is elected primary, and so on.

The cluster is composed of 3 config servers, 3 mongos, and 5 mongod all within a single RS and handling a single shard.
The 5 mongod are 2 arbiters and 3 data nodes.
The 3 data nodes are 1 MMAPv1 and 2 wiredTiger.

All 3 data nodes crash a few hours after being elected master.

I attached the log of a primary starting 30 seconds before the segfault happens.

/sys/kernel/mm/transparent_hugepage/defrag does not exist on 2 of the 3 servers, and I set it to "never" on the third one.



 Comments   
Comment by Ramon Fernandez Marina [ 29/Jan/16 ]

For the record, SERVER-22334 shows how a missing "var" keyword in the JS code could trigger this issue. MongoDB 3.2 uses SpiderMonkey as the JavaScript engine and it handles this case better than V8.
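
As a generic illustration of what a missing "var" does in JavaScript (this is not the code from SERVER-22334, which is not reproduced here), the sketch below compares two hypothetical map-style functions. The names leakyMap, scopedMap, mrCount and mrSize are made up for the example. The functions are built with the Function constructor only so that their bodies run in non-strict mode wherever the snippet is loaded.

```javascript
// Hypothetical map-style functions; names and document shape are assumptions.
// Function-constructor bodies are non-strict, so an undeclared assignment
// creates a property on the global object instead of throwing.
const leakyMap = new Function("doc", `
  mrCount = doc.items.length;      // no "var": becomes a global named mrCount
  return { key: doc.owner, value: mrCount };
`);

const scopedMap = new Function("doc", `
  var mrSize = doc.items.length;   // "var": local to this call only
  return { key: doc.owner, value: mrSize };
`);

const sampleDoc = { owner: "judu", items: [1, 2, 3] };

const leakyResult = leakyMap(sampleDoc);
const scopedResult = scopedMap(sampleDoc);

// Check which names escaped into the global scope.
const hasOwn = (name) => Object.prototype.hasOwnProperty.call(globalThis, name);
const leakedGlobal = hasOwn("mrCount");
const scopedGlobal = hasOwn("mrSize");

console.log({ leakyResult, scopedResult, leakedGlobal, scopedGlobal });
```

Both functions return the same value; the difference is that the first one leaves state behind in the shared global scope, which is the kind of side effect a JavaScript engine embedded in a long-running server has to cope with.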

Comment by Ramon Fernandez Marina [ 12/Oct/15 ]

Thanks for the additional information, judu. Since the issue you describe does not point to a bug in the server, I'm going to close this ticket.

If you need assistance building MongoDB from sources you can post in the mongodb-dev group; please make sure to provide information about the version of the tools and libraries you're using. In particular, we don't yet support compiling with gcc 4.9 or older.

For user support discussions please post on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Comment by Julien Durillon [ 12/Oct/15 ]

Ok, so here is the build log of my currently running instances, which are still encountering the same crash stacktrace.
Built against libboost 1.59.0.
Note: we built 2.6 ourselves too without a problem. The v8 used is the one provided by the source, not a system one.

Comment by Julien Durillon [ 09/Oct/15 ]

Build log of mongodb.

Comment by Julien Durillon [ 08/Oct/15 ]

Sorry, I'm testing something about the build, so build log is coming. I'm not forgetting!

Comment by Julien Durillon [ 05/Oct/15 ]

Crash log for conn1606 with all the logs. (Same as crash.log, but with all the logs from all the connections in case I erased a bit too much in crash.log.)

Comment by Julien Durillon [ 05/Oct/15 ]

Crash log with all (and only) the [conn1606] in it.
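
For readers who want to produce a similar per-connection extract, here is a minimal sketch, assuming the log is available as a string and that each line carries its connection tag in square brackets, as in the lines quoted elsewhere in this ticket. The function name filterByConnection is hypothetical, and the conn1606 sample line is invented for the example.

```javascript
// Keep only the log lines tagged with a given connection, e.g. "[conn447]".
// Matching on the bracketed tag avoids catching "[conn4471]" by accident.
function filterByConnection(logText, connId) {
  const tag = "[" + connId + "]";
  return logText
    .split("\n")
    .filter((line) => line.includes(tag));
}

// Two lines in the format quoted in this ticket, plus one invented line
// for another connection (its message text is a placeholder).
const sampleLog = [
  "2015-10-02T18:44:55.505+0000 F -        [conn447] Invalid access at address: 0",
  "2015-10-02T18:44:55.500+0000 I COMMAND  [conn1606] placeholder message",
  "2015-10-02T18:44:55.521+0000 F -        [conn447] Got signal: 11 (Segmentation fault).",
].join("\n");

const conn447Lines = filterByConnection(sampleLog, "conn447");
console.log(conn447Lines.join("\n"));
```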

Comment by Ramon Fernandez Marina [ 05/Oct/15 ]

judu, can you please send a longer part of the log? In particular I'm looking for more details about conn447, which is the one involved in the segfault:

2015-10-02T18:44:55.505+0000 F -        [conn447] Invalid access at address: 0
2015-10-02T18:44:55.521+0000 F -        [conn447] Got signal: 11 (Segmentation fault).

Also, how did you install this mongod instance? Did you use a package manager or did you build it from sources? If the latter, can you please send the command line used to build it?

Thanks,
Ramón.

Comment by Julien Durillon [ 02/Oct/15 ]

Thanks for your quick answer.

Ok, first, we use neither SELinux nor grsecurity.
I took the example systemd service file from the mongodb doc. With:
LimitFSIZE=infinity
LimitCPU=infinity
LimitAS=infinity
LimitNOFILE=64000
LimitNPROC=64000

So, I'm attaching the result of `ldd mongod` in case you can find anything strange in it.

In the log I attached, there is a mapReduce at the beginning. It's the last operation of that kind.

I ran that mapReduce again, and managed to crash the primary node by doing so. I included the log starting at the first mapReduce command.

Comment by Ramon Fernandez Marina [ 02/Oct/15 ]

Sorry you've run into this, judu. The crash shows the problem is happening inside the V8 engine:

 mongod(_ZN2v82V837AdjustAmountOfExternalAllocatedMemoryEl+0x16) [0x11ca146]

We've seen similar issues in V8 when the machine is configured with SELinux, grsecurity, or imposes other limitations that affect V8's memory management. Can you please elaborate on the configuration for the affected node? Also, can you provide details of what operations this node was running when it crashed? I'm looking for javascript-related operations like using $where or mapReduce.

Thanks,
Ramón.
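
The mangled frame quoted above can be read by hand: under the Itanium C++ ABI, a nested name is encoded as `_ZN`, followed by length-prefixed components, followed by `E`. The sketch below decodes only that simple case (no templates, substitutions, or argument types); demangleNestedName is a hypothetical helper written for this example, and a real tool such as c++filt should be preferred for anything beyond it.

```javascript
// Decode a simple Itanium-mangled nested name: _ZN <len><name>... E
// Returns null for anything outside that narrow subset.
function demangleNestedName(symbol) {
  if (!symbol.startsWith("_ZN")) return null;
  const parts = [];
  let i = 3;
  while (i < symbol.length && symbol[i] !== "E") {
    const m = /^\d+/.exec(symbol.slice(i));   // decimal length prefix
    if (!m) return null;
    const len = Number(m[0]);
    i += m[0].length;
    if (i + len > symbol.length) return null; // truncated component
    parts.push(symbol.slice(i, i + len));
    i += len;
  }
  if (symbol[i] !== "E") return null;         // missing terminator
  return parts.join("::");
}

// Symbol taken from the stack frame in this ticket.
const frame = "_ZN2v82V837AdjustAmountOfExternalAllocatedMemoryEl";
const decoded = demangleNestedName(frame);
console.log(decoded);
```

The trailing `l` after `E` encodes a single parameter of type long, which this sketch ignores.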
