[SERVER-10888] rsHealthPoll causing segmentation fault Created: 25/Sep/13 Updated: 11/Jul/16 Resolved: 18/Nov/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Craig W | Assignee: | Andreas Nilsson |
| Resolution: | Done | Votes: | 3 |
| Labels: | crash, replicaset, segfault, ssl | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
RHEL 5.9 |
||
| Operating System: | Linux |
| Participants: |
| Description |
|
I have a 3 node replica-set configured using MongoDB 2.4.5 with SSL. In the past two months of having this configuration, I've observed two of the three mongod instances seg faulting on occassion. It's not always the same two servers and it sometimes happens every few days but other times we've gone 2 weeks without an issue. In the latest crash, db3 was primary and db2 and db1 were secondary. In this event, db3 and db2 crashed but db1 continued to run (although it's replication status was "secondary"). In the db3 log (I replaced host:port with db2) when it crashed I see: Wed Sep 25 05:31:59.983 [rsHealthPoll] couldn't connect to db2 Wed Sep 25 05:32:00.060 Got signal: 11 (Segmentation fault). Wed Sep 25 05:32:00.062 [rsHealthPoll] couldn't connect to db2 |
| Comments |
| Comment by Andreas Nilsson [ 18/Nov/13 ] | ||||||||||||||||||||||||||||||||||||
|
Excellent, then I am setting the ticket to resolved. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 18/Nov/13 ] | ||||||||||||||||||||||||||||||||||||
|
I think this issue can be closed out. Our 3 node replica set has been running for 2 weeks w/o any additional problems. Thanks for your help. | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Nilsson [ 04/Nov/13 ] | ||||||||||||||||||||||||||||||||||||
|
codecraig I'm not sure that this is related to the issue in this ticket. Maybe you should file a support ticket for it? As for this actual problem in addition to upgrading SSL we also made a code change in MongoDB that will mitigate the problem. It has been committed to the current master branch and will be backported for 2.4.9. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 01/Nov/13 ] | ||||||||||||||||||||||||||||||||||||
|
All 3 nodes, in a replica set, are now using mongo 2.4.5 with openssl 1.0.1e. However, one node (happens to be the CentOS 5.5 server) is dropping out (appears DOWN), in the logs I see this:
It periodically seems to connect again, right now it's back to being a SECONDARY. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 01/Nov/13 ] | ||||||||||||||||||||||||||||||||||||
|
I got a little further with trying to use mongo-2.4.5 that I compiled on a RHEL 5.9 VM on a CentOS 5.5 VM. I symlinked /usr/local/ssl/lib/libssl.so.1.0.0 and /usr/local/ssl/lib/libcrypto.so.1.0.0 to: /usr/lib Now mongo finds libssl and it's running. I'll update our 3rd node in this replica set and keep an eye on it for the next couple of weeks. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 01/Nov/13 ] | ||||||||||||||||||||||||||||||||||||
|
So I tried to take the mongo-2.4.5 version that I compiled on RHEL 5.9 with openssl 1.0.1e and copy it over to a Cent OS 5.5. However, when I try to start it on the CentOS box it complains about not being able to find libssl.so.1.0.0 even though it does exist and I've symlinked it in the same places as I did on the RHEL 5.9 VM. Any ideas? Should I have to recompile mongo on the CentOS box? | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 31/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
Ok, build is finally complete! I just started up this new compiled version of mongo and joined it to our existing replica set with SSL enabled. I re-ran "lsof -c mongod" and it's using libssl.so.1.0.0 I might give this instance a few days before upgrading the other two, I definitely don't want to accidentally lose data. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 31/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
After some googling, I gave this a shot:
Now when I try to start mongo I get:
I'm guessing this is happening b/c I build mongo on different VM which may have had different glibc versions. So I'm not trying to compile it on the machine that I'd like to run mongodb on (using the same steps outlined earlier), however, when I do the "scons --ssl ..." step, it's complaining that it can find libssl.so.1.0.0, which does exist in /usr/local/ssl/lib. I've now symlinked libssl and libcrypto using:
The build is currently running now....fingers crossed. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 31/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
I compiled MongoDB 2.4.5 (source) using the following steps: Install openssl
Install Python
Install SCons
Compile MongoDBIt might be worth mentioning to compile I needed around 12GB of free disk space. Also the options I passed to scons were based on this stackoverflow comment from a MongoDB employee.
Run MongoDBI tar'ed up the binaries (mongo, mongod, mongoimport, etc) and copied them to a RedHat 5.9 VM. I did the following to setup this new mongo instance.
The conf file looks like this:
I then tried to start mongo using: > /opt/mongodb-2.4.5/bin/mongod --config /opt/mongodb-2.4.5/conf/mongodb.conf --dbpath /opt/mongodb-2.4.5/db but I get: /opt/mongodb-2.4.5/bin/mongod: error while loading shared libraries: libssl.so.1.0.0: cannot open shared object file: No such file or directory Now, the VM I'm trying to run this mongo on has openssl 1.0.1e installed. The libssl.so.1.0.0 exists in /usr/local/ssl/lib. Is mongo looking for the openssl library to be somewhere else? When I compiled Mongo the libssl.so.1.0.0 was in /usr/local/ssl/lib as well. Any ideas? | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Nilsson [ 29/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
Ok, we'll keep this one open for now. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 29/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
I hope to try recompiling this week, I'll report back. | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Nilsson [ 24/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
So from my perspective this is good news and we might still have an immediate fix for your issues. If this turns out to be OpenSSL-related only we will try to bisect the actual bug fix and ask Red Hat to support an OpenSSL version with this fix but that will be a long process. You can link it statically if you want but it is not something we typically recommend doing. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 24/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
So the output I get shows: /lib64/libssl.so.0.9.8e I suppose we'd have to recompile mongo with it linked to 1.0.1e? Are we able to compile it statically, so we can transfer the compiled build to each node instead of having to compile multiple times? (I ask b/c I wasn't the one who initially compiled it, and that person isn't available to ask). | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Nilsson [ 24/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
Ok that is interesting. Our tests have been running for 24 x the reproduction time with 1.0.1e without getting any segfaults. Since RHEL does come with a preinstalled OpenSSL lib can you run the following command to verify that mongod is actually using the new version: > lsof -c mongod look for the entries libssl and libcrypto in the output. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 24/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
We're using "OpenSSL 1.0.1e 11 Feb 2013" which is the latest available version from openssl.org. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 23/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
I'll double check tomorrow and get back to you, thanks. | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Nilsson [ 23/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
codecraig we currently have a theory that this might be related to the OpenSSL library and not necessarily MongoDB. We had a similar support case at another client where upgrading OpenSSL seems to fix the issue. Which version of OpenSSL are you using? | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 23/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
Today one of the nodes in the replica set (currently a secondary) is having issues staying connected with the replica set, or at least the other members think its down, but then it'll connect and will be in "STARTUP 2" status again. In the logs on this node that's disconnecting I see: [rsHealthPoll] replset info <host1>:27017 thinks that we are down host1 is the primary, host2 is the other secondary. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 23/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
I can provide some of the information, but I can't provide the log files. The cluster consists of 3 RHEL 5.9 VM's each with a single dual-core CPU and 4GB RAM. The mongod is version 2.4.5 (built from the 2.4.5 source tarball) The config file (same on each DB server):
| ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Nilsson [ 22/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
codecraig I agree it does look like an SSL-related issue and we are investigating this. We are setting up a cluster to try to reproduce the error. Since we don't seem to get a usable output from addr2line can you provide the following information:
Thank you for your assistance so far. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 22/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
It happened again overnight. We're starting to lean towards this being an SSL related issue of some kind since our replica set never crashed in the past when not using SSL. We'll have to try in the next few weeks. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 21/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
For what it's worth, this problem happened again this weekend, 2 of 3 nodes died from the segfault. I ran the addr2line again (just in case I might somehow get different ooutput) but I still just got "??:0" over and over. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 16/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
I forgot the path in the initial comment. | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Nilsson [ 16/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
So did that generate a different output or did you just miss the path it in the ticket comment? | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 16/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
I did, sorry. addr2line -i -e /opt/mongodb/bin/mongod ..... | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Nilsson [ 16/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
codecraig did you specify the binary path in the call to addr2line? It looks like it is missing. | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 16/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
Andreas: We did build the binary ourselves (to enable ssl support), unfortunately addr2line doesn't provide anything useful: > addr2line -i -e 0xde1c11 0x6d2c29 0x6d31b2 0x2b02176c7ca0 .... Any other ideas or things you'd like me to try? | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Nilsson [ 16/Oct/13 ] | ||||||||||||||||||||||||||||||||||||
|
codecraig, sorry for the delay in getting back to you. In order for us to troubleshoot, can you please provide the following information:
If you built the binary yourself you can use addr2line to extract the necessary information. You can find more information at http://www.mongodb.org/about/contributors/parsing-stack-traces/ Thank, | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 25/Sep/13 ] | ||||||||||||||||||||||||||||||||||||
|
There were some "relinquish" messages in the db3 logs (which was the primary node), but those messages were from September 23. Earlier in the day on September 24 I confirmed that db3 was primary and db2 and db1 were secondaries, so the relinquish was a while before that. | ||||||||||||||||||||||||||||||||||||
| Comment by Andreas Nilsson [ 25/Sep/13 ] | ||||||||||||||||||||||||||||||||||||
|
I think it could be related to | ||||||||||||||||||||||||||||||||||||
| Comment by Craig W [ 25/Sep/13 ] | ||||||||||||||||||||||||||||||||||||
|
I forgot to show what I saw in db2's log when it segfaulted: Tue Sep 24 18:08:58.481 Invalid access at address: 0x4ad0 from thread: rsHealthPoll Tue Sep 24 18:08:58.525 Got signal: 11 (Segmentation fault). Tue Sep 24 18:08:58.838 Backtrace: One glaring observation is that db2 segfaulted September 24 at 2:08PM (EST) and db3 didn't segfaulted until nearly 12 hours later (September 25 ~1AM (EST)). |