[SERVER-9052] During failover in replicaset, MongoDB crashes Created: 21/Mar/13  Updated: 11/Jul/16  Resolved: 26/Mar/13

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.2.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dave Claussen Assignee: Stephen Lee
Resolution: Done Votes: 0
Labels: crash, linux, replicaset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Unix Distribution: CentOS release 5.8 (Final),

Driver: mongo-java-driver:2.9.1

Build: Compiled from source, with this command line:

  scons install -j 9 --64 --ssl --prefix=/tmp/mongodb-linux-2.2.3-x86_64

OpenSSL version: 0.9.8e-22.el5_8.4

Deployment: one primary instance, one secondary, one arbiter.

build info: Linux 12.servername.com 2.6.18-194.3.1.el5.028stab069.6 #1 SMP Tue Aug 10 21:28:51 GMT 2010 x86_64 BOOST_LIB_VERSION=1_49


Attachments: Text File mongo_crash_03_20_primary.log     Text File mongo_crash_03_20_secondary.log    
Operating System: Linux
Steps To Reproduce:

Not sure exactly, but my guess is to run one primary, one secondary, and an arbiter, and then cause the secondary to die. Hopefully the logs will give you enough information to determine what went wrong.

Participants:

 Description   

We recently upgraded MongoDB from 2.2.2 to 2.2.3. After a couple of days of running, the arbiter could no longer contact the primary, and the secondary was elected to take over. At that point, however, MongoDB stopped responding and wrote a large number of errors to the log.

(See attachments for the logs)

The primary has these types of errors:

problem detected during query over (DBNAME).(COLLECTION_NAME) :

{ $err: "not master and slaveOk=false", code: 13435 }

[rsMgr] replSet can't see a majority, will not try to elect self

recv(): message len XXX is too largeXX

Assertion: 16141:cannot translate opcode 26975

... but see the log for more details.
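
For context on the rsMgr message above: a replica set member only tries to elect itself when it can see a strict majority of the voting members. The following is a minimal sketch of that arithmetic, assuming the three voting members described under Environment (primary, secondary, arbiter) and assuming a member counts itself among the members it can see. The class and method names are hypothetical and are not MongoDB or driver code.

```java
public class MajorityCheck {
    // Smallest number of votes that forms a strict majority of the voting members.
    static int majority(int votingMembers) {
        return votingMembers / 2 + 1;
    }

    // True when a member that can see `visibleVotes` voting members
    // (itself included, by assumption) reaches a strict majority.
    static boolean canSeeMajority(int visibleVotes, int votingMembers) {
        return visibleVotes >= majority(votingMembers);
    }

    public static void main(String[] args) {
        int votingMembers = 3; // primary + secondary + arbiter, as in this deployment
        System.out.println(majority(votingMembers));          // 2
        System.out.println(canSeeMajority(1, votingMembers)); // false: an isolated member
        System.out.println(canSeeMajority(2, votingMembers)); // true: e.g. secondary + arbiter
    }
}
```

Under these assumptions, a member that is cut off from both of the others sees only its own vote, which would be consistent with the "can't see a majority" line, while the secondary and arbiter together still reach the required two votes.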

The Java application server running on the same box as the primary MongoDB instance (o16.servername.com) uses ReadPreference.primaryPreferred() for its queries.

The Java application server running on the same box as the secondary MongoDB instance (o15.servername.com) uses ReadPreference.nearest() for its queries, since it's physically across the country from the primary.

P.S. after this happened, I updated my Java driver to mongo-java-driver:2.10.1 and am considering rebuilding using the latest openssl build (0.9.8e-26.el5_9.1).



 Comments   
Comment by Stephen Lee [ 26/Mar/13 ]

No, aside from your use of OpenVZ, the only other major concern was JAVA-660, which you've rectified by upgrading the Java driver. Let us know if you run into any other issues!

Comment by Dave Claussen [ 26/Mar/13 ]

No issues since then, but I did find out that our Verio server is, in fact, using OpenVZ, despite the fact that we have a dedicated server. So the message "You are running in OpenVZ. This is known to be broken!!!" that I've been ignoring in the Mongo logs is in fact correct. =)

We're switching our infrastructure to a different platform, so unless there's anything obvious in the logs you can close this ticket.

-D

Comment by Stephen Lee [ 26/Mar/13 ]

Have you noticed any issues since upgrading your Java driver? JAVA-660 could cause corruption, which might account for those error messages.

Comment by Dave Claussen [ 25/Mar/13 ]

Stephen,

I've upgraded the Java driver to mongo-java-driver:2.10.1, and openssl to 0.9.8e-26.el5_9.1.

Comment by Stephen Lee [ 25/Mar/13 ]

Dave, I would strongly recommend you upgrade your Java driver to v2.9.3 or later, due to JAVA-660.
