[SERVER-3439] mongod complains about too much data uncommitted, crashes with assertionFailure, then fails to start
Created: 18/Jul/11  Updated: 12/Jul/16  Resolved: 03/Aug/11

Status: Closed
Project: Core Server
Component/s: Replication, Stability
Affects Version/s: 1.8.2
Fix Version/s: 1.9.2

Type: Bug
Priority: Critical - P2
Reporter: Mike K
Assignee: Dwight Merriman
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu EC2 Linux on m2.4xlarge


Attachments: mongocrash.log, mongocrash2.log, repair-crash.txt
Issue Links:
Duplicate
duplicates SERVER-3447 journal exception during large operat... Closed
Operating System: ALL
Participants:

 Description   

The Primary of one of our replica sets suddenly died with the attached stack trace (mongocrash.log) and now refuses to start, throwing stack traces like the one in mongocrash2.log.



 Comments   
Comment by Dwight Merriman [ 03/Aug/11 ]

Hopefully fixed via a couple of changes, including this one:
https://github.com/mongodb/mongo/commit/3b4b0026cd2e018346bf068109ba0a9368a07eeb

Comment by Dwight Merriman [ 20/Jul/11 ]

OK, I have a theory as to the repair crash; will be trying some things.

Comment by Dwight Merriman [ 18/Jul/11 ]

Email me re: looking at the volume. Thanks. dwight@10gen.com

Comment by Mike K [ 18/Jul/11 ]

Also, in terms of commands etc., nothing different (as far as we can tell) was happening; this system has been running for about 2 months without any major changes.

Comment by Mike K [ 18/Jul/11 ]

The cluster failed over to the secondary, but about 5 minutes ago the secondary failed too (unfortunately I only have the bottom of the stack trace):

/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7ff67d0afd05]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x186) [0x7ff67d0b3ab6]
/lib/x86_64-linux-gnu/libc.so.6(__assert_fail+0xf5) [0x7ff67d0a87c5]
mongod(_ZN5mongo8SpinLock4lockEv+0xa5) [0x56fa65]
mongod(_ZN5mongo14NetworkCounter3hitExx+0x83) [0x600163]
mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x783) [0x8a9533]
mongod(thread_proxy+0x80) [0x8bb630]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6d8c) [0x7ff67dbb7d8c]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ff67d16204d]

Mon Jul 18 17:20:10 [LockPinger] warning: dist_lock process: ip-10-78-233-4:1311001137:135267863 pinging: config-social1:20000,config-social2:20000,config-social3:20000 failed: Invalid BSONObj size: 0 (0x00000000) first element: EOO

At this point the secondary (now the primary) wouldn't start up at all, until I removed the local files. Then, it would start, but a few minutes later gave me this:

social1:PRIMARY> db.oplog.rs.find()
error: {
"$err" : "Invalid BSONObj size: 604176384 (0x00000324) first element: ddToSet: ?type=97",
"code" : 10334
}
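For reference, the decimal and hex values in that error look like two views of the same four bytes. BSON documents begin with a little-endian int32 length, and the sketch below assumes (inferred only from the numbers matching, not from the server source) that the hex in the message is those bytes in memory order. The helper name `describe_bson_length` is hypothetical.

```python
import struct

def describe_bson_length(raw):
    """Decode a BSON length prefix.

    Returns (size, hex) where size is the first 4 bytes read as a
    little-endian signed int32 and hex is those same bytes in memory order.
    """
    prefix = raw[:4]
    (size,) = struct.unpack("<i", prefix)
    return size, prefix.hex()

# The two length prefixes reported in this ticket, written in memory order.
print(describe_bson_length(bytes.fromhex("00000324")))  # (604176384, '00000324')
print(describe_bson_length(bytes.fromhex("00000000")))  # (0, '00000000')
```

Under that reading, both sizes are implausible for a real document (the smallest valid BSON document is 5 bytes), which is consistent with the reads hitting overwritten or zeroed data rather than genuine oplog entries.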

Clearing out the local files one more time seems to have gotten it to start again. For the other one, no combination of removing the local, journal, lock, etc. files gets it to start, so we're going to try syncing from scratch from the secondary.

I've saved the EBS volumes if you want to have a look.

Comment by Dwight Merriman [ 18/Jul/11 ]

Investigating. It's not obvious what is wrong. Is just one node down, i.e. is the cluster otherwise up?

What's your command line like?

Was any unusual operation started just before this manifested?

Comment by Mike K [ 18/Jul/11 ]

Also, it crashes when started with --repair (attached as repair-crash.txt)

Comment by Dwight Merriman [ 18/Jul/11 ]

checking
