[SERVER-3439] mongod complains about too much data uncommitted, crashes with assertionFailure, then fails to start
Created: 18/Jul/11  Updated: 12/Jul/16  Resolved: 03/Aug/11
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Stability |
| Affects Version/s: | 1.8.2 |
| Fix Version/s: | 1.9.2 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Mike K | Assignee: | Dwight Merriman |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Ubuntu EC2 Linux on m2.4xlarge |
| Attachments: | mongocrash.log, mongocrash2.log, repair-crash.txt |
| Operating System: | ALL |
| Participants: | Mike K, Dwight Merriman |
| Description |
The Primary of one of our replica sets suddenly died with the attached stack trace (mongocrash.log) and now refuses to start, throwing stack traces like the one in mongocrash2.log.
| Comments |
| Comment by Dwight Merriman [ 03/Aug/11 ] |
Hopefully fixed via a couple of changes, including this one:
| Comment by Dwight Merriman [ 20/Jul/11 ] |
OK, have a theory as to the repair crash; will be trying some things.
| Comment by Dwight Merriman [ 18/Jul/11 ] |
Email me re: looking at the volume tx. dwight@10gen.com
| Comment by Mike K [ 18/Jul/11 ] |
Also, in terms of commands etc., nothing different (as far as we can tell) was happening; this is a system that's been running for about two months without any major changes.
| Comment by Mike K [ 18/Jul/11 ] |
The cluster failed over to the secondary, but about 5 minutes ago the secondary failed too (unfortunately I only have the bottom of the stack trace):

    /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7ff67d0afd05]
    Mon Jul 18 17:20:10 [LockPinger] warning: dist_lock process: ip-10-78-233-4:1311001137:135267863 pinging: config-social1:20000,config-social2:20000,config-social3:20000 failed: Invalid BSONObj size: 0 (0x00000000) first element: EOO

At this point the secondary (now the primary) wouldn't start up at all, until I removed the local files. Then it would start, but a few minutes later gave me this:

    social1:PRIMARY> db.oplog.rs.find()

Clearing out the local files one more time seems to have gotten it to start again. For the other one, no combination of removing the local, journal, lock, etc. files gets it to start, so we're going to try syncing from scratch from the secondary. I've saved the EBS volumes if you want to have a look.
| Comment by Dwight Merriman [ 18/Jul/11 ] |
Investigating; not obvious what is wrong. Is just one node down, i.e. is the cluster otherwise up? What's your command line like? Was any unusual operation started just before this manifested?
| Comment by Mike K [ 18/Jul/11 ] |
Also, it crashes when started with --repair (attached as repair-crash.txt).
| Comment by Dwight Merriman [ 18/Jul/11 ] |
Checking.