[SERVER-2113] repeatable server crash (fixed by --repair) Created: 17/Nov/10  Updated: 29/May/12  Resolved: 02/Sep/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Drew Perttula
Assignee: Unassigned
Resolution: Incomplete
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: 1.6.3 64bit ubuntu linux
Operating System: ALL
Participants:

 Description   

FYI, I was getting this crash whenever my {t: {$gte: someTime}} query ran at just the wrong time, i.e. when a certain document was going to be in the result set. I tried --repair. The first attempt froze my whole machine during indexing (bad disk? power?), but a later --repair run completed successfully and seems to have fixed the problem.

Tue Nov 16 23:25:24 Backtrace:
0x8212f9 0x7fdc251a2c20 0x5b868b 0x5b942b 0x6eb872 0x5fdeaf 0x704cc0 0x708acd 0x8235ef 0x837460 0x7fdc25c98971 0x7fdc2525591d
build_mongodb/bin/mongod(mongo::abruptQuit(int)+0x399) [0x8212f9]
/lib/libc.so.6(+0x33c20) [0x7fdc251a2c20]
build_mongodb/bin/mongod(mongo::Matcher::matchesDotted(char const*, mongo::BSONElement const&, mongo::BSONObj const&, int, mongo::ElementMatcher const&, bool, mongo::MatchDetails*)+0x1fcb) [0x5b868b]
build_mongodb/bin/mongod(mongo::Matcher::matches(mongo::BSONObj const&, mongo::MatchDetails*)+0xeb) [0x5b942b]
build_mongodb/bin/mongod(mongo::CoveredIndexMatcher::matches(mongo::BSONObj const&, mongo::DiskLoc const&, mongo::MatchDetails*)+0xe2) [0x6eb872]
build_mongodb/bin/mongod(mongo::processGetMore(char const*, int, long long, mongo::CurOp&, int, bool&)+0x29f) [0x5fdeaf]
build_mongodb/bin/mongod(mongo::receivedGetMore(mongo::DbResponse&, mongo::Message&, mongo::CurOp&)+0x240) [0x704cc0]
build_mongodb/bin/mongod(mongo::assembleResponse(mongo::Message&, mongo::DbResponse&, mongo::SockAddr const&)+0x14ed) [0x708acd]
build_mongodb/bin/mongod(mongo::connThread(mongo::MessagingPort*)+0x30f) [0x8235ef]
build_mongodb/bin/mongod(thread_proxy+0x80) [0x837460]
/lib/libpthread.so.0(+0x7971) [0x7fdc25c98971]
/lib/libc.so.6(clone+0x6d) [0x7fdc2525591d]

Tue Nov 16 23:25:24 dbexit:



 Comments   
Comment by Eliot Horowitz (Inactive) [ 02/Sep/11 ]

Just a note that for this setup you probably want to run with journaling.

Comment by Drew Perttula [ 21/Nov/10 ]

I have nothing auto-removing lock files. When we get leftover lock files, we usually run repair. But the crash shown in this ticket never left a lock file. "mongod --dbpath /db --port 11021" would get autorestarted and be back up within a second or two.

Comment by Eliot Horowitz (Inactive) [ 21/Nov/10 ]

Until 1.8, you have to run --repair after a crash.
It shouldn't let you start the db without doing that.
Are you manually removing the lock file?

Comment by Drew Perttula [ 21/Nov/10 ]

Nope. I don't have separate startup commands for 'normal' and 'after a crash', so I didn't include --repair since it would slow down all the normal startups.
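
For illustration only, here is a minimal sketch of how a single startup wrapper could run --repair only after an unclean shutdown, so normal startups stay fast. It assumes a crash leaves a non-empty lock file named mongod.lock inside the dbpath; that file name and the non-empty check are assumptions, and the crash in this ticket reportedly left no lock file, so this check would not have caught it. The helper names needs_repair and startup_commands are hypothetical.

```python
import os

# Assumption: an unclean shutdown leaves a non-empty lock file with this
# name inside the dbpath. Adjust if your deployment differs.
LOCK_FILE_NAME = "mongod.lock"


def needs_repair(dbpath):
    """Return True if a non-empty lock file is left over in dbpath."""
    lock_path = os.path.join(dbpath, LOCK_FILE_NAME)
    return os.path.isfile(lock_path) and os.path.getsize(lock_path) > 0


def startup_commands(dbpath, port, mongod="mongod"):
    """Build the commands to run, in order: an optional repair pass,
    followed by the normal startup command."""
    base = [mongod, "--dbpath", dbpath, "--port", str(port)]
    commands = []
    if needs_repair(dbpath):
        # Only pay the cost of --repair when a leftover lock file is found.
        commands.append(base + ["--repair"])
    commands.append(base)
    return commands


if __name__ == "__main__":
    # Print the commands instead of executing them, so the sketch is safe
    # to run anywhere. A process supervisor would run them in sequence.
    for cmd in startup_commands("/db", 11021):
        print(" ".join(cmd))
```

The wrapper only builds the command lines. Actually executing them, and deciding what to do if the repair pass itself fails, is left to whatever supervises the process.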

Comment by Eliot Horowitz (Inactive) [ 18/Nov/10 ]

Does the automatic restarter run --repair?

Comment by Drew Perttula [ 18/Nov/10 ]

I'm not exactly sure what you mean. The system was crashing a lot over the weekend for unrelated reasons (attempted hardware upgrades). Since I run mongod under supervisord with automatic restarts, even total mongod crashes like the above are actually pretty unnoticeable to me unless I go digging. I think the crashes had been happening every 30 minutes for more than a day by the time I started investigating why my one query always seemed to fail with "could not find master/primary".

I mostly posted this trace so you could notice if you were getting a lot of reports in that one method, or if the bug was obvious enough that it could be fixed just from the trace. I don't think there's much to be done about my particular situation. Sorry I didn't clone the corrupt db before the repair.

Comment by Eliot Horowitz (Inactive) [ 17/Nov/10 ]

Had this system/mongod ever crashed without a full repair before?
