[SERVER-4190] SEGFAULT doing query Created: 02/Nov/11  Updated: 30/Mar/12  Resolved: 14/Nov/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.0.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Myers Carpenter Assignee: Aaron Staple
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 64-bit


Attachments: Text File mongodb.log    
Issue Links:
Duplicate
duplicates SERVER-4276 prevent user access to index namespaces Closed
Operating System: Linux
Participants:

 Description   

Out of the blue, the PRIMARY crashed: it kept listening for connections but stopped replying to anything, so automatic failover never happened.

Tue Nov  1 17:04:04 [conn16097] warning: no _id index on $snapshot query, ns:locos_production.data_visualization.conversations.content_set_report.active_path_funnel_forks.$_id_
Tue Nov  1 17:04:04 [conn16097] Assertion: 10334:Invalid BSONObj size: 93741056 (0x00609605) first element: : 5.432309222341481e-312
0x588cb2 0x5077a1 0x86a40f 0x86a5b8 0x970a82 0x8c4f56 0x8d4a90 0x8d50e6 0x8d891e 0x8da333 0x8db607 0x964369 0x882407 0x888c2c 0xa9c576 0x638937 0x32544077e1 0x3253ce18ed 
 /usr/local/mongodb/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x588cb2]
 /usr/local/mongodb/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x471) [0x5077a1]
 /usr/local/mongodb/bin/mongod(_ZN5mongo19CoveredIndexMatcher7matchesERKNS_7BSONObjERKNS_7DiskLocEPNS_12MatchDetailsEb+0x15f) [0x86a40f]
 /usr/local/mongodb/bin/mongod(_ZN5mongo19CoveredIndexMatcher14matchesCurrentEPNS_6CursorEPNS_12MatchDetailsE+0xa8) [0x86a5b8]
 /usr/local/mongodb/bin/mongod(_ZN5mongo11UserQueryOp4nextEv+0x262) [0x970a82]
 /usr/local/mongodb/bin/mongod(_ZN5mongo12QueryPlanSet6Runner6nextOpERNS_7QueryOpE+0x56) [0x8c4f56]
 /usr/local/mongodb/bin/mongod(_ZN5mongo12QueryPlanSet6Runner4nextEv+0x110) [0x8d4a90]
 /usr/local/mongodb/bin/mongod(_ZN5mongo12QueryPlanSet6Runner22runUntilFirstCompletesEv+0x56) [0x8d50e6]
 /usr/local/mongodb/bin/mongod(_ZN5mongo12QueryPlanSet5runOpERNS_7QueryOpE+0x11e) [0x8d891e]
 /usr/local/mongodb/bin/mongod(_ZN5mongo16MultiPlanScanner9runOpOnceERNS_7QueryOpE+0x523) [0x8da333]
 /usr/local/mongodb/bin/mongod(_ZN5mongo16MultiPlanScanner5runOpERNS_7QueryOpE+0x17) [0x8db607]
 /usr/local/mongodb/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0xa79) [0x964369]
 /usr/local/mongodb/bin/mongod() [0x882407]
 /usr/local/mongodb/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x55c) [0x888c2c]
 /usr/local/mongodb/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xa9c576]
 /usr/local/mongodb/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x638937]
 /lib64/libpthread.so.0() [0x32544077e1]
 /lib64/libc.so.6(clone+0x6d) [0x3253ce18ed]
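
For context (an editor's sketch, not part of the original report): the warning and assertion above show a $snapshot query running against the index namespace "...$_id_" itself, the sort of query an old mongodump could issue. A minimal mongo shell sketch of such a query, using the namespace from the log (whether it still triggers the assertion depends on the data; SERVER-4276 later blocked user access to index namespaces entirely):

// Sketch only: query the index namespace directly, as mongodump 1.6.5
// apparently did. Database and collection names are taken from the log
// line above.
db.getSiblingDB("locos_production")
  .getCollection("data_visualization.conversations.content_set_report.active_path_funnel_forks.$_id_")
  .find()
  .snapshot()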

Either I'm misreading the log, or the server is confused about the _id index on "data_visualization.conversations.content_set_report.active_path_funnel_forks", because db.data_visualization.conversations.content_set_report.active_path_funnel_forks.getIndexes() returns

[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"ns" : "locos_production.data_visualization.conversations.content_set_report.active_path_funnel_forks",
		"name" : "_id_"
	}
]

Full log attached.
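
One hedged way to inspect what the server thinks exists here (a sketch assuming an MMAPv1-era server such as 2.0.1): the system.namespaces collection lists every namespace in the database, including the "$_id_" index namespace the crashing query targeted.

// Sketch: list the raw namespace entries for this collection,
// including index namespaces such as "...$_id_".
db.getSiblingDB("locos_production").system.namespaces.find(
    { name: /active_path_funnel_forks/ }
)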



 Comments   
Comment by Aaron Staple [ 14/Nov/11 ]

Looks like the cause is SERVER-4276 - I'm closing this ticket as a duplicate.

Comment by Eliot Horowitz (Inactive) [ 06/Nov/11 ]

Can you send a list of databases and collections from the node that crashed?

Comment by Andrew Harbick [ 06/Nov/11 ]

That is correct. We have a situation where one of our replicas (running 2.0.1) can be crashed by dumping it with the 1.6.5 mongodump utility, producing the above stack trace.

Two things:
1. We unfortunately don't have a clean way to bundle this up as a reproducible case.
2. I'm pretty sure there is something corrupted in that replica (presumably only in an index, as you alluded to here: http://groups.google.com/group/mongodb-user/browse_thread/thread/d2c88aeb0ac86d4c), because we can't crash the other replica members with the 1.6.5 mongodump utility.

I'm not sure if we've rebuilt that replica off the master yet (it's only on a testing system right now), but I'm pretty sure that if we rebuilt it from the master everything would be back to normal...

Just wanted to call out that:
a. mongodump 1.6.5 interacted in a weird way to crash the 2.0.1 master.
b. It may have been that interaction that corrupted something.
c. Even once we bring the replica back up, not all is right... Replication works and the data seems fine, but something about that replica remains crashable with the mongodump 1.6.5 utility.
d. When the master crashed, it did so in such a way that it didn't seem to fully go down, so client failover couldn't happen (see the sketch below).
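
A minimal way to observe point (d) from another member, assuming a standard replica set (an illustrative sketch, not taken from the report): a hung primary often still reports stateStr "PRIMARY" in rs.status() while its heartbeat timestamps fall behind.

// Sketch: run in the mongo shell on a healthy member. The member you
// are connected to reports no lastHeartbeat field of its own.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  lastHeartbeat: " + (m.lastHeartbeat || "self"));
});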

Comment by Eliot Horowitz (Inactive) [ 06/Nov/11 ]

Something is also odd here, as direct queries on index namespaces shouldn't be issued like that.

I think this is probably a bug in the 1.6.5 mongodump doing something it shouldn't.
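
For contrast (a sketch, assuming a 2.0-era server): the supported way to read index metadata is getIndexes(), as shown above, or a query on system.indexes, never a query against the index namespace itself.

// Sketch: supported index-metadata lookup on a 2.0-era server.
db.getSiblingDB("locos_production").system.indexes.find(
    { ns: /active_path_funnel_forks/ }
)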

Comment by Eliot Horowitz (Inactive) [ 06/Nov/11 ]

Just to be clear, mongodump version 1.6.5 is crashing server version 2.0.1 with the above stack trace?

Comment by Andrew Harbick [ 02/Nov/11 ]

OK... I'm pretty sure this issue is related to https://jira.mongodb.org/browse/SERVER-2973

That is:
1. I traced the error above back to someone using mongodump 1.6.5.
2. I can run mongodump 1.8.3 and it dies on a collection with a long name (see the sketch below).
3. I can run mongodump 2.0.1 and it succeeds.
4. When a colleague ran mongodump 1.6.5 a second time, the dump failed, but the server also failed in a way very similar to SERVER-2973, as described above.
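
On the long-name connection to SERVER-2973 (a sketch with hedged numbers): MMAPv1-era servers capped the full namespace, "<db>.<collection>" and, for indexes, "<db>.<collection>.$<indexName>", at roughly 120 bytes, so very long collection names leave little or no room for the index suffix.

// Sketch: compute the length of the index namespace from this report.
var ns = "locos_production." +
    "data_visualization.conversations.content_set_report.active_path_funnel_forks" +
    ".$_id_";
print(ns.length);  // 99 characters here, already close to the cap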

So... while we can get around the problem just by using the latest version of mongodump, it feels kinda bad that we can kill our database with the wrong version.

Myers is going to try to come up with a concise way to cause the problem.

Comment by Andrew Harbick [ 02/Nov/11 ]

It should be noted that the title kinda undersells this issue... It's definitely "Bad™" that the server crashes on a query. What's worse, though, is that the server doesn't stop listening, so automatic failover to the other replicas didn't happen.
