[SERVER-9922] Mongod instance (2.4.4) crashed with the following errors with segmentation fault Created: 13/Jun/13  Updated: 10/Dec/14  Resolved: 27/Nov/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Neeraj Punmiya Assignee: Unassigned
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Mongo instances are running in replication mode.
O/S: RHEL 6.3 64 bit


Attachments: Zip Archive mongod.zip    
Issue Links:
Duplicate
duplicates SERVER-9824 Assertion on unordered_fast_key_table... Closed
Related
Operating System: Linux
Steps To Reproduce:

Crashed at midnight. Log attached for analysis

Participants:

 Description   

Server log
==========

Thu Jun 13 00:05:38.083 [initandlisten] connection accepted from 172.16.223.24:59235 #127832 (5 connections now open)
Thu Jun 13 00:05:38.083 [conn127832] authenticate db: wse { authenticate: 1, user: "uuapp", nonce: "xxxxxxxxx", key: "yyyyyyyyyyyyyy" }
Thu Jun 13 00:05:38.086 [conn127832] build index wse.msd20130613 { _id: 1 }
Thu Jun 13 00:05:38.086 [conn3566] getMore: cursorid not found local.oplog.rs 1958166279006819049
Thu Jun 13 00:05:38.098 [conn127832] build index done. scanned 0 total records. 0.011 secs
Thu Jun 13 00:05:38.098 [conn127832] local.oplog.rs Assertion failure firstEmpty >= 0 src/mongo/util/unordered_fast_key_table_internal.h 94
0xdd2331 0xd9350d 0xc40318 0x8b68f8 0x822216 0xa69cb2 0xa66849 0x9f0923 0x9f147c 0x9f7cae 0x6e7978 0xdbea9e 0x381ea07851 0x381e6e767d 
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
/usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0xd9350d]
/usr/bin/mongod(_ZN5mongo3Top6recordERKNS_10StringDataEiixb+0x9e8) [0xc40318]
/usr/bin/mongod(_ZNK5mongo5CurOp16recordGlobalTimeEx+0x68) [0x8b68f8]
/usr/bin/mongod(_ZN5mongo6Client7ContextD1Ev+0x46) [0x822216]
/usr/bin/mongod() [0xa69cb2]
/usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pbb+0x49) [0xa66849]
/usr/bin/mongod(_ZN5mongo14checkAndInsertEPKcRNS_7BSONObjE+0x133) [0x9f0923]
/usr/bin/mongod(_ZN5mongo14receivedInsertERNS_7MessageERNS_5CurOpE+0xa2c) [0x9f147c]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xabe) [0x9f7cae]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e7978]
/usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdbea9e]
/lib64/libpthread.so.0() [0x381ea07851]
/lib64/libc.so.6(clone+0x6d) [0x381e6e767d]
Thu Jun 13 00:05:38.105 [conn4] Assertion: 10334:BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: EOO
Thu Jun 13 00:05:38.105 Invalid access at address: 0x6900 from thread: conn127832
0xdd2331 0xd93c6b 0xd941ac 0x6ebf2f 0xa51ebd 0xa5289b 0x810411 0xa7c9a6 0x9f47f3 0x9f79f7 0x6e7978 0xdbea9e 0x381ea07851 0x381e6e767d 
Thu Jun 13 00:05:38.105 Got signal: 11 (Segmentation fault).
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd93c6b]
/usr/bin/mongod() [0xd941ac]
/usr/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x5bf) [0x6ebf2f]
/usr/bin/mongod(_ZNK5mongo19CoveredIndexMatcher7matchesERKNS_7BSONObjERKNS_7DiskLocEPNS_12MatchDetailsEb+0x20d) [0xa51ebd]
/usr/bin/mongod(_ZNK5mongo19CoveredIndexMatcher14matchesCurrentEPNS_6CursorEPNS_12MatchDetailsE+0xeb) [0xa5289b]
/usr/bin/mongod(_ZN5mongo6Cursor14currentMatchesEPNS_12MatchDetailsE+0x41) [0x810411]
/usr/bin/mongod(_ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRbPb+0x406) [0xa7c9a6]
/usr/bin/mongod(_ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0xec3) [0x9f47f3]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x807) [0x9f79f7]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e7978]
/usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdbea9e]
/lib64/libpthread.so.0() [0x381ea07851]
/lib64/libc.so.6(clone+0x6d) [0x381e6e767d]
Thu Jun 13 00:05:38.143 Backtrace:
0xdd2331 0x6cfb19 0x6d00a2 0x381ea0f500 0x381e731aff 0x382a292b81 0xdacead 0xd93482 0xc40318 0x8b68f8 0x822216 0x9f14e8 0x9f7cae 0x6e7978 0xdbea9e 0x381ea07851 0x381e6e767d 
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6cfb19]
/usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x6d00a2]
/lib64/libpthread.so.0() [0x381ea0f500]
/lib64/libc.so.6() [0x381e731aff]
/usr/lib64/libstdc++.so.6(_ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc+0x21) [0x382a292b81]
/usr/bin/mongod(_ZN5mongo9LogstreamlsEPKc+0xd) [0xdacead]
/usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0x72) [0xd93482]
/usr/bin/mongod(_ZN5mongo3Top6recordERKNS_10StringDataEiixb+0x9e8) [0xc40318]
/usr/bin/mongod(_ZNK5mongo5CurOp16recordGlobalTimeEx+0x68) [0x8b68f8]
/usr/bin/mongod(_ZN5mongo6Client7ContextD1Ev+0x46) [0x822216]
/usr/bin/mongod(_ZN5mongo14receivedInsertERNS_7MessageERNS_5CurOpE+0xa98) [0x9f14e8]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xabe) [0x9f7cae]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e7978]
/usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdbea9e]
/lib64/libpthread.so.0() [0x381ea07851]
/lib64/libc.so.6(clone+0x6d) [0x381e6e767d]



 Comments   
Comment by Eliot Horowitz (Inactive) [ 27/Nov/13 ]

Sorry for not cleaning this up, but this was resolved in 2.4.5 as part of SERVER-9824

If you are having an issue in 2.4.5 or later, please open a new ticket.

Comment by Neeraj Punmiya [ 27/Nov/13 ]

This is blocker bug in our environment, We were forced to downgrade to lower version as per suggestion.
Almost 6 months, Bug is still open and unresolved.

Are Mongo designers serious about this issue or not?

Is there any plan to fix this bug?

Comment by Tad Marshall [ 13/Jun/13 ]

No, the code involved is the same for 2.4.0 through 2.4.4 ... 2.4.2 will have the same issue.

Comment by Neeraj Punmiya [ 13/Jun/13 ]

We are relying on the user management capabilities added in 2.4.x.
Since bug 9824 affects version 2.4.3, do you think downgrading to 2.4.2 will solve our problem?

Comment by Tad Marshall [ 13/Jun/13 ]

The 2.2.4 version does not include the unordered_fast_key_table_internal.h code that is giving you the assertion and 2.2.4 is a supported current version.

If you are able to try it in a testing environment using a copy of the databases from your production system before deploying, that would be the best course to guarantee that you will see no issues in downgrading, but it should just work.

Please check the release note for 2.2 and 2.4 for any specific instructions for makng a smooth downgrade.

Comment by Neeraj Punmiya [ 13/Jun/13 ]

You also hit "[rsHealthPoll] replset info 172.16.223.41:27017 heartbeat failed, retrying" 1157 times, so it seems that a few things are not working correctly for you.

We had started 172.16.223.41 after some time, therefore heartbeat failures seems to be alright.

Were you using an earlier version of MongoDB before this roll-out?

This was fresh instance placed in production first time with 2.4.4.
We cannot wait for next stable release.

Which version we should rollback to?

Please suggest.

Comment by Tad Marshall [ 13/Jun/13 ]

The assertion in SERVER-9824 is triggered by an incorrect assumption in the code about how keys will be stored in a hash table, and hitting it (or not) is completely data-dependent, so it's hard to draw conclusions about when it will be hit and when it won't be hit.

Querying a non-existent collection should not be a problem; you should get zero results and there should not be any additional issues.

If avoiding querying a non-existent collection prevents you from hitting SERVER-9824, then that sounds like a fine workaround.

Comment by Tad Marshall [ 13/Jun/13 ]

Thanks for the log.

Your log shows that you hit the unordered_fast_key_table_internal.h assertion 461 times, including the one that segfaulted. This error should abort whatever operation triggered the assertion, and the failure should have been returned to the originator of the operation.

You also hit "[rsHealthPoll] replset info 172.16.223.41:27017 heartbeat failed, retrying" 1157 times, so it seems that a few things are not working correctly for you.

We'll look at the stack trace from the segfault and see if the error handling can be improved, though this bug may not happen once the fix to SERVER-9824 is in.

Were you using an earlier version of MongoDB before this rollout?

Since your workload is hitting SERVER-9824 pretty hard, you might do better for now on an earlier version if that is an option.

Comment by Neeraj Punmiya [ 13/Jun/13 ]

We faced assertion failure "firstEmpty >= 0" earlier also. Part of attached log:

Mon Jun 10 12:36:24.690 [conn33179]  wse.reg20000202 Assertion failure firstEmpty >= 0 src/mongo/util/unordered_fast_key_table_internal.h 94
0xdd2331 0xd9350d 0xc40318 0x8b68f8 0x822216 0xa8285b 0x9f6059 0x9f7572 0x6e7978 0xdbea9e 0x381ea07851 0x381e6e767d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
 /usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0xd9350d]
 /usr/bin/mongod(_ZN5mongo3Top6recordERKNS_10StringDataEiixb+0x9e8) [0xc40318]
 /usr/bin/mongod(_ZNK5mongo5CurOp16recordGlobalTimeEx+0x68) [0x8b68f8]
 /usr/bin/mongod(_ZN5mongo6Client7ContextD1Ev+0x46) [0x822216]
 /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x1b4b) [0xa8285b]
 /usr/bin/mongod() [0x9f6059]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x382) [0x9f7572]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e7978]
 /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdbea9e]
 /lib64/libpthread.so.0() [0x381ea07851]
 /lib64/libc.so.6(clone+0x6d) [0x381e6e767d]
Mon Jun 10 12:36:24.710 [conn33179] assertion 0 assertion src/mongo/util/unordered_fast_key_table_internal.h:94 ns:wse.reg20000202 query:{ lt: { $gte: new Date(949449600000), $lte: new Date(949536000000) } }

If you refer above query, our application was querying on a non-existing collection. This used to work fine earlier but in this version we faced this problem. To work around this problem, we changed our application to check the existence of collection before firing the query.

Another interesting thing is that this assertion was not raised for all non-existing collections.

Comment by Tad Marshall [ 13/Jun/13 ]

The initial assertion (local.oplog.rs Assertion failure firstEmpty >= 0 src/mongo/util/unordered_fast_key_table_internal.h 94) is probably SERVER-9824, but the segfault suggests poor error handling following the assertion. The thread that segfaulted (conn127832) is the same one that hit the unordered_fast_key_table_internal.h assertion.

Within the same millisecond (Thu Jun 13 00:05:38.105) there was an assertion on another thread (conn4): Assertion: 10334:BSONObj size: -286331154 (0xEEEEEEEE) is invalid. This indicates that a deleted record was accessed (0xEEEEEEEE is a marker for a deleted record). This may be related to the earlier assertion, or it may be a separate event.

Can you post a full log (gzipped) as an attachment to help us diagnose what happened?

SERVER-9824 has been fixed in the master branch, but has not yet been backported.

Generated at Thu Feb 08 03:21:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.