[SERVER-9922] Mongod instance (2.4.4) crashed with the following errors with segmentation fault Created: 13/Jun/13 Updated: 10/Dec/14 Resolved: 27/Nov/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Neeraj Punmiya | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Mongo instances are running in replication mode. |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Operating System: | Linux | ||||||||||||
| Steps To Reproduce: | Crashed at midnight. Log attached for analysis |
||||||||||||
| Participants: | |||||||||||||
| Description |
|
Server log
|
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 27/Nov/13 ] | |||||||||||||||
|
Sorry for not cleaning this up, but this was resolved in 2.4.5 as part of If you are having an issue in 2.4.5 or later, please open a new ticket. | |||||||||||||||
| Comment by Neeraj Punmiya [ 27/Nov/13 ] | |||||||||||||||
|
This is a blocker bug in our environment. We were forced to downgrade to a lower version, as suggested. Are the Mongo designers serious about this issue or not? Is there any plan to fix this bug? | |||||||||||||||
| Comment by Tad Marshall [ 13/Jun/13 ] | |||||||||||||||
|
No, the code involved is the same for 2.4.0 through 2.4.4 ... 2.4.2 will have the same issue. | |||||||||||||||
| Comment by Neeraj Punmiya [ 13/Jun/13 ] | |||||||||||||||
|
We are relying on the user management capabilities added in 2.4.x. | |||||||||||||||
| Comment by Tad Marshall [ 13/Jun/13 ] | |||||||||||||||
|
The 2.2.4 version does not include the unordered_fast_key_table_internal.h code that is giving you the assertion, and 2.2.4 is a currently supported version. If you are able to try it in a testing environment using a copy of the databases from your production system before deploying, that would be the best way to make sure that you will see no issues in downgrading, but it should just work. Please check the release notes for 2.2 and 2.4 for any specific instructions for making a smooth downgrade. | |||||||||||||||
| Comment by Neeraj Punmiya [ 13/Jun/13 ] | |||||||||||||||
We started 172.16.223.41 some time later, so the heartbeat failures seem to be expected.
This was a fresh instance, placed in production for the first time with 2.4.4. Which version should we roll back to? Please suggest. | |||||||||||||||
| Comment by Tad Marshall [ 13/Jun/13 ] | |||||||||||||||
|
The assertion in Querying a non-existent collection should not be a problem; you should get zero results and there should not be any additional issues. If avoiding querying a non-existent collection prevents you from hitting | |||||||||||||||
| Comment by Tad Marshall [ 13/Jun/13 ] | |||||||||||||||
|
Thanks for the log. Your log shows that you hit the unordered_fast_key_table_internal.h assertion 461 times, including the one that segfaulted. This error should abort whatever operation triggered the assertion, and the failure should have been returned to the originator of the operation. You also hit "[rsHealthPoll] replset info 172.16.223.41:27017 heartbeat failed, retrying" 1157 times, so it seems that a few things are not working correctly for you. We'll look at the stack trace from the segfault and see if the error handling can be improved, though this bug may not happen once the fix to Were you using an earlier version of MongoDB before this rollout? Since your workload is hitting | |||||||||||||||
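Counts like the ones quoted above (461 assertion hits, 1157 heartbeat retries) can be reproduced from a log with a simple substring tally. The following is a minimal sketch, not the method actually used on this ticket; the function name `count_matches` and the file name `mongod.log` are hypothetical, and the two search strings are taken from the messages quoted in this thread.

```python
def count_matches(lines, needles):
    """Count how many lines contain each search string."""
    counts = {needle: 0 for needle in needles}
    for line in lines:
        for needle in needles:
            if needle in line:
                counts[needle] += 1
    return counts


# Search strings taken from the messages quoted in this ticket.
NEEDLES = [
    "unordered_fast_key_table_internal.h",
    "heartbeat failed, retrying",
]

# Hypothetical usage; "mongod.log" is an assumed file name:
# with open("mongod.log", errors="replace") as log:
#     print(count_matches(log, NEEDLES))
```

A line that contains both strings is counted once for each, which is fine for a rough tally of this kind.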
| Comment by Neeraj Punmiya [ 13/Jun/13 ] | |||||||||||||||
|
We also hit the assertion failure "firstEmpty >= 0" earlier. Part of the attached log:
If you refer to the query above, our application was querying a non-existent collection. This used to work fine earlier, but in this version we hit this problem. To work around it, we changed our application to check that the collection exists before issuing the query. Another interesting thing is that this assertion was not raised for all non-existent collections. | |||||||||||||||
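The workaround described in this comment (check that the collection exists before querying it) might look roughly like the sketch below. The reporter's application code is not shown in the ticket, so this is only an illustration: the helper name `find_if_exists` is hypothetical, and `db` is assumed to be a PyMongo-style database object exposing `list_collection_names()` (older drivers used `collection_names()` instead).

```python
def find_if_exists(db, collection_name, query):
    """Return matching documents, or [] when the collection does not exist.

    `db` is assumed to behave like a PyMongo Database: it offers
    list_collection_names() and item access by collection name.
    """
    # Skip the query entirely when the collection is missing, which is
    # the application-side workaround described in the comment.
    if collection_name not in db.list_collection_names():
        return []
    return list(db[collection_name].find(query))
```

The check and the query are two separate round trips, so a collection created or dropped in between can still change the outcome; the sketch only avoids sending queries for collections known to be absent.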
| Comment by Tad Marshall [ 13/Jun/13 ] | |||||||||||||||
|
The initial assertion (local.oplog.rs Assertion failure firstEmpty >= 0 src/mongo/util/unordered_fast_key_table_internal.h 94) is probably Within the same millisecond (Thu Jun 13 00:05:38.105) there was an assertion on another thread (conn4): Assertion: 10334:BSONObj size: -286331154 (0xEEEEEEEE) is invalid. This indicates that a deleted record was accessed (0xEEEEEEEE is a marker for a deleted record). This may be related to the earlier assertion, or it may be a separate event. Can you post a full log (gzipped) as an attachment to help us diagnose what happened?
|
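The invalid size in the assertion above is the 0xEEEEEEEE marker read as a signed 32-bit integer, which is how a BSON document length is stored. A minimal sketch of that arithmetic follows; the helper name `as_signed_int32` is hypothetical.

```python
import struct

DELETED_MARKER = 0xEEEEEEEE  # fill pattern quoted in the assertion message


def as_signed_int32(value):
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    # Pack as unsigned little-endian, then unpack the same four bytes as signed.
    return struct.unpack("<i", struct.pack("<I", value))[0]


print(as_signed_int32(DELETED_MARKER))  # -286331154
```

Equivalently, 0xEEEEEEEE is 4008636142 unsigned, and 4008636142 - 2^32 = -286331154, matching the size reported in the log.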