[SERVER-4249] abort after invalid next size error Created: 10/Nov/11 Updated: 16/May/12 Resolved: 16/May/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Anton Winter | Assignee: | Eric Milkie |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | crash | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 10.04.3 LTS |
| Attachments: |
|
| Issue Links: |
|
| Operating System: | Linux |
| Participants: | |
| Description |
|
I experienced an abort and hang after an "invalid next size" error running 2.0.1 from the official 10gen Ubuntu package. The mongod instance had to be killed as it would not stop. The server this occurred on is the master in a master/slave configuration. The log:
[snip]
Thu Nov 10 15:45:43 Got signal: 6 (Aborted).
Thu Nov 10 15:45:43 Backtrace:

Full log information from the time of the crash is attached. |
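For context, "invalid next size" is glibc's malloc reporting that the heap bookkeeping adjacent to a block being freed has been corrupted (typically by a buffer overrun); glibc then calls abort(), which is the signal 6 seen in the log above. A minimal, purely illustrative C sketch of this error class is below; it is not MongoDB code and not the actual cause of this crash, just the kind of corruption the allocator is complaining about.

{code:c}
#include <stdlib.h>
#include <string.h>

/*
 * Illustrative only: a heap buffer overflow that clobbers the
 * allocator's bookkeeping for the neighbouring chunk. When the block
 * is freed, glibc's malloc detects the corrupted size field and
 * aborts the process with SIGABRT (signal 6), reporting an error of
 * the form "free(): invalid next size".
 */
int main(void) {
    char *buf = malloc(2000);
    if (buf == NULL)
        return 1;

    /* Write well past the end of the allocation, overwriting the
     * header of the chunk that follows it on the heap. */
    memset(buf, 'A', 4000);

    /* glibc validates the next chunk's size here and calls abort(). */
    free(buf);
    return 0;
}
{code}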
| Comments |
| Comment by Eric Milkie [ 03/Feb/12 ] |
|
Resolving as duplicate. Fix is in 2.1.0. |
| Comment by Eric Milkie [ 16/Jan/12 ] |
|
Hi Anton, |
| Comment by Daniel Pasette (Inactive) [ 16/Jan/12 ] |
|
Sorry for the delayed response Anton, I was out of the office. Not sure I have enough to go on here, but this looks like the same stack trace as reported in the linked ticket.

@eric, mms account is Brandscreen: https://mms.10gen.com/host/list/4e962f07ae6429bfa40fc821 |
| Comment by Anton Winter [ 10/Jan/12 ] |
|
2.0.1 |
| Comment by Eliot Horowitz (Inactive) [ 10/Jan/12 ] |
|
Was this with 2.0.1 or 2.0.2? |
| Comment by Anton Winter [ 10/Jan/12 ] |
|
Just had another similar crash, most recent log attached. |
| Comment by Daniel Pasette (Inactive) [ 28/Dec/11 ] |
|
Looks ok. Can you post to this ticket if there is another crash? |
| Comment by Anton Winter [ 27/Dec/11 ] |
|
Done. |
| Comment by Daniel Pasette (Inactive) [ 27/Dec/11 ] |
|
Hi Anton, is it possible to re-enable MMS including the hardware stats (see: http://mms.10gen.com/help/install.html?highlight=munin#hardware-monitoring-with-munin-node) for your cluster? |
| Comment by Anton Winter [ 19/Dec/11 ] |
|
We continue to see these crashes (only on the master, not the slaves). There is sufficient swap available, so there has to be another reason for this. Is there any other debug information I can gather that could help pinpoint why these crashes are occurring? |
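One possible avenue, assuming the crashes turn out to be memory related: track the mongod process's memory counters from /proc between crashes to see whether its address space grows steadily toward exhaustion. A minimal sketch is below; the pid argument and the idea of polling it periodically are assumptions for illustration, not something requested in this ticket.

{code:c}
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: print the memory-related lines of
 * /proc/<pid>/status for a given pid (e.g. the mongod pid).
 * Sampling this periodically is one cheap way to see whether
 * VmSize/VmRSS grow steadily in the lead-up to a crash.
 */
int main(int argc, char **argv) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    char path[64];
    snprintf(path, sizeof path, "/proc/%s/status", argv[1]);

    FILE *f = fopen(path, "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof line, f))
        if (strncmp(line, "Vm", 2) == 0)  /* VmPeak, VmSize, VmRSS, ... */
            fputs(line, stdout);
    fclose(f);
    return 0;
}
{code}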
| Comment by Anton Winter [ 28/Nov/11 ] |
|
There is nothing in syslog to suggest this. These servers are dedicated to mongodb only. If you are looking at MMS: the first crash actually occurred at 1620hrs 27/11/11 (UTC), the second crash at 2055, and the server was eventually failed over to the other host at ~0030, if that helps interpret the chart oddities. |
| Comment by Tony Hannan [ 28/Nov/11 ] |
|
Hi Anton, |
| Comment by Anton Winter [ 28/Nov/11 ] |
|
If you are observing the machines registered in MMS: after the several crashes this morning I failed over from host 2.db to 1.db, so 1.db is now the master. |
| Comment by Anton Winter [ 27/Nov/11 ] |
|
Identical crashes have been occurring again this morning. Sufficient swap had been added to the server as recommended; however, the machine had not swapped at the time of the crashes. |
| Comment by Tony Hannan [ 23/Nov/11 ] |
|
Hi Anton, I would like to keep an eye on your server to see if there is a memory leak. I see Brandscreen has 3 machines registered in MMS. Which one ran out of memory? |
| Comment by Tony Hannan [ 22/Nov/11 ] |
|
That is probably the problem. Please add some swap space. |
| Comment by Anton Winter [ 21/Nov/11 ] |
|
None |
| Comment by Tony Hannan [ 21/Nov/11 ] |
|
How much swap space do you have? |
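For reference, swap provisioning on a Linux host can be read from /proc/meminfo, which is where tools such as free(1) get their numbers. A minimal sketch for checking it programmatically follows; nothing in it is specific to this cluster, and a SwapTotal of 0 kB corresponds to the "None" answer above.

{code:c}
#include <stdio.h>
#include <string.h>

/* Minimal sketch: report SwapTotal and SwapFree from /proc/meminfo,
 * roughly the same counters that free(1) and munin's memory plugins
 * report. */
int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    char line[256];
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "SwapTotal:", 10) == 0 ||
            strncmp(line, "SwapFree:", 9) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}
{code}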
| Comment by Anton Winter [ 21/Nov/11 ] |
|
And it occurred again on that same master, this time with both signal 6 and signal 11, and with more log information. The relevant section of that log is attached. |
| Comment by Anton Winter [ 21/Nov/11 ] |
|
The logs preceding the crash, when I looked at the time, were just client connection messages. An identical signal 6 crash happened twice; the second time it occurred I promoted a nearby slave to master and resynced the original master as a slave. It had been running fine until just now, when a segfault occurred on the new master. In this particular case both hosts are AWS EC2 m2.4xl's (70Gb memory). While this third crash is a signal 11, as opposed to the originally reported multiple signal 6 crashes, it is too coincidental for them to be unrelated, so I've pasted the log output below along with the preceding log entries.

Mon Nov 21 11:05:32 [conn892896] end connection 10.x.x.x:43345
Mon Nov 21 11:05:48 Got signal: 11 (Segmentation fault).
Mon Nov 21 11:05:48 Backtrace:
Mon Nov 21 11:05:48 dbexit: |
| Comment by Eliot Horowitz (Inactive) [ 11/Nov/11 ] |
|
Everything looks ok now. |
| Comment by Anton Winter [ 11/Nov/11 ] |
|
Indeed I have, under this account. |
| Comment by Eliot Horowitz (Inactive) [ 11/Nov/11 ] |
|
Have you signed up for mms.10gen.com yet? |
| Comment by Anton Winter [ 11/Nov/11 ] |
|
Figuring out why it ran out of memory would be great. |
| Comment by Eliot Horowitz (Inactive) [ 10/Nov/11 ] |
|
This is caused by the same issue as the linked ticket. If you would like help figuring out why you ran out of memory, please let us know. |