[SERVER-4249] abort after invalid next size error Created: 10/Nov/11  Updated: 16/May/12  Resolved: 16/May/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.0.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Anton Winter Assignee: Eric Milkie
Resolution: Duplicate Votes: 0
Labels: crash
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04.3 LTS
http://downloads-distro.mongodb.org/repo/ubuntu-upstart/dists/dist/10gen/binary-amd64/mongodb-10gen_2.0.1_amd64.deb


Attachments: mongodb-crash-2.log, mongodb-crash-3.log, mongodb-crash.log
Issue Links:
Duplicate
duplicates SERVER-4609 StringBuilder signed integer max buff... Closed
Related
is related to SERVER-4609 StringBuilder signed integer max buff... Closed
Operating System: Linux
Participants:

 Description   

I experienced an abort and hang after an "invalid next size" error while running 2.0.1 from the official 10gen Ubuntu package. The mongod instance had to be killed as it would not stop. The server this occurred on is the master in a master/slave configuration.

The log:

        *** glibc detected *** /usr/bin/mongod: free(): invalid next size (normal): 0x00007f2b5c0073f0 ***
        ======= Backtrace: =========
        /lib/libc.so.6(+0x775b6)[0x7f3dbcf1d5b6]
        /lib/libc.so.6(+0x7db23)[0x7f3dbcf23b23]
        /lib/libc.so.6(realloc+0xf0)[0x7f3dbcf240b0]
        /usr/bin/mongod(_ZN5mongo11_BufBuilderINS_16TrivialAllocatorEE15grow_reallocateEv+0x42)[0x504232]
        /usr/bin/mongod(_ZNK5mongo7BSONObj8toStringERNS_13StringBuilderEbb+0x30b)[0x50a7db]
        /usr/bin/mongod(_ZNK5mongo11BSONElement8toStringERNS_13StringBuilderEbb+0x5d0)[0x508e40]
        /usr/bin/mongod(_ZNK5mongo7BSONObj8toStringERNS_13StringBuilderEbb+0x1d2)[0x50a6a2]
        /usr/bin/mongod(_ZNK5mongo11BSONElement8toStringERNS_13StringBuilderEbb+0x11b2)[0x509a22]
        /usr/bin/mongod(_ZNK5mongo7BSONObj8toStringERNS_13StringBuilderEbb+0x1d2)[0x50a6a2]
        /usr/bin/mongod(_ZNK5mongo7OpDebug8toStringEv+0x1fc)[0x8938fc]
        /usr/bin/mongod(_ZNK5mongo14LazyStringImplINS_7OpDebugEE3valEv+0xd)[0x88c31d]
        /usr/bin/mongod(_ZN5mongo9LogstreamlsERKNS_10LazyStringE+0x1a)[0x50a80a]
        /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x95c)[0x88902c]
        /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76)[0xa9c576]
        /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287)[0x638937]
        /lib/libpthread.so.0(+0x69ca)[0x7f3dbd9dd9ca]
        /lib/libc.so.6(clone+0x6d)[0x7f3dbcf8c70d]

[snip]

Thu Nov 10 15:45:43 Got signal: 6 (Aborted).

Thu Nov 10 15:45:43 Backtrace:
0xa89b19 0x7f3dbced9af0 0x7f3dbced9a75 0x7f3dbcedd5c0 0x7f3dbcf134fb 0x7f3dbcf1d5b6 0x7f3dbcf23e83 0x67e2e4 0x67d0cc 0x75bc79 0x67dcf1 0x885441 0x885ae0 0xa89c76 0x7f3dbced9af0 0x7f3dbced9a75 0x7f3dbcedd5c0 0x7f3dbcf134fb 0x7f3dbcf1d5b6 0x7f3dbcf23b23
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa89b19]
/lib/libc.so.6(+0x33af0) [0x7f3dbced9af0]
/lib/libc.so.6(gsignal+0x35) [0x7f3dbced9a75]
/lib/libc.so.6(abort+0x180) [0x7f3dbcedd5c0]
/lib/libc.so.6(+0x6d4fb) [0x7f3dbcf134fb]
/lib/libc.so.6(+0x775b6) [0x7f3dbcf1d5b6]
/lib/libc.so.6(cfree+0x73) [0x7f3dbcf23e83]
/usr/bin/mongod(_ZNSt8_Rb_treeIPN5mongo9MongoFileES2_St9_IdentityIS2_ESt4lessIS2_ESaIS2_EE5eraseERKS2_+0x144) [0x67e2e4]
/usr/bin/mongod(_ZN5mongo9MongoFile9destroyedEv+0x2c) [0x67d0cc]
/usr/bin/mongod(_ZN5mongo8MongoMMF5closeEv+0x59) [0x75bc79]
/usr/bin/mongod(_ZN5mongo9MongoFile13closeAllFilesERSt18basic_stringstreamIcSt11char_traitsIcESaIcEE+0x151) [0x67dcf1]
/usr/bin/mongod(_ZN5mongo14shutdownServerEv+0x6d1) [0x885441]
/usr/bin/mongod(_ZN5mongo6dbexitENS_8ExitCodeEPKcb+0x2a0) [0x885ae0]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x4f6) [0xa89c76]
/lib/libc.so.6(+0x33af0) [0x7f3dbced9af0]
/lib/libc.so.6(gsignal+0x35) [0x7f3dbced9a75]
/lib/libc.so.6(abort+0x180) [0x7f3dbcedd5c0]
/lib/libc.so.6(+0x6d4fb) [0x7f3dbcf134fb]
/lib/libc.so.6(+0x775b6) [0x7f3dbcf1d5b6]
/lib/libc.so.6(+0x7db23) [0x7f3dbcf23b23]

Full log information from the time of the crash is attached.



 Comments   
Comment by Eric Milkie [ 03/Feb/12 ]

Resolving as duplicate. Fix is in 2.1.0.

Comment by Eric Milkie [ 16/Jan/12 ]

Hi Anton,
The stack trace from the latest crash does look like it might be the bug that is fixed in SERVER-4609. There is an off-by-one error when allocating the buffer that stores the string representation of a number. If that number happens to be a very large negative value, its string representation can exceed the size of the buffer and corrupt the memory block footer, so that free() reports an error. Such negative numbers can be seen in the description for SERVER-4192.
This bug is fixed in the next version, which will be released soon.
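
As a rough illustration only (this is not the actual MongoDB StringBuilder code), here is a minimal hypothetical C++ sketch of the kind of off-by-one described above, where a buffer is sized for the digits of a positive int but not for the '-' sign of the most negative value:

#include <cstdio>
#include <cstdlib>
#include <climits>

int main() {
    // 10 digits + NUL is enough for any positive 32-bit int ("2147483647"),
    // but INT_MIN ("-2147483648") needs 11 characters + NUL = 12 bytes.
    char* buf = static_cast<char*>(std::malloc(11));

    // Hypothetical off-by-one: formatting INT_MIN writes one byte past the
    // allocation. An overrun like this can clobber the next heap chunk's
    // size field, and glibc then reports "free(): invalid next size" on a
    // later free()/realloc(), matching the log in the description.
    std::sprintf(buf, "%d", INT_MIN);

    std::free(buf);   // glibc's heap consistency checks may abort here
    return 0;
}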

Comment by Daniel Pasette (Inactive) [ 16/Jan/12 ]

Sorry for the delayed response, Anton; I was out of the office. Not sure I have enough to go on here, but this looks like the same stack trace as reported in SERVER-4192, which is provisionally attributed to SERVER-4609. Assigning this to someone with better domain knowledge.

@eric, mms account is Brandscreen: https://mms.10gen.com/host/list/4e962f07ae6429bfa40fc821

Comment by Anton Winter [ 10/Jan/12 ]

2.0.1

Comment by Eliot Horowitz (Inactive) [ 10/Jan/12 ]

Was this with 2.0.1 or 2.0.2?

Comment by Anton Winter [ 10/Jan/12 ]

Just had another similar crash, most recent log attached.

Comment by Daniel Pasette (Inactive) [ 28/Dec/11 ]

Looks ok. Can you post to this ticket if there is another crash?

Comment by Anton Winter [ 27/Dec/11 ]

Done.

Comment by Daniel Pasette (Inactive) [ 27/Dec/11 ]

Hi Anton, is it possible to re-enable MMS including the hardware stats (see: http://mms.10gen.com/help/install.html?highlight=munin#hardware-monitoring-with-munin-node) for your cluster?

Comment by Anton Winter [ 19/Dec/11 ]

We continue to see these crashes (only on the master, not the slaves). There is sufficient swap available, so there has to be another reason for this. Is there any other debug information I can gather that could help pinpoint why these crashes are occurring?

Comment by Anton Winter [ 28/Nov/11 ]

There is nothing in syslog to suggest this. These servers are dedicated to MongoDB only. If you're looking at MMS: the first crash actually occurred at 1620hrs 27/11/11 (UTC), the second crash at 2055, and the server was eventually failed over to the other host at ~0030, if that helps interpret the chart oddities.

Comment by Tony Hannan [ 28/Nov/11 ]

Hi Anton,
See memory chart in https://mms.10gen.com/host/detail/c0b8601468f23dedb014a25a135263af#chart5Minute
Apparently something is pushing mongod out of RAM. Do you have another process consuming memory? Check syslog/messages.

Comment by Anton Winter [ 28/Nov/11 ]

If you're observing the machines registered in MMS: after the several crashes this morning I failed over from host 2.db to 1.db, so 1.db is now the master.

Comment by Anton Winter [ 27/Nov/11 ]

Identical crashes have been occurring again this morning. Sufficient swap had been added to the server as recommended; however, the machine had not swapped at the time of the crashes.

Comment by Tony Hannan [ 23/Nov/11 ]

Hi Anton,

I would like to keep an eye on your server to see if there is a memory leak. I see Brandscreen has 3 machines registered in MMS.

Which one ran out of memory?

Comment by Tony Hannan [ 22/Nov/11 ]

That is probably the problem. Please add some swap space.

Comment by Anton Winter [ 21/Nov/11 ]

None

Comment by Tony Hannan [ 21/Nov/11 ]

How much swap space do you have?

Comment by Anton Winter [ 21/Nov/11 ]

It occurred again on that same master, this time with both signal 6 and signal 11 crashes and more log information. The relevant section of that log is attached.

Comment by Anton Winter [ 21/Nov/11 ]

The logs preceding the crash, when I looked at the time, were just client connection messages.

An identical signal 6 crash happened twice; the second time it occurred I promoted a nearby slave to master and resynced the original master as a slave. It had been running fine until just now, when a segfault occurred on the new master. In this particular case they are both AWS EC2 m2.4xls (70 GB of memory).

While this third crash is a signal 11, as opposed to the originally reported multiple signal 6 crashes, it is too coincidental for them to be unrelated, so I've pasted the log output below along with the preceding log entries.

Mon Nov 21 11:05:32 [conn892896] end connection 10.x.x.x:43345
Mon Nov 21 11:05:40 [initandlisten] connection accepted from 66.x.x.x:25663 #892898
Mon Nov 21 11:05:40 [conn892898] end connection 66.x.x.x:25663
Mon Nov 21 11:05:42 [initandlisten] connection accepted from 122.x.x.x:46403 #892899
Mon Nov 21 11:05:43 [conn892899] end connection 122.x.x.x:46403
Mon Nov 21 11:05:46 [initandlisten] connection accepted from 66.x.x.x:10673 #892900
Mon Nov 21 11:05:46 [conn892900] end connection 66.x.x.x:10673
Mon Nov 21 11:05:47 [initandlisten] connection accepted from 66.x.x.x:20264 #892901
Mon Nov 21 11:05:48 Invalid access at address: 0x3c

Mon Nov 21 11:05:48 Got signal: 11 (Segmentation fault).

Mon Nov 21 11:05:48 Backtrace:
0xa89b19 0xa8a0f0 0x7f888b58b8f0 0x7f888aac26bd 0x7f888aac5fad 0x7f888aac758e 0x961ff9 0x884864 0x88963a 0xa9c576 0x638937 0x7f888b5829ca 0x7f888ab3170d
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa89b19]
/usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa8a0f0]
/lib/libpthread.so.0(+0xf8f0) [0x7f888b58b8f0]
/lib/libc.so.6(+0x776bd) [0x7f888aac26bd]
/lib/libc.so.6(+0x7afad) [0x7f888aac5fad]
/lib/libc.so.6(__libc_malloc+0x6e) [0x7f888aac758e]
/usr/bin/mongod(_ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRb+0x109) [0x961ff9]
/usr/bin/mongod(_ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1b4) [0x884864]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xf6a) [0x88963a]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xa9c576]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x638937]
/lib/libpthread.so.0(+0x69ca) [0x7f888b5829ca]
/lib/libc.so.6(clone+0x6d) [0x7f888ab3170d]

Mon Nov 21 11:05:48 dbexit:
Mon Nov 21 11:05:48 [conn704146] shutdown: going to close listening sockets...
Mon Nov 21 11:05:48 [conn704146] closing listening socket: 28
Mon Nov 21 11:05:48 [conn704146] closing listening socket: 29
Mon Nov 21 11:05:48 [conn704146] closing listening socket: 30
Mon Nov 21 11:05:48 [conn704146] removing socket file: /tmp/mongodb-27017.sock
Mon Nov 21 11:05:48 [conn704146] shutdown: going to flush diaglog...
Mon Nov 21 11:05:48 [conn704146] shutdown: going to close sockets...
Mon Nov 21 11:05:48 [conn704146] shutdown: waiting for fs preallocator...
Mon Nov 21 11:05:48 [conn704146] shutdown: closing all files...
Mon Nov 21 11:05:48 ERROR: Client::shutdown not called: slaveTracking
Mon Nov 21 11:05:49 [conn4] got request after shutdown()
Mon Nov 21 11:05:53 [conn804706] got request after shutdown()
Mon Nov 21 11:05:55 [conn892534] got request after shutdown()
Mon Nov 21 11:05:58 [conn14791] got request after shutdown()
Mon Nov 21 11:06:11 [conn892532] got request after shutdown()

Comment by Eliot Horowitz (Inactive) [ 11/Nov/11 ]

Everything looks OK now.
Do you have more of the log preceding the crash?
How much RAM is on the box?

Comment by Anton Winter [ 11/Nov/11 ]

Indeed I have, under this account.

Comment by Eliot Horowitz (Inactive) [ 11/Nov/11 ]

Have you signed up for mms.10gen.com yet?
It would make it much easier to figure out what's going on.

Comment by Anton Winter [ 11/Nov/11 ]

Figuring out why it ran out of memory would be great.

Comment by Eliot Horowitz (Inactive) [ 10/Nov/11 ]

This is caused by the same issue as SERVER-4228 and indicates running out of memory.

If you would like help figuring out why you ran out of memory, please let us know.
