[SERVER-9048] Out of memory leads to crash and node corruption. Created: 21/Mar/13  Updated: 10/Dec/14  Resolved: 21/Mar/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Leif Mortenson Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

root@x-test2:/opt/y-cn# uname -a
Linux x-test2.tsl.local 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011 x86_64 GNU/Linux
root@x-test2:/opt/y-cn# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 16382
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited


Issue Links:
Duplicate
duplicates SERVER-6816 Improve journal data handling after m... Closed
Operating System: Linux
Participants:

 Description   

I am pretty sure this is a memory problem. We have a system with 4 GB of RAM plus 4 GB of swap, and we will try to resolve it by increasing that to 8 GB of RAM plus 8 GB of swap.

This has happened to us a couple of times, and each time we have had to rebuild the affected nodes. Both times it wiped out 2 of the 3 nodes in the cluster, which caused the remaining server to step down from primary to secondary. I am worried about what would happen if this occurred on all 3 nodes at the same time.

Is there any way to make Mongo more resilient to these problems and have it fail more gracefully?

This is what we get in the mongo.log file:
Thu Mar 21 15:14:01 [journal] warning assertion failure a <= 256*1024*1024 src/mongo/util/alignedbuilder.cpp 90
0xade6e1 0x802c5a 0x77dc73 0x753da5 0x7540b4 0xa09950 0xa0a779 0xa0ae24 0x7c3659 0x7f38ad18a8ca 0x7f38ac53db6d
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
/usr/bin/mongod(_ZN5mongo9wassertedEPKcS1_j+0x11a) [0x802c5a]
/usr/bin/mongod(_ZN5mongo14AlignedBuilder14growReallocateEj+0x63) [0x77dc73]
/usr/bin/mongod() [0x753da5]
/usr/bin/mongod(_ZN5mongo3dur13PREPLOGBUFFERERNS0_11JSectHeaderERNS_14AlignedBuilderE+0x214) [0x7540b4]
/usr/bin/mongod(_ZN5mongo3dur27groupCommitWithLimitedLocksEv+0xa0) [0xa09950]
/usr/bin/mongod() [0xa0a779]
/usr/bin/mongod(_ZN5mongo3dur9durThreadEv+0x364) [0xa0ae24]
/usr/bin/mongod() [0x7c3659]
/lib/libpthread.so.0(+0x68ca) [0x7f38ad18a8ca]
/lib/libc.so.6(clone+0x6d) [0x7f38ac53db6d]
Thu Mar 21 15:14:02 [conn94583] end connection 10.1.7.11:50309 (7 connections now open)
Thu Mar 21 15:14:02 [initandlisten] connection accepted from 10.1.7.11:50314 #94585 (8 connections now open)
Thu Mar 21 15:14:10 [conn94584] end connection 10.1.7.13:43214 (7 connections now open)
Thu Mar 21 15:14:10 [initandlisten] connection accepted from 10.1.7.13:43216 #94586 (8 connections now open)
Thu Mar 21 15:14:12 [journal] warning assertion failure a <= 256*1024*1024 src/mongo/util/alignedbuilder.cpp 90
0xade6e1 0x802c5a 0x77dc73 0x753da5 0x7540b4 0xa09950 0xa0a779 0xa0ae24 0x7c3659 0x7f38ad18a8ca 0x7f38ac53db6d
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
/usr/bin/mongod(_ZN5mongo9wassertedEPKcS1_j+0x11a) [0x802c5a]
/usr/bin/mongod(_ZN5mongo14AlignedBuilder14growReallocateEj+0x63) [0x77dc73]
/usr/bin/mongod() [0x753da5]
/usr/bin/mongod(_ZN5mongo3dur13PREPLOGBUFFERERNS0_11JSectHeaderERNS_14AlignedBuilderE+0x214) [0x7540b4]
/usr/bin/mongod(_ZN5mongo3dur27groupCommitWithLimitedLocksEv+0xa0) [0xa09950]
/usr/bin/mongod() [0xa0a779]
/usr/bin/mongod(_ZN5mongo3dur9durThreadEv+0x364) [0xa0ae24]
/usr/bin/mongod() [0x7c3659]
/lib/libpthread.so.0(+0x68ca) [0x7f38ad18a8ca]
/lib/libc.so.6(clone+0x6d) [0x7f38ac53db6d]
Thu Mar 21 15:14:12 [journal] Assertion failure a <= 512*1024*1024 src/mongo/util/alignedbuilder.cpp 91
0xade6e1 0x803dfd 0x77dc8d 0x753da5 0x7540b4 0xa09950 0xa0a779 0xa0ae24 0x7c3659 0x7f38ad18a8ca 0x7f38ac53db6d
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
/usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0x803dfd]
/usr/bin/mongod(_ZN5mongo14AlignedBuilder14growReallocateEj+0x7d) [0x77dc8d]
/usr/bin/mongod() [0x753da5]
/usr/bin/mongod(_ZN5mongo3dur13PREPLOGBUFFERERNS0_11JSectHeaderERNS_14AlignedBuilderE+0x214) [0x7540b4]
/usr/bin/mongod(_ZN5mongo3dur27groupCommitWithLimitedLocksEv+0xa0) [0xa09950]
/usr/bin/mongod() [0xa0a779]
/usr/bin/mongod(_ZN5mongo3dur9durThreadEv+0x364) [0xa0ae24]
/usr/bin/mongod() [0x7c3659]
/lib/libpthread.so.0(+0x68ca) [0x7f38ad18a8ca]
/lib/libc.so.6(clone+0x6d) [0x7f38ac53db6d]
Thu Mar 21 15:14:13 [journal] dbexception in groupCommitLL causing immediate shutdown: 0 assertion src/mongo/util/alignedbuilder.cpp:91
Thu Mar 21 15:14:13 dur1
Thu Mar 21 15:14:13 Got signal: 6 (Aborted).

Thu Mar 21 15:14:14 Backtrace:
0xade6e1 0x5582d9 0x7f38ac4a0230 0x7f38ac4a01b5 0x7f38ac4a2fc0 0xb503f7 0xa09e1f 0xa0a779 0xa0ae24 0x7c3659 0x7f38ad18a8ca 0x7f38ac53db6d
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
/lib/libc.so.6(+0x32230) [0x7f38ac4a0230]
/lib/libc.so.6(gsignal+0x35) [0x7f38ac4a01b5]
/lib/libc.so.6(abort+0x180) [0x7f38ac4a2fc0]
/usr/bin/mongod(_ZN5mongo10mongoAbortEPKc+0x47) [0xb503f7]
/usr/bin/mongod(_ZN5mongo3dur27groupCommitWithLimitedLocksEv+0x56f) [0xa09e1f]
/usr/bin/mongod() [0xa0a779]
/usr/bin/mongod(_ZN5mongo3dur9durThreadEv+0x364) [0xa0ae24]
/usr/bin/mongod() [0x7c3659]
/lib/libpthread.so.0(+0x68ca) [0x7f38ad18a8ca]
/lib/libc.so.6(clone+0x6d) [0x7f38ac53db6d]
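
For reference, the assertions in these traces appear to come from the journal's group-commit path: the durability buffer that batches pending writes for a commit grew past a warning threshold (256 MB, alignedbuilder.cpp:90) and then past a hard limit (512 MB, alignedbuilder.cpp:91), after which mongod aborts (the "Got signal: 6" above) rather than keep growing the buffer. The following is only a minimal, self-contained C++ sketch of that kind of limit check, not the actual MongoDB source; the JournalBuffer class and reserveFor function are hypothetical names used purely for illustration.

#include <cstddef>
#include <cstdio>
#include <cstdlib>

namespace sketch {

// Hypothetical stand-in for the journal's commit buffer. It only tracks the
// number of pending bytes; the real AlignedBuilder also reallocates storage.
class JournalBuffer {
public:
    void reserveFor(std::size_t bytes) {
        pending_ += bytes;
        if (pending_ > kWarnLimit)   // analogous to the warning assertion (alignedbuilder.cpp:90)
            std::fprintf(stderr, "warning: journal buffer at %zu bytes\n", pending_);
        if (pending_ > kHardLimit) { // analogous to the fatal assertion (alignedbuilder.cpp:91)
            std::fprintf(stderr, "fatal: journal buffer exceeded hard limit\n");
            std::abort();            // mirrors the SIGABRT / "Got signal: 6 (Aborted)" in the log
        }
    }

    std::size_t pending() const { return pending_; }

private:
    static constexpr std::size_t kWarnLimit = 256UL * 1024 * 1024; // warning threshold
    static constexpr std::size_t kHardLimit = 512UL * 1024 * 1024; // hard limit
    std::size_t pending_ = 0;
};

} // namespace sketch

int main() {
    sketch::JournalBuffer buf;
    // Simulate a commit interval that keeps accumulating 64 MB of journaled
    // writes without flushing: it warns past 256 MB and aborts past 512 MB.
    for (int i = 0; i < 10; ++i)
        buf.reserveFor(64UL * 1024 * 1024);
    return 0;
}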



 Comments   
Comment by Leif Mortenson [ 21/Mar/13 ]

Andy,
Thank you for the quick response. We upgraded to 2.4.0 last night and are currently rebuilding the failed nodes. It seems to be going smoothly so far.
Cheers

Comment by Andy Schwerin [ 21/Mar/13 ]

This appears to be a duplicate of SERVER-6816, which an upgrade to a newer 2.2 release would fix. SERVER-6816 only manifests on nodes acting as secondaries, which sounds consistent with your experience.

Comment by Scott Hernandez (Inactive) [ 21/Mar/13 ]

There are many fixes in the 2.2 stable branch between 2.2.0 and 2.2.3 (and soon 2.2.4) that might be helpful. Please upgrade to at least 2.2.3.

Are these nodes monitored with MMS? If so, can you please post the link to them?

Why do you need to rebuild the nodes, and how are you doing this? With journaling enabled, all you need to do is restart the servers; they should immediately recover and catch up as active members of the replica set.

Please include the full logs from one of these failures, covering the period from well before the error until after the node has been restarted.
