[SERVER-11158] Crash on startup Created: 12/Oct/13  Updated: 11/Jul/16  Resolved: 04/Nov/13

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Dwayne Bull Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu


Operating System: ALL
Steps To Reproduce:

run
/usr/bin/mongod -vvvvvvvvvvvv

Participants:

 Description   

New server built from an image of a working server. Attempted on several new servers with the same result.
The only difference is the new server has an AMD 4xxx series CPU.

This only happens with mongod; mongos starts fine.

/usr/bin/mongod -vvvvvvvvvvvv
Sat Oct 12 15:33:41.506 versionArrayTest passed
Sat Oct 12 15:33:41.507 shardKeyTest passed
Sat Oct 12 15:33:41.507 isInRangeTest passed
Sat Oct 12 15:33:41.507 shardObjTest passed
Sat Oct 12 15:33:41.507 Matcher::matches()

{ abcd: 3.1, abcdef: "123456789" }

Sat Oct 12 15:33:41.507 Matcher::matches()

{ abcd: 3.1, abcdef: "123456789" }

Sat Oct 12 15:33:41.507 Matcher::matches()

{ abcd: 3.1, abcdef: "123456789" }

Sat Oct 12 15:33:41.507 Matcher::matches()

{ abcdef: "z23456789" }

Sat Oct 12 15:33:41.507 Matcher::matches()

{ abcd: 3.1, abcdef: "123456789" }

Sat Oct 12 15:33:41.507 Matcher::matches()

{ abcdef: "z23456789" }

Sat Oct 12 15:33:41.506 BackgroundJob starting: DataFileSync
Sat Oct 12 15:33:41.513 Invalid operation at address: 0x7fb350725021 from thread:

Sat Oct 12 15:33:41.513 Got signal: 4 (Illegal instruction).

Sat Oct 12 15:33:41.517 Backtrace:
0xdd2331 0x6cfb19 0x6d00a2 0x7fb350eefcb0 0x7fb350725021 0x9b9180 0x9b9653 0x983156 0xdd2511 0x6dcdb1 0x6dea19 0x7fb35012176d 0x6ce789
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6cfb19]
/usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x6d00a2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7fb350eefcb0]
/lib/x86_64-linux-gnu/libm.so.6(+0x4d021) [0x7fb350725021]
/usr/bin/mongod(_ZN5mongo14spheredist_radERKNS_5PointES2_+0x20) [0x9b9180]
/usr/bin/mongod(_ZN5mongo14spheredist_degERKNS_5PointES2_+0x73) [0x9b9653]
/usr/bin/mongod(_ZN5mongo11GeoUnitTest3runEv+0x3856) [0x983156]
/usr/bin/mongod(_ZN5mongo11StartupTest8runTestsEv+0x31) [0xdd2511]
/usr/bin/mongod() [0x6dcdb1]
/usr/bin/mongod(main+0x9) [0x6dea19]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fb35012176d]
/usr/bin/mongod(__gxx_personality_v0+0x499) [0x6ce789]
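
The "Invalid operation at address" value in the log can be matched against the backtrace to see which module the illegal instruction came from. The following is only a minimal sketch in Python, assuming the log format shown above; the names `FRAME_RE`, `FAULT_RE`, `SAMPLE` and `faulting_frame` are hypothetical helpers, not part of mongod or any MongoDB tool.

```python
import re

# Matches backtrace frames such as:
#   /lib/x86_64-linux-gnu/libm.so.6(+0x4d021) [0x7fb350725021]
FRAME_RE = re.compile(
    r"^(?P<module>\S*?)\((?P<symbol>[^)]*)\)\s+\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)
# Matches the log line that reports the faulting address.
FAULT_RE = re.compile(r"Invalid operation at address: (0x[0-9a-fA-F]+)")

# Excerpt copied from the log in this ticket.
SAMPLE = """\
Sat Oct 12 15:33:41.513 Invalid operation at address: 0x7fb350725021 from thread:
Sat Oct 12 15:33:41.513 Got signal: 4 (Illegal instruction).
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7fb350eefcb0]
/lib/x86_64-linux-gnu/libm.so.6(+0x4d021) [0x7fb350725021]
/usr/bin/mongod(_ZN5mongo11GeoUnitTest3runEv+0x3856) [0x983156]
"""


def faulting_frame(log_text):
    """Return (module, symbol) of the frame at the faulting address, or None."""
    fault = FAULT_RE.search(log_text)
    if fault is None:
        return None
    fault_addr = int(fault.group(1), 16)
    for line in log_text.splitlines():
        frame = FRAME_RE.match(line.strip())
        if frame and int(frame.group("addr"), 16) == fault_addr:
            return frame.group("module"), frame.group("symbol")
    return None


print(faulting_frame(SAMPLE))
```

For the log above, the matching frame is the one in libm.so.6 (offset +0x4d021), which sits directly below the spheredist_rad call made by the GeoUnitTest startup test.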



 Comments   
Comment by Dwayne Bull [ 14/Oct/13 ]

I have had information back from Rackspace, as there was a concern about the hypervisors:

"We have different host hardware and 2 different versions of hypervisors, XenServer 6.0 and XenServer 6.1, but when you build a VM it is chosen randomly for the best available on our system; we don't have specific servers for different flavors."

Edit: We have confirmed that XenServer 6.1 is the issue; updating the core Ubuntu packages fixes the error.

Comment by Dwayne Bull [ 13/Oct/13 ]

I've taken more images from working servers and I'm still getting this issue.

Edit: After running apt-get update and apt-get upgrade everything works fine, although I don't see what needed to be updated, given that the image is from a working server and that updating only MongoDB doesn't help.

Comment by Dwayne Bull [ 13/Oct/13 ]

It turns out the processor isn't the issue. I booted up a few servers until I hit one with an AMD Opteron(tm) Processor 4332 HE.
A clean install of Ubuntu 12.04 and MongoDB 2.4.4 ran fine on it.

Comment by Dwayne Bull [ 12/Oct/13 ]

Md5 matched.
Run result:

root@nginx-dev-colonyattack:/var/mon/mongodb-linux-x86_64-2.4.6/bin# mongod -vvvvvvvvvv
Sat Oct 12 20:12:16.359 versionArrayTest passed
Sat Oct 12 20:12:16.360 shardKeyTest passed
Sat Oct 12 20:12:16.360 isInRangeTest passed
Sat Oct 12 20:12:16.399 shardObjTest passed
Sat Oct 12 20:12:16.389 BackgroundJob starting: DataFileSync
Sat Oct 12 20:12:16.418 Matcher::matches()

{ abcd: 3.1, abcdef: "123456789" }

Sat Oct 12 20:12:16.418 Matcher::matches()

{ abcd: 3.1, abcdef: "123456789" }

Sat Oct 12 20:12:16.418 Matcher::matches()

{ abcd: 3.1, abcdef: "123456789" }

Sat Oct 12 20:12:16.418 Matcher::matches()

{ abcdef: "z23456789" }

Sat Oct 12 20:12:16.418 Matcher::matches()

{ abcd: 3.1, abcdef: "123456789" }

Sat Oct 12 20:12:16.418 Matcher::matches()

{ abcdef: "z23456789" }

Sat Oct 12 20:12:16.425 Invalid operation at address: 0x7f0932695021 from thread:

Sat Oct 12 20:12:16.425 Got signal: 4 (Illegal instruction).

Sat Oct 12 20:12:16.429 Backtrace:
0xdd2331 0x6cfb19 0x6d00a2 0x7f0932e5fcb0 0x7f0932695021 0x9b9180 0x9b9653 0x983156 0xdd2511 0x6dcdb1 0x6dea19 0x7f093209176d 0x6ce789
mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6cfb19]
mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x6d00a2]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f0932e5fcb0]
/lib/x86_64-linux-gnu/libm.so.6(+0x4d021) [0x7f0932695021]
mongod(_ZN5mongo14spheredist_radERKNS_5PointES2_+0x20) [0x9b9180]
mongod(_ZN5mongo14spheredist_degERKNS_5PointES2_+0x73) [0x9b9653]
mongod(_ZN5mongo11GeoUnitTest3runEv+0x3856) [0x983156]
mongod(_ZN5mongo11StartupTest8runTestsEv+0x31) [0xdd2511]
mongod() [0x6dcdb1]
mongod(main+0x9) [0x6dea19]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f093209176d]
mongod(__gxx_personality_v0+0x499) [0x6ce789]

Comment by Eliot Horowitz (Inactive) [ 12/Oct/13 ]

Can you try http://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.4.6.tgz
md5 is http://fastdl.mongodb.org/linux/mongodb-linux-x86_64-2.4.6.tgz.md5

Comment by Dwayne Bull [ 12/Oct/13 ]

It's version 2.4.4 from the Ubuntu repo. I've tried the latest version too, but that had the same result.
How do I md5 from that?

Comment by Eliot Horowitz (Inactive) [ 12/Oct/13 ]

Is this an official binary?
If so, can you md5 it just so we can make sure it is the same?
(you can get the md5 by adding ".md5" to the official download url)

The next thing to try would be compiling on that machine and seeing if it works.
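
The checksum comparison suggested above could be done along these lines. This is a rough sketch in Python using the standard `hashlib` module; `md5_of_file` and `matches_published` are hypothetical helper names, and it is an assumption that the published `.md5` file starts with the hex digest (optionally followed by a file name).

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_text):
    """Compare a local file against the contents of a published .md5 file.

    Assumption: the first whitespace-separated token of `published_text`
    is the hex digest.
    """
    tokens = published_text.split()
    if not tokens:
        return False
    return md5_of_file(path) == tokens[0].lower()
```

To use it, save the contents of the `.md5` URL as text and pass it, together with the path of the downloaded tarball, to `matches_published`.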

Comment by Dwayne Bull [ 12/Oct/13 ]

Here are the differences in the CPU flags (flags common to both working and non-working servers removed):

Not working servers have these flags:
up aperfmperf pclmulqdq ssse3 fma sse4_1 sse4_2 aes f16c xop fma4 tce tbm perfctr_core arat cpb

Working servers have these flags:
3dnowext 3dnow

I really don't know if this has anything to do with the issue, but in case it helps.
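
For anyone repeating the comparison, the flag difference can be computed rather than done by eye. A minimal sketch in Python, assuming the two `flags` lines are copied from /proc/cpuinfo on a working and a non-working server (the values below are the ones posted in this ticket); `parse_flags` and `diff_flags` are hypothetical helper names.

```python
def parse_flags(flags_line):
    """Return the set of flag names from a /proc/cpuinfo 'flags' line."""
    # Accept either the full "flags : a b c" line or just the flag list.
    _, sep, rest = flags_line.partition(":")
    return set((rest if sep else flags_line).split())


def diff_flags(working_line, failing_line):
    """Return (only_on_working, only_on_failing) as sorted lists."""
    working = parse_flags(working_line)
    failing = parse_flags(failing_line)
    return sorted(working - failing), sorted(failing - working)


# Opteron 4170 HE (working) and Opteron 4332 HE (not working), from this ticket.
WORKING = (
    "flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ht "
    "syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good nopl pni cx16 "
    "popcnt hypervisor lahf_lm cmp_legacy extapic cr8_legacy abm sse4a "
    "misalignsse 3dnowprefetch"
)
FAILING = (
    "flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ht "
    "syscall nx mmxext fxsr_opt lm up rep_good nopl aperfmperf pni pclmulqdq "
    "ssse3 fma cx16 sse4_1 sse4_2 popcnt aes f16c hypervisor lahf_lm "
    "cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch xop "
    "fma4 tce tbm perfctr_core arat cpb"
)

only_working, only_failing = diff_flags(WORKING, FAILING)
print("only on working:", " ".join(only_working))
print("only on failing:", " ".join(only_failing))
```

Using sets means the order in which the kernel lists the flags does not matter, only which flags are present.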

Comment by Dwayne Bull [ 12/Oct/13 ]

If it helps I ran a few more cloud servers, here are the results:

Working 30g server
processor : 7
vendor_id : AuthenticAMD
cpu family : 16
model : 8
model name : AMD Opteron(tm) Processor 4170 HE
stepping : 1
microcode : 0x10000d9
cpu MHz : 2094.800
cache size : 512 KB
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good nopl pni cx16 popcnt hypervisor lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch
bogomips : 4189.60
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

Not working 2g & 0.5g server
processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD Opteron(tm) Processor 4332 HE
stepping : 0
microcode : 0x600081f
cpu MHz : 3000.128
cache size : 2048 KB
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm up rep_good nopl aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 popcnt aes f16c hypervisor lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch xop fma4 tce tbm perfctr_core arat cpb
bogomips : 6000.25
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

Working 2g server
processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : Quad-Core AMD Opteron(tm) Processor 2374 HE
stepping : 2
microcode : 0x1000086
cpu MHz : 2200.106
cache size : 512 KB
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu de tsc msr pae cx8 cmov pat clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good nopl pni cx16 popcnt hypervisor lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch
bogomips : 4400.21
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
