[SERVER-12849] mongod got signal 7 (Bus error) Created: 23/Feb/14  Updated: 10/Dec/14  Resolved: 07/Apr/14

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.4.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Taha Jahangir Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

One of our mongod servers suddenly existed with this backtrace:
mongod 2.4.8 on CentOS 6.5 (2.6.32-431.1.2.0.1.el6.x86_64) with replica-set setup (one another mongod and one arbiter)

Sat Feb 22 18:37:36.067 Invalid access at address: 0x7fdf7fe7e6c0 from thread: conn291963
 
Sat Feb 22 18:37:36.085 Got signal: 7 (Bus error).
 
Sat Feb 22 18:37:36.139 Backtrace:
0xde05e1 0x6d0559 0x6d0ae2 0x7fe9a06bf710 0x90bcbc 0x8658e7 0x868893 0x8699ec 0x8dd50a 0x8de8a3 0x8df5b2 0xa81c00 0xa864cc 0x9fa499 0x9fb9b3 0x6e83a8 0xdccbae 0x7fe9a06b79d1 0x7fe99fa5cb6d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde05e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6d0559]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x6d0ae2]
 /lib64/libpthread.so.0(+0xf710) [0x7fe9a06bf710]
 /usr/bin/mongod(_ZN5mongo7Helpers7findOneERKNS_10StringDataERKNS_7BSONObjERS4_b+0x5c) [0x90bcbc]
 /usr/bin/mongod(_ZN5mongo16CmdFindAndModify17runNoDirectClientERKSsRKNS_7BSONObjES5_S5_bbbRNS_14BSONObjBuilderERSs+0x107) [0x8658e7]
 /usr/bin/mongod(_ZN5mongo16CmdFindAndModify17runNoDirectClientERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x1013) [0x868893]
 /usr/bin/mongod(_ZN5mongo16CmdFindAndModify3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x9ac) [0x8699ec]
 /usr/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0x8dd50a]
 /usr/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xf43) [0x8de8a3]
 /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x5f2) [0x8df5b2]
 /usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x40) [0xa81c00]
 /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0xd7c) [0xa864cc]
 /usr/bin/mongod() [0x9fa499]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x383) [0x9fb9b3]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e83a8]
 /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdccbae]
 /lib64/libpthread.so.0(+0x79d1) [0x7fe9a06b79d1]
 /lib64/libc.so.6(clone+0x6d) [0x7fe99fa5cb6d]



 Comments   
Comment by Thomas Rueckstiess [ 07/Apr/14 ]

Hi Taha,

I noticed this ticket is still unresolved and we haven't heard back from you for a while. As we don't have enough information to continue the diagnosis, we have to assume that the hard drive failure is the source of the problem. You would have to fix the hardware issue and resync the node from a healthy source. I'll go ahead and close this ticket now. If you would like to follow up feel free to re-open it and provide additional information.

Thanks,
Thomas

Comment by Stennie Steneker (Inactive) [ 17/Mar/14 ]

Hi Taha,

Were you able to run fsck on all the drives? Can you confirm the current status of this issue?

It would also be helpful to run smartctl (part of smartmontools) to check for SMART hardware errors:

sudo  smartctl -a /dev/sdb

Thanks,
Stephen

Comment by Eliot Horowitz (Inactive) [ 10/Mar/14 ]

Can you run an fsck on all the drives?

Comment by Taha Jahangir [ 24/Feb/14 ]

The root mongo data directory (journal + lock file) is on sda, and db directories (`local` db and our db) is on sdb (`directoryperdb` is enabled)

Comment by Eliot Horowitz (Inactive) [ 24/Feb/14 ]

Is sdb where the mongo data lives?
Not sure what those errors are, not ones I can recall seeing, but seem related.

Comment by Taha Jahangir [ 24/Feb/14 ]

What you mean from history? This is a dedicated server runs a mongod instance.

logs in /var/log/messages:

Feb 22 18:37:03 servername kernel: ata4.00: exception Emask 0x0 SAct 0x1c SErr 0x0 action 0x0
Feb 22 18:37:03 servername kernel: ata4.00: irq_stat 0x40000008
Feb 22 18:37:03 servername kernel: ata4.00: failed command: READ FPDMA QUEUED
Feb 22 18:37:03 servername kernel: ata4.00: cmd 60/28:10:30:31:15/00:00:19:00:00/40 tag 2 ncq 20480 in
Feb 22 18:37:03 servername kernel:         res 41/40:28:30:31:15/00:00:19:00:00/00 Emask 0x409 (media error) <F>
Feb 22 18:37:03 servername kernel: ata4.00: status: { DRDY ERR }
Feb 22 18:37:03 servername kernel: ata4.00: error: { UNC }
Feb 22 18:37:03 servername kernel: ata4.00: configured for UDMA/133
...
Feb 22 18:37:18 servername kernel: ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 22 18:37:18 servername kernel: ata4.00: irq_stat 0x40000008
Feb 22 18:37:18 servername kernel: ata4.00: failed command: READ FPDMA QUEUED
Feb 22 18:37:18 servername kernel: ata4.00: cmd 60/28:00:30:31:15/00:00:19:00:00/40 tag 0 ncq 20480 in
Feb 22 18:37:18 servername kernel:         res 41/40:28:30:31:15/00:00:19:00:00/00 Emask 0x409 (media error) <F>
Feb 22 18:37:18 servername kernel: ata4.00: status: { DRDY ERR }
Feb 22 18:37:18 servername kernel: ata4.00: error: { UNC }
Feb 22 18:37:18 servername kernel: ata4.00: configured for UDMA/133
Feb 22 18:37:18 servername kernel: sd 3:0:0:0: [sdb] Unhandled sense code
Feb 22 18:37:18 servername kernel: sd 3:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 22 18:37:18 servername kernel: sd 3:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Feb 22 18:37:18 servername kernel: Descriptor sense data with sense descriptors (in hex):
Feb 22 18:37:18 servername kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 22 18:37:18 servername kernel:        19 15 31 30
Feb 22 18:37:18 servername kernel: sd 3:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Feb 22 18:37:18 servername kernel: sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 19 15 31 30 00 00 28 00
Feb 22 18:37:18 servername kernel: ata4: EH complete
...
Feb 22 18:37:36 servername kernel: ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Feb 22 18:37:36 servername kernel: ata4.00: irq_stat 0x40000008
Feb 22 18:37:36 servername kernel: ata4.00: failed command: READ FPDMA QUEUED
Feb 22 18:37:36 servername kernel: ata4.00: cmd 60/08:00:30:31:15/00:00:19:00:00/40 tag 0 ncq 4096 in
Feb 22 18:37:36 servername kernel:         res 41/40:08:30:31:15/00:00:19:00:00/00 Emask 0x409 (media error) <F>
Feb 22 18:37:36 servername kernel: ata4.00: status: { DRDY ERR }
Feb 22 18:37:36 servername kernel: ata4.00: error: { UNC }
Feb 22 18:37:36 servername kernel: ata4.00: configured for UDMA/133
Feb 22 18:37:36 servername kernel: sd 3:0:0:0: [sdb] Unhandled sense code
Feb 22 18:37:36 servername kernel: sd 3:0:0:0: [sdb] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Feb 22 18:37:36 servername kernel: sd 3:0:0:0: [sdb] Sense Key : Medium Error [current] [descriptor]
Feb 22 18:37:36 servername kernel: Descriptor sense data with sense descriptors (in hex):
Feb 22 18:37:36 servername kernel:        72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Feb 22 18:37:36 servername kernel:        19 15 31 30
Feb 22 18:37:36 servername kernel: sd 3:0:0:0: [sdb] Add. Sense: Unrecovered read error - auto reallocate failed
Feb 22 18:37:36 servername kernel: sd 3:0:0:0: [sdb] CDB: Read(10): 28 00 19 15 31 30 00 00 08 00
Feb 22 18:37:36 servername kernel: ata4: EH complete

mongo log does not contain anything more related to this errror, the four lines before the error is:

Sat Feb 22 18:37:31.581 [conn380536] end connection 172.20.43.10:18710 (78 connections now open)
Sat Feb 22 18:37:31.663 [initandlisten] connection accepted from 172.20.43.10:25708 #380997 (79 connections now open)
Sat Feb 22 18:37:34.544 [conn380994] end connection 172.20.43.10:39013 (78 connections now open)
Sat Feb 22 18:37:36.056 [conn351853] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1392657306000|1 } } cursorid:3546388934237044429 ntoreturn:0 keyUpdates:0 numYields: 11 locks(micros) r:18189936 nreturned:1203 reslen:20471 9095ms

Comment by Eliot Horowitz (Inactive) [ 24/Feb/14 ]

Hi,

What's the history of this machine?

Can you check /var/log/messages for any i/o errors?

Can you send the full mongod log?

Generated at Thu Feb 08 03:29:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.