[SERVER-6455] Invalid access at address: <hexaddr>; /usr/local/mongo/bin/mongod(_ZN5mongo15BSONObjIterator4nextEv+0x27) [0x51a6c7] Created: 15/Jul/12  Updated: 15/Aug/12  Resolved: 25/Jul/12

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bobby Joe Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: crash
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux <host> 2.6.18-238.12.1.el5 #1 SMP Sat May 7 20:18:50 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 44
model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz
stepping : 2
cpu MHz : 1596.000
cache size : 12288 KB
physical id : 1
siblings : 6
core id : 0
cpu cores : 6
apicid : 32
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx pdpe1gb rdtscp lm constant_tsc ida nonstop_tsc arat pni monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr sse4_1 sse4_2 popcnt lahf_lm
bogomips : 4788.14
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

^^ x12 (the processor entry above is repeated 12 times, i.e. 12 logical CPUs)


Attachments: traces-more.txt (text file), traces.txt (text file)
Operating System: ALL
Participants:

Description

An RS secondary (member of a 4-shard cluster) segfaulted abruptly during "normal" operation. At the time of the failure, nothing should have been querying the secondary: the live site doesn't do replica-set reads, and the batch iteration jobs that do run against the secondaries run at a different time of day. See the first backtrace.

Removed the lockfile and restarted; it crashed again. See the second backtrace. (Is removing the lockfile no longer the recommended procedure? If so, oops.)
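For context, the remove-and-restart was just the usual manual sequence, roughly the following (the dbpath and config path shown are illustrative, not necessarily the real ones; the mongod binary path is the one from the backtrace):

# illustrative paths only
rm /data/db/mongod.lock
/usr/local/mongo/bin/mongod -f /etc/mongod.conf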

Left the lockfile alone on the subsequent restart and the secondary returned to normal operation.

Let me know what other information would help. This isn't super high-priority for me since the system returned to normal relatively quickly, but it is somewhat troublesome.



Comments
Comment by Mathias Stearn [ 25/Jul/12 ]

Please reopen if this occurs again.

Comment by Bobby Joe [ 16/Jul/12 ]

OK, I'll grab the oplog if/when this recurs.

Comment by Matt Dannenberg [ 16/Jul/12 ]

Can't seem to reproduce this. If it happens again, check rs.status() to get the opTime of the most recent entry synced to that node and dump the oplog from that point forward. Hopefully we can figure out something useful from that.
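Roughly something like this (host/port and the timestamp value are placeholders for illustration):

# note the optime of the last entry synced to the affected secondary
mongo --port 27018 --eval 'printjson(rs.status().members)'

# then dump the oplog from that optime forward (substitute the observed Timestamp values)
mongo --port 27018 local --eval 'db.oplog.rs.find({ ts: { $gte: Timestamp(1342310400, 1) } }).forEach(printjson)' > oplog-tail.json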

Comment by Bobby Joe [ 15/Jul/12 ]

Interestingly, the node went down 3 milliseconds after exactly 00:00 local time (CST).

Comment by Bobby Joe [ 15/Jul/12 ]

More logs from this node are attached.

The node was down for 3 hours between the crash and the first restart. MMS confirms this (I think, if I'm reading the graph correctly).

Comment by Scott Hernandez (Inactive) [ 15/Jul/12 ]

Just that node, please.

Comment by Bobby Joe [ 15/Jul/12 ]

Nope, it's not stuck; it's back in service. Unfortunately it wasn't running with any extra -v verbosity.

Would you like the logs from just that machine, from the RS pair, or from the entire cluster?

(wow, that was fast)

Comment by Scott Hernandez (Inactive) [ 15/Jul/12 ]

Is it stuck in that state now? It sounds like it isn't, but if it is, can you restart without --quiet and with a higher logging level (--vvvvv) and then upload the logs?

Can you upload the full logs from an hour or two before the event through all of the restarts?
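For the restart, something along these lines should do (the config path is illustrative; the mongod path is the one from the backtrace, and quiet would need to be removed from the config or startup script first):

# remove 'quiet = true' (or --quiet) from the normal startup, then:
/usr/local/mongo/bin/mongod -f /etc/mongod.conf --vvvvv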

Comment by Bobby Joe [ 15/Jul/12 ]

Different from https://jira.mongodb.org/browse/SERVER-5438 since we run no MR jobs.

Different from https://jira.mongodb.org/browse/SERVER-6012 since I can't reproduce it with a specific query (i.e., there wasn't much going on on this instance when the crash occurred).
