[SERVER-6455] Invalid access at address: <hexaddr>; /usr/local/mongo/bin/mongod(_ZN5mongo15BSONObjIterator4nextEv+0x27) [0x51a6c7] Created: 15/Jul/12 Updated: 15/Aug/12 Resolved: 25/Jul/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 2.0.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bobby Joe | Assignee: | Matt Dannenberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | crash | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Linux <host> 2.6.18-238.12.1.el5 #1 SMP Sat May 7 20:18:50 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux; /proc/cpuinfo reports 12 processors |
||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
An RS secondary (member of a 4-shard cluster) segfaulted abruptly during "normal" operation. At the time of the failure, nothing should have been querying the secondary – the live site doesn't do RS reads, and the batch iteration jobs that run against the secondaries run at a different time. See the first backtrace. I removed the lockfile and restarted, and it crashed again; see the second backtrace (is removing the lockfile no longer the recommended procedure? If so, oops). I then left the lockfile alone and the secondary returned to normal operation. Let me know what other information would help. This isn't super high-priority for me since the system returned to normal relatively quickly, but it is somewhat troublesome. |
| Comments |
| Comment by Mathias Stearn [ 25/Jul/12 ] |
|
Please reopen if this occurs again |
| Comment by Bobby Joe [ 16/Jul/12 ] |
|
OK, I'll grab the oplog if/when this recurs. |
| Comment by Matt Dannenberg [ 16/Jul/12 ] |
|
Can't seem to reproduce this. If it happens again, check rs.status() to get the opTime of the most recent entry synced to that node and dump the oplog from that point forward. Hopefully we can figure out something useful from that. |
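For reference, a minimal mongo shell sketch of the procedure described above (not taken from the ticket; the Timestamp value is a placeholder, so substitute the optime that rs.status() actually reports on the affected node):

```js
// Run against the affected secondary.
// 1. Find this node's last applied optime from rs.status().
rs.status().members.forEach(function (m) {
    if (m.self) {
        print("last applied optime: " + tojson(m.optime));
    }
});

// 2. Dump the oplog from that optime forward.
//    Timestamp(1342328400, 1) is a placeholder; use the optime printed above.
var lastApplied = Timestamp(1342328400, 1);
db.getSiblingDB("local").oplog.rs.find({ ts: { $gte: lastApplied } })
  .sort({ ts: 1 })
  .forEach(printjson);
```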
| Comment by Bobby Joe [ 15/Jul/12 ] |
|
Interestingly, the node went down 3 milliseconds after exactly 00:00 local time (CST). |
| Comment by Bobby Joe [ 15/Jul/12 ] |
|
More logs from this node. The node was down for 3 hours between the crash and the first restart. MMS confirms this (I think, if I'm reading the graph correctly). |
| Comment by Scott Hernandez (Inactive) [ 15/Jul/12 ] |
|
Just that node please. |
| Comment by Bobby Joe [ 15/Jul/12 ] |
|
Nope, it's not stuck – it's back in service. Unfortunately it wasn't running with any extra -v verbosity. Would you like the logs from just that machine, from the RS pair, or from the entire cluster? (wow, that was fast) |
| Comment by Scott Hernandez (Inactive) [ 15/Jul/12 ] |
|
Is it stuck now in that state? It sounds like it isn't, but if it is, can you restart without --quiet and with higher logging levels (--vvvvv) and then upload the logs? Also, can you upload the full logs from before the event (an hour or two) and through all the restarts? |
| Comment by Bobby Joe [ 15/Jul/12 ] |
|
Different from https://jira.mongodb.org/browse/SERVER-5438, since we run no MR jobs. Different from https://jira.mongodb.org/browse/SERVER-6012, since I can't reproduce it with a specific query (i.e., there wasn't much going on on this instance when the crash occurred). |