[SERVER-10065] Simultaneous ReplicaSet Failure Possibly Due to MapReduce Created: 30/Jun/13 Updated: 11/Jul/16 Resolved: 29/Jul/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | JavaScript |
| Affects Version/s: | 2.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Adam Kirkton | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Both servers Windows Server 2008 R2 |
| Attachments: |
|
| Operating System: | Windows |
| Participants: |
| Description |
|
I'm not entirely sure what happened. There was a set of simultaneous exceptions that caused both servers in the replica set to fail. I am including the relevant logs and minidump files that were generated, in the hope that you can work out the cause. I can provide more background information as needed. |
| Comments |
| Comment by Stennie Steneker (Inactive) [ 29/Jul/13 ] |
|
Hi Adam,

Given that your Map/Reduce jobs have been running successfully for about six months without issue, it seems unlikely that they are related to the recent memory access violations. In particular, the EXCEPTION_IN_PAGE_ERROR noted by Tad earlier usually indicates a hardware issue. It sounds like the errors are affecting all of the data nodes in your replica set?

Given the possibility of data storage problems, you should ideally resync from a known good member of the replica set rather than repairing a node and potentially introducing inconsistency. If you haven't done a repair yet, I would suggest first running db.collection.validate(true) on the collection(s) used by your Map/Reduce jobs; this will report obvious data structure errors in those collections. Note that validate(true) can be resource intensive, as it has to scan the collection and its objects, so you would ideally run it during an off-peak or maintenance window.

Per your suggestion, I'm going to close this issue as there does not appear to be anything further to investigate at this time.

Regards, |
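As a minimal sketch of the suggested check from the mongo shell, assuming the Map/Reduce output lives in a database named mydb and a collection named mr_results (both placeholder names):

```javascript
// Full validation scans every document and index entry, so run it during
// an off-peak or maintenance window.
var coll = db.getSiblingDB("mydb").getCollection("mr_results"); // placeholder names
var result = coll.validate(true);                               // full (deep) validation
printjson(result);                                              // check for valid: false or reported errors
```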
| Comment by Adam Kirkton [ 15/Jul/13 ] |
|
Thanks for the information. Unfortunately, there isn't anything in the event log other than the services restarting. I am planning on upgrading to 2.4.5 soon and running the reports again. I will also look into doing a repair run on my db in case there is some corruption.

Would sharing the map/reduce logic I have help at all with regard to the access violations? These particular map/reduce processes had been running once an hour for about six months before they just started dying, so the failures are probably related to the other problem, but there may not be a good way to tell. If there's nothing else, you can go ahead and close the ticket. |
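If a repair does turn out to be necessary, a minimal sketch from the mongo shell might look like the following, with mydb as a placeholder database name; as noted elsewhere on this ticket, resyncing from a healthy replica set member is generally preferable to repairing a node.

```javascript
// repairDatabase() rewrites all of the current database's data files and
// needs free disk space roughly equal to the size of the data set; in a
// replica set, resyncing from a healthy member is generally preferable.
var res = db.getSiblingDB("mydb").repairDatabase(); // "mydb" is a placeholder
printjson(res);                                      // { "ok" : 1 } on success
```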
| Comment by Tad Marshall [ 14/Jul/13 ] |
|
The stack traces in the logs show access violations while calling mongo::Scope::loadStored() and EXCEPTION_IN_PAGE_ERROR while calling mongo::PageFaultException::touch().

Access violations are caused by referencing memory that is not mapped for the reference type (e.g. read or write) and usually indicate a program logic error. EXCEPTION_IN_PAGE_ERROR (exception 0xC0000006) is caused by a failure to load a properly referenced memory location and usually indicates a hardware issue: a disk failure if reading directly from a local disk drive, or a network error if reading over the network.

Can you check the Windows event log to see if there is a hardware event recorded that matches the time of the failures? The EXCEPTION_IN_PAGE_ERROR (exception 0xC0000006) failures happened on db01 at:

The access violations suggest possible corruption in the database, at least as seen by the running mongod.exe; this may be related to (e.g. caused by) the hardware failure that produced the EXCEPTION_IN_PAGE_ERRORs. |
| Comment by Adam Kirkton [ 30/Jun/13 ] |
|
Second database server minidump file |
| Comment by Adam Kirkton [ 30/Jun/13 ] |
|
Second database server log file |
| Comment by Adam Kirkton [ 30/Jun/13 ] |
|
Dump file from server 1, for the restart that happened at 10:41 PM EDT (based on the file date). |
| Comment by Adam Kirkton [ 30/Jun/13 ] |
|
Log file from server 1 |