[SERVER-10065] Simultaneous ReplicaSet Failure Possibly Due to MapReduce Created: 30/Jun/13  Updated: 11/Jul/16  Resolved: 29/Jul/13

Status: Closed
Project: Core Server
Component/s: JavaScript
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Adam Kirkton Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Both servers Windows Server 2008 R2
4 GB of RAM
Xeon X3450 2.67GHz


Attachments: File db01.dmp, Text File db01.txt, File db02.dmp, Text File db02.txt
Operating System: Windows

 Description   

I'm not entirely sure what happened. I am including the relevant logs and minidump files that were generated, in the hope that you can work out the cause. A set of simultaneous exceptions caused both servers in the replica set to fail.

I can provide more background information as needed.



 Comments   
Comment by Stennie Steneker (Inactive) [ 29/Jul/13 ]

Hi Adam,

Given that your Map/Reduce jobs have been running successfully for about six months without issue, it seems unlikely that they are related to the recent memory access violations. In particular, the EXCEPTION_IN_PAGE_ERROR, as Tad noted earlier, usually indicates a hardware issue.

It sounds like the errors are affecting all of the data nodes in your replica set. In a case of potential data storage problems, you should ideally resync from a known good member of the replica set rather than repairing a node and potentially introducing inconsistency.
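For example, a minimal mongo shell check (the output format here is only illustrative) to confirm which members report healthy before choosing a sync source:

// Print each member's reported state and health so you can pick a known good sync source.
rs.status().members.forEach(function (m) {
    print(m.name + " -> " + m.stateStr + " (health: " + m.health + ")");
});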

If you haven't done a repair yet, I would suggest first running db.collection.validate(true) on the collection(s) used by your Map/Reduce jobs. This will report whether there are obvious data structure errors in those collections. Note that validate(true) can be resource intensive, as it has to scan the collection and its objects; ideally, run it during an off-peak or maintenance window.
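As a rough sketch (the collection names below are placeholders for whatever your Map/Reduce jobs actually read from and write to):

// Run a full validation of each collection involved in the Map/Reduce jobs.
// validate(true) scans every document and index, so it can take a while on large collections.
["mr_source", "mr_results"].forEach(function (name) {
    var result = db.getCollection(name).validate(true);
    print(name + " valid: " + result.valid);
});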

Per your suggestion, I'm going to close this issue as there does not appear to be anything further to investigate at this time.

Regards,
Stephen

Comment by Adam Kirkton [ 15/Jul/13 ]

Thanks for the information. Unfortunately, there isn't anything in the event log other than the services restarting. I am planning to upgrade to 2.4.5 soon and run the reports again. I will also look into doing a repair run on my database in case there is some corruption. Would sharing the map/reduce logic I have help at all with regard to the access violations? These particular map/reduce processes had been running once an hour for about six months before they just started dying, so the failures are probably related to the other problem, though there may not be a good way to tell. If there's nothing else, you can go ahead and close the ticket.
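For what it's worth, the jobs follow roughly this shape (a minimal sketch only; the collection, field, and output names here are placeholders rather than the real logic):

// Hourly job: count documents per key over the last hour and merge the results into an output collection.
var mapFn = function () { emit(this.someKey, 1); };
var reduceFn = function (key, values) { return Array.sum(values); };
db.getCollection("events").mapReduce(mapFn, reduceFn, {
    query: { ts: { $gte: new Date(Date.now() - 60 * 60 * 1000) } },
    out: { merge: "hourly_counts" }
});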

Comment by Tad Marshall [ 14/Jul/13 ]

The stack traces in the logs show access violations while calling mongo::Scope::loadStored() and EXCEPTION_IN_PAGE_ERROR while calling mongo::PageFaultException::touch().

Access violations are caused by referencing memory that is not mapped for the reference type (e.g. read or write) and usually indicate a program logic error.

EXCEPTION_IN_PAGE_ERROR (exception 0xC0000006) is caused by a failure to load a properly referenced memory location, and usually indicates a hardware issue; a disk failure if reading directly from a local disk drive or a network error if reading over the network.

Can you check the Windows event log to see if there is a hardware event recorded that matches the time of the failures?

The EXCEPTION_IN_PAGE_ERROR (exception 0xC0000006) failures happened on db01 at:
Sat Jun 29 22:21:59.398 [conn173662] *** unhandled exception 0xC0000006 at 0x000000013FD302D4, terminating
Sat Jun 29 22:41:33.704 [conn23] *** unhandled exception 0xC0000006 at 0x00000001400C02D4, terminating

The access violations suggest possible corruption in the database, at least as seen by the running mongod.exe; this may be related to (e.g. caused by) the hardware failure that produced the EXCEPTION_IN_PAGE_ERRORs.

Comment by Adam Kirkton [ 30/Jun/13 ]

Second database server mini dump file

Comment by Adam Kirkton [ 30/Jun/13 ]

Second database server log file

Comment by Adam Kirkton [ 30/Jun/13 ]

Dump file from server 1, for the restart that happened at 10:41 PM EDT (based on the file date).

Comment by Adam Kirkton [ 30/Jun/13 ]

Log file from server 1
