[SERVER-21191] Mongo node (Primary) failed suddenly Created: 29/Oct/15 Updated: 28/Mar/16 Resolved: 28/Mar/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Admin |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Lucas | Assignee: | Kelsey Schubert |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
I have one MongoDB cluster (using a replica set) with 1 Mongo primary (3.0.5 - 100GB oplog). All servers are hosted on Ubuntu 14.04 (Trusty). This cluster is used for reads and heavy writes (many per second; after all, it is a datasource). Every write uses the REPLICA_ACKNOWLEDGED WriteConcern (w = 2) to prevent desynchronization.

Today at 11AM my primary node failed and I can't see any reason for it... Here is the event log, with two operations besides the error, one before and (even after the abort) one after (archive first.log - text is too long). After that, my re-initialization script got the mongo node running again, and some operations took place smoothly until this error (archive second.log - text is too long). After that, all subsequent startup attempts failed with the following error (archive third.log - text is too long).

After these failures, I tried to start mongo with the --repair parameter, but it gave me WT_PANIC too. After that, I upgraded my mongo node to 3.0.7 and started --repair again (which is still running).

I know you (jira contributors) have nothing to do with this, but this error has hit me for the second time (WT_PANIC: WiredTiger library panic). The first time, you said the problem was solved in a custom branch (created for my previous issue) and then merged into the next version. It's hard to trust MongoDB in production environments after these errors...

Can someone help me find the possible cause of the errors mentioned above? I really need to know the cause and ways to avoid them...

Note: I attached some pictures of server disk usage. The picture named 1.png is from the time of the first occurrence. The picture named 2.png covers the whole day (my node has been repairing since 4PM). |
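REPLICA_ACKNOWLEDGED is a write-concern constant exposed by the MongoDB Java driver, so the writes were presumably issued from a Java application. A minimal sketch of what such a w=2 write looks like, assuming the 3.x Java driver; the connection string, database, and collection names below are placeholders, not details taken from this ticket:

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.WriteConcern;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class AckedWrite {
    public static void main(String[] args) {
        // Hypothetical replica-set URI; host names and set name are placeholders.
        MongoClient client = new MongoClient(
                new MongoClientURI("mongodb://host1:27017,host2:27017/?replicaSet=rs0"));

        // REPLICA_ACKNOWLEDGED is equivalent to w=2: the insert does not return
        // until the primary and one secondary have acknowledged the write.
        MongoCollection<Document> coll = client.getDatabase("mydb")
                .getCollection("events")
                .withWriteConcern(WriteConcern.REPLICA_ACKNOWLEDGED);

        coll.insertOne(new Document("ts", System.currentTimeMillis()));
        client.close();
    }
}
```

Note that w=2 can only be acknowledged while at least two data-bearing members are reachable; with a single surviving node, as described above, such writes cannot complete (they block, or fail if a wtimeout is set) until the replica set has two healthy members again.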
| Comments |
| Comment by Ramon Fernandez Marina [ 28/Mar/16 ] |
|
lucasoares, we haven't heard back from you for some time, so we're going to close this ticket for the time being. If this is still an issue for you, or if you run into a server bug, please open a new ticket. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. Regards, |
| Comment by Kelsey Schubert [ 04/Mar/16 ] |
|
Hi lucasoares, Sorry this ticket slipped through the cracks; is this still an issue for you? The logs that you have provided indicate that the server encountered data corruption and shut down to preserve the integrity of the rest of the dataset. It is likely that the data corruption was the result of faulty disk drives or power failures. However, identifying the exact cause can be challenging without a clear reproduction. If you are continuing to experience issues with data corruption, we would recommend a thorough integrity check of your disk drives. Kind regards, |
| Comment by Lucas [ 30/Oct/15 ] |
|
About the log, exactly!! There is no more log info (except a lot of updates, insertions and finds) in the first log. I searched the last two thousand lines and found nothing (the log file is 15G =/). About the validate command: no, I didn't. I will wait for the repair (or, if it fails, rsync) to get a replica set with at least 2 nodes again. But the problem is: why did my server shut down the first time? (first.log) My data in mongo is separated into many collections (300+). How would validation work in this case? |
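To make the 300+ collections question concrete: the validate command runs per collection, so checking every collection is just a loop over the collection names. A minimal sketch, again assuming the Java driver (host, port, and database name are placeholders), and best run against a node that is out of rotation, since full validation is expensive:

```java
import com.mongodb.MongoClient;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class ValidateAll {
    public static void main(String[] args) {
        // Placeholder host and database name; point this at the node being checked.
        MongoClient client = new MongoClient("localhost", 27017);
        MongoDatabase db = client.getDatabase("mydb");

        for (String name : db.listCollectionNames()) {
            // full:true performs a deeper (and much slower) scan of the
            // collection's data and indexes than the default check.
            Document result = db.runCommand(
                    new Document("validate", name).append("full", true));
            System.out.println(name + " -> valid=" + result.get("valid"));
        }
        client.close();
    }
}
```

Because this reads every document and index entry, running it against a heavily written production primary adds noticeable load; validating the repaired or standalone copy before rejoining it to the replica set is usually the safer sequence.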
| Comment by Ramon Fernandez Marina [ 29/Oct/15 ] |
|
lucasoares, I looked at the other issues you reported and found your previous one. Did you ever run validate() on your data after the unclean shutdown with 3.0.2/3.0.4? Thanks, |
| Comment by Ramon Fernandez Marina [ 29/Oct/15 ] |
|
lucasoares, unfortunately the log snippet uploaded in first.log does not contain enough information; could you please upload the full logs for this node so we can investigate what happened? I took a quick look at the second.log and third.log files, and I believe the messages in them can be explained by the failure in first.log, so I think we should look at a full log for that node first. Thanks, |