[SERVER-23027] Unrecovering replication delay and crashing of server Created: 09/Mar/16 Updated: 06/May/16 Resolved: 06/May/16
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.3 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Varun Vijayaraghavan | Assignee: | Kelsey Schubert |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Participants: |
| Description |
We've been having a set of intermittent issues while performing some upgrades to our cluster. I haven't been able to reproduce it, so I'm not filing this as a bug yet. We're a) converting a standalone mongo instance into a replica set in phases, b) upgrading to bigger AWS instances with higher disk IOPS, and c) using mongo 3.2.3 for the new instances (the initial standalone instance is at 3.0.8). There are 5 instances in total: the old primary, 3 new secondaries, and 1 arbiter. They are all running WiredTiger. There are some properties of the cluster that are worth noting.
We're noticing that SECONDARIES are getting into a state of steadily increasing replication delay [see the server-status-slow file for logs around this time]. After several hours of replication delay, one of the secondaries simply crashed.

Around this time, we were performing fairly heavy writes on the PRIMARY. The disk read IOPS on the primary, as reported by AWS, was about 1000, with a maximum of 1500; writes were at roughly 500 IOPS.

In one case (ip-10-0-0-233), we "fixed" the replication delay by restarting the server; the delay immediately dropped to 0 [see the replication-delay-drop image]. On another secondary, restarting did not fix the replication delay: the node was not able to find a server from which it could replicate safely. The log message contained:

We were never able to recover the crashed secondary. Every restart resulted in it crashing again with a message that looked like this:

This is impeding important operational tasks we need to do, so we'd really like some insight into what could have caused it. Let me know if there is any other information I can provide that would be useful. Unfortunately, I don't have the logs for the crashed mongo instance, but I can attach logs for the other instance. That said, the same issue happened a few days ago on another instance, and if necessary I might be able to dig those up.
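For reference, the lag described above can be tracked with the replSetGetStatus command; the following is a minimal sketch using pymongo, in which the host, port, and replica set name are placeholder assumptions rather than the cluster's actual configuration.

```python
# Minimal sketch: report how far each SECONDARY is behind the PRIMARY,
# based on the optimeDate fields returned by replSetGetStatus.
# The connection string and replica set name below are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://ip-10-0-0-233:27017/?replicaSet=rs0")
status = client.admin.command("replSetGetStatus")

primary = next((m for m in status["members"] if m["stateStr"] == "PRIMARY"), None)

if primary is None:
    print("no primary visible from this node")
else:
    for m in status["members"]:
        if m["stateStr"] == "SECONDARY":
            lag = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
            print(f"{m['name']}: {lag:.0f}s behind the primary")
```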
| Comments |
| Comment by Kelsey Schubert [ 06/May/16 ] | |
Hi varun@x.ai, We haven't heard back from you for some time, so I'm resolving this ticket for now. If this is still an issue for you, please provide the requested information and we will continue to investigate. Regards, Kelsey
| Comment by Kelsey Schubert [ 11/Apr/16 ] | |
Hi varun@x.ai, We still need additional information to diagnose the problem. If this is still an issue for you, can you please upload the diagnostic.data and logs for the affected nodes? Thank you, Kelsey
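For reference, the diagnostic.data directory is kept under each node's dbpath; the following is a minimal sketch of bundling it together with the mongod log for upload, assuming common default paths (adjust dbpath and logpath to match each node's actual configuration).

```python
# Minimal sketch: bundle diagnostic.data and the mongod log into one archive.
# The paths below are assumptions (common defaults); substitute the dbpath
# and logpath from the affected node's configuration file.
import os
import tarfile

dbpath = "/var/lib/mongo"                # assumed dbpath
logpath = "/var/log/mongodb/mongod.log"  # assumed logpath

with tarfile.open("server-23027-node-diagnostics.tar.gz", "w:gz") as tar:
    tar.add(os.path.join(dbpath, "diagnostic.data"), arcname="diagnostic.data")
    tar.add(logpath, arcname="mongod.log")
```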
| Comment by Kelsey Schubert [ 18/Mar/16 ] | |
Hi varun@x.ai, Sorry for the delay getting back to you. The behavior that you are describing is actually the result of two different issues. First, let us discuss the secondary which has the error message:
This error message indicates that a document on the secondary has suffered disk corruption. Determining the exact cause of this corruption is generally not worthwhile. However, if data corruption issues persist, I would recommend a thorough integrity check of the affected node's disk drives. To address this issue, please execute a clean resync on the affected node (a sketch of the procedure is included below).

Second, I would like to discuss the replication delay that grows until the server is restarted. To continue to investigate this issue, please answer the following questions:
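Coming back to the clean resync recommended above: the usual procedure is to stop the affected mongod, empty its dbpath, and restart it so that it performs a fresh initial sync from another member. The following is a minimal sketch that assumes a systemd-managed mongod service and a dbpath of /var/lib/mongo; both are assumptions, so use the node's actual service name and dbpath, and confirm the remaining members hold a complete copy of the data before wiping anything.

```python
# Minimal sketch of a resync via initial sync on the broken secondary only.
# Assumes a systemd-managed mongod and dbpath=/var/lib/mongo; adjust both
# to the node's real configuration before running.
import os
import shutil
import subprocess

DBPATH = "/var/lib/mongo"

subprocess.run(["systemctl", "stop", "mongod"], check=True)   # stop the node
shutil.rmtree(DBPATH)                                         # remove all data files
os.makedirs(DBPATH)                                           # recreate an empty dbpath
shutil.chown(DBPATH, user="mongod", group="mongod")           # restore ownership
subprocess.run(["systemctl", "start", "mongod"], check=True)  # restart; initial sync begins
```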
Please also consider upgrading to 3.2.4, which contains a number of relevant fixes. Kind regards, Kelsey