[DOCS-3301] MMS OnPrem Backup: Documentation for Backup Alerts Service Created: 30/Apr/14 Updated: 16/Mar/15 Resolved: 21/May/14 |
|
| Status: | Closed |
| Project: | Documentation |
| Component/s: | Cloud Manager |
| Affects Version/s: | None |
| Fix Version/s: | v1.3.5, mms-1.4 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Cailin Nelson | Assignee: | Bob Grabar |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Participants: | |
| Days since reply: | 9 years, 24 weeks, 1 day ago |
| Description |
|
MMS OnPrem Backup includes an alerting service that notifies system administrators when it detects problems with the system. Document the alerts that may be received, what each one means, and what steps should be taken to resolve it. |
| Comments |
| Comment by Githook User [ 02/Sep/14 ] |
|
Author: Bob Grabar (username: bgrabar, email: bob.grabar@10gen.com) Message: |
| Comment by Githook User [ 21/May/14 ] |
|
Author: Bob Grabar (username: bgrabar, email: bob.grabar@10gen.com) Message: |
| Comment by Githook User [ 21/May/14 ] |
|
Author: Bob Grabar (username: bgrabar, email: bob.grabar@10gen.com) Message: |
| Comment by Cailin Nelson [ 12/May/14 ] |
Backup Agent Down
If the Backup Agent for any Group with at least one active replica set or cluster is down for more than 1 hour, this alert is triggered. To resolve:

Backups Broken
If the MMS Backup system detects an inconsistency, the Backup state for that replica set will be marked as "broken". To debug:
1. Check the corresponding Backup Agent log. If you see a "Failed Common Points" test, one of the following may have happened.
2. Check the corresponding log file by going to Admin : Jobs : SomeJob : Logs
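
As a rough illustration of the Backup Agent Down condition described earlier in this comment (an agent down for more than 1 hour in a Group with at least one active replica set or cluster), the rule could be sketched as follows. This is only a sketch of the stated condition, not the MMS implementation: the `GroupStatus` type, its fields, and the `backup_agent_down` function are hypothetical names introduced for illustration, and "down" is assumed to mean that no ping has been received from the agent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Threshold taken from the alert description: down for more than 1 hour.
AGENT_DOWN_THRESHOLD = timedelta(hours=1)


@dataclass
class GroupStatus:
    """Hypothetical view of one Group's backup state (not an MMS type)."""
    name: str
    active_backup_jobs: int     # active replica sets or clusters in the Group
    last_agent_ping: datetime   # assumed: last time the Backup Agent was seen


def backup_agent_down(group: GroupStatus, now: datetime) -> bool:
    """Return True when the Backup Agent Down condition holds for the Group."""
    # Groups with no active replica set or cluster never trigger the alert.
    if group.active_backup_jobs < 1:
        return False
    # "More than 1 hour" is a strict comparison: exactly 1 hour does not alert.
    return now - group.last_agent_ping > AGENT_DOWN_THRESHOLD
```

For example, a Group with one active replica set whose agent was last seen two hours ago would satisfy the condition, while a Group with no active backups would not, regardless of how long its agent has been silent.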

Clustershot Failed
If the MMS Backup system cannot successfully take a clustershot for a sharded cluster backup, this alert will be generated. The alert text should contain the reason for the problem. Common problems include:

Bind Failure
If a new replica set cannot be bound to a Backup Daemon, this alert will be generated. The alert text should contain a reason for the problem. Common problems include:
To resolve, either fix the issue above and then re-initiate the initial sync, or bind the job manually by going to Admin : Job Timeline.

Snapshot Behind Snitch
If the latest snapshot for a replica set is significantly behind schedule, this alert will be triggered. Check the job log at Admin : Jobs : JobName : Logs for any obvious errors. |