[DOCS-3301] MMS OnPrem Backup: Documentation for Backup Alerts Service Created: 30/Apr/14  Updated: 16/Mar/15  Resolved: 21/May/14

Status: Closed
Project: Documentation
Component/s: Cloud Manager
Affects Version/s: None
Fix Version/s: v1.3.5, mms-1.4

Type: Task Priority: Major - P3
Reporter: Cailin Nelson Assignee: Bob Grabar
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:
Days since reply: 9 years, 24 weeks, 1 day ago

 Description   

MMS OnPrem Backup includes an alerting service that will alert the system administrators when problems with the system are detected. Document the various alerts that may be received, what they mean, and what steps should be taken to resolve them.



 Comments   
Comment by Githook User [ 02/Sep/14 ]

Author: Bob Grabar (bgrabar) <bob.grabar@10gen.com>

Message: DOCS-3301 document backup alerts service
Branch: next
https://github.com/10gen/mms-docs/commit/2ccf229872be2003f4907e3e9f38bb468db977b9

Comment by Githook User [ 21/May/14 ]

Author: Bob Grabar (bgrabar) <bob.grabar@10gen.com>

Message: DOCS-3301 document backup alerts service
Branch: v1.4
https://github.com/10gen/mms-docs/commit/8f373169e00a9d845206e2fb2c0300ed2298b15e

Comment by Githook User [ 21/May/14 ]

Author: Bob Grabar (bgrabar) <bob.grabar@10gen.com>

Message: DOCS-3301 document backup alerts service
Branch: master
https://github.com/10gen/mms-docs/commit/2ccf229872be2003f4907e3e9f38bb468db977b9

Comment by Cailin Nelson [ 12/May/14 ]

Backup Agent Down

If the Backup Agent for any Group with at least one active replica set or cluster is down for more than 1 hour, this alert is triggered.

To resolve:
1. Locate the Group in the MMS interface.
2. Click Backup -> Backup Agents to see which server hosts the Backup Agent.
3. Check the Backup Agent log file on that server.
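
For illustration only, the trigger condition described above could be sketched as a small Python check. The function name, its arguments, and the idea of a "last ping" timestamp are assumptions made for this sketch; they are not taken from the MMS code or API.

```python
from datetime import datetime, timedelta

# Threshold taken from the description above: down for more than 1 hour.
DOWN_THRESHOLD = timedelta(hours=1)


def backup_agent_down(last_ping, active_backup_count, now):
    """Return True if the alert condition described above holds.

    last_ping           -- datetime of the agent's last contact (hypothetical input)
    active_backup_count -- active replica sets or clusters in the Group
    now                 -- current time, passed in so the check is easy to test
    """
    # Groups with nothing actively backed up never raise this alert.
    if active_backup_count < 1:
        return False
    # "More than 1 hour" is a strict comparison.
    return now - last_ping > DOWN_THRESHOLD
```

With these assumptions, an agent last seen 61 minutes ago in a Group with one active replica set would satisfy the condition, while one last seen exactly 60 minutes ago would not.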

Backups Broken

If the MMS Backup system detects an inconsistency, it marks the Backup state for that replica set as "broken". To debug:

1. Check the corresponding Backup Agent log. If you see a "Failed Common Points" test, one of the following may have happened:

  • A significant rollback event occurred on the backed up replica set. In this case a resync of the replica set will be required.
  • The oplog collection of the backed up replica set was resized or deleted. In this case a resync of the replica set will be required.

2. Check the corresponding job log by going to Admin : Jobs : SomeJob : Logs.

  • An error message explaining the problem should appear. Contact MongoDB Support if you need help interpreting it.
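
As a rough aid for step 1, a log can be scanned for the "Failed Common Points" phrase with a few lines of Python. This is a minimal sketch: it assumes only that the phrase appears somewhere in the relevant log lines, and the function name is hypothetical. The actual Backup Agent log format and location are not specified here.

```python
def find_common_points_failures(log_lines):
    """Return (line_number, text) pairs for lines mentioning the failed test.

    log_lines -- any iterable of strings, for example an open log file.
    The match is a case-insensitive substring search; the real log wording
    may differ, so treat hits as a starting point for manual inspection.
    """
    needle = "failed common points"
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if needle in line.lower()
    ]
```

An open file object can be passed directly as `log_lines`. Any hit suggests checking for a rollback or an oplog change, as described in the bullets above.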

Clustershot Failed

If the MMS Backup system cannot successfully take a clustershot for a sharded cluster backup, this alert will be generated. The alert text should contain the reason for the problem. Common problems include:

  • There was no reachable mongos. To resolve, ensure that there is at least one mongos showing in Hosts : Mongos.
  • The balancer could not be stopped. To resolve, check the log files for the first config server to determine why the balancer will not stop.
  • Could not insert a token in one or more shards. To resolve, ensure connectivity between the Backup Agent and all shards.
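
The three common problems above can be summarized as a lookup table, sketched here in Python. The reason keys are hypothetical labels invented for this example; the real alert text may be worded differently, so mapping an alert onto a key is left to the reader.

```python
# Hypothetical reason keys mapped to the resolutions listed above.
CLUSTERSHOT_FIXES = {
    "no_reachable_mongos": (
        "Ensure at least one mongos is showing in Hosts : Mongos."
    ),
    "balancer_not_stopped": (
        "Check the log files for the first config server to determine "
        "why the balancer will not stop."
    ),
    "token_insert_failed": (
        "Ensure connectivity between the Backup Agent and all shards."
    ),
}

# Fallback for reasons not covered by the list above.
DEFAULT_FIX = "See the alert text for the reason."


def suggest_fix(reason):
    """Return the suggested resolution for a (hypothetical) reason key."""
    return CLUSTERSHOT_FIXES.get(reason, DEFAULT_FIX)
```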

Bind Failure

If a new replica set cannot be bound to a Backup Daemon, this alert will be generated. The alert text should contain a reason for the problem. Common problems include:

  • No primary found. At the time the binding occurred, no primary could be detected by the Monitoring Agent. Ensure that the replica set is healthy.
  • Not enough space available on any Backup Daemon.

To resolve, either fix the issue above and then re-initiate the initial sync, or bind the job manually by going to Admin : Job Timeline.

Snapshot Behind Snitch

If the latest snapshot for a replica set is significantly behind schedule, this alert will be triggered. Check the job log at Admin : Jobs : JobName : Logs for any obvious errors.
