- Type: Task
- Resolution: Unresolved
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Monguard
Atlas MMS uses certain logs from the Monguard process to provide alerting and monitoring capabilities. These logs feed AFM, analytics, DW queries, and alerts based on specific patterns across Atlas (e.g., “all Fatal Assertion X across M70+ in us-east-1”).
These logs are monitored by the Agent-embedded Filebeat spooler, which applies a rule’s query and projection locally and sends only matching lines to MMS via dedicated log ingestion APIs. MMS persists matches in nds.logIngestion.logs (TTL 7 days), then exports them nightly to DW and exposes them via a Private API (keyset-paginated).
Atlas Log Ingestion
- Centered on Log Ingestion Rules: each rule defines query, projection, resultsPerHour, logType, sourceClusters constraints, etc.
  https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion
- Rules are created by the Atlas Clusters team via HELP tickets; requesters fill out the “Log Ingestion Rule Request Form”.
Tasks
Based on the mongotune pattern (scope + design + rule forms + AFM integration), these are the concrete steps Monguard should own or co-own:
- Define a Monguard log schema for ingestible events
  - Enumerate the event types you want AFM / alerts / activity feed to see (at minimum: faults, serious proxy failures, maybe rollout transitions).
    https://docs.google.com/document/d/1MFuf69FT-HRIl84wYGPjP063diCZfxot__XbftjN6n0
  - For each, pick a stable numeric id and define the attr payload (e.g., fault_name, source_ip, username, policy name). This is exactly what mongotune did before asking for rules.
    https://docs.google.com/document/d/1ro2k_pUFco5gO_eQ568mmgHeEBmXq8Uz4_ksm5dglB4
- Add ingestion/alert flags to Monguard logs (mirroring mongotune)
  - For candidate events, emit:
    - attr.shouldIngest: true, meaning “this line should go through log ingestion”.
    - attr.shouldAlert: true|false, indicating whether the event should eventually raise an alert or is only informational.
    - attr.isDryRun: true|false, for any dry-run behavior you don’t want exposed.
      https://docs.google.com/document/d/1bivkuEv4Dx71SNA4lkRy2hjQIY3lbtY49bW1NMFebM4 https://docs.google.com/document/d/1ro2k_pUFco5gO_eQ568mmgHeEBmXq8Uz4_ksm5dglB4
  - Default semantics: if the field is missing, treat it as false (same as mongotune).
    https://docs.google.com/document/d/1bivkuEv4Dx71SNA4lkRy2hjQIY3lbtY49bW1NMFebM4
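The default semantics for the flags can be sketched as follows. This is a minimal illustration, not Monguard or agent code: the helper name `read_flags` and the sample log line (including the id 9000001 and the fault name) are hypothetical, and only the three attr flag names come from this ticket.

```python
import json

# Flag names taken from this ticket; everything else here is illustrative.
FLAGS = ("shouldIngest", "shouldAlert", "isDryRun")


def read_flags(line: str) -> dict:
    """Return the three flags for one JSON log line, treating a missing
    (or non-boolean-true) field as False."""
    entry = json.loads(line)
    attr = entry.get("attr", {})
    return {name: attr.get(name, False) is True for name in FLAGS}


# Hypothetical log line that sets only shouldIngest.
sample = json.dumps({
    "c": "MONGUARD",
    "id": 9000001,  # hypothetical id
    "msg": "Fault detected",
    "attr": {"fault_name": "example_fault", "shouldIngest": True},
})
flags = read_flags(sample)
```

Here `flags["shouldAlert"]` and `flags["isDryRun"]` are both False because the fields are absent, which matches the "missing means false" rule above.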
- Ensure all Monguard logs used for ingestion follow server JSON structure
  - Keep the standard top-level envelope (t/s/c/id/ctx/msg/attr) for all ingestible Monguard events, as mongotune and mongod do.
    https://docs.google.com/document/d/1bivkuEv4Dx71SNA4lkRy2hjQIY3lbtY49bW1NMFebM4 https://docs.google.com/document/d/1skIr4-DlMUSScruQhM7m0HTrFj9HfEhZK1etK7EkIaw
  - Confirm the log file path Monguard writes to (e.g. /srv/mongodb/monguard/monguard.log) is the one the agent’s log-ingestion Filebeat spooler will tail, not just Fluentbit.
    https://docs.google.com/document/d/1sejZy15R7aMs9D070g0FamdE0NKSSey4U4dXofAIQD8 https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion
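As a rough illustration of the envelope, the sketch below builds one log line with the t/s/c/id/ctx/msg/attr keys and checks a line for missing keys. The function names, the default severity and ctx values, and the id 9000001 are assumptions made for the example; the `{"$date": ...}` timestamp wrapper follows the shape mongod uses in its structured logs.

```python
import json
from datetime import datetime, timezone

# Top-level keys named in this ticket.
ENVELOPE_KEYS = ("t", "s", "c", "id", "ctx", "msg", "attr")


def make_log_line(log_id: int, msg: str, attr: dict,
                  severity: str = "I", ctx: str = "main") -> str:
    """Serialize one event using the server-style JSON envelope."""
    now = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
    entry = {
        "t": {"$date": now},
        "s": severity,
        "c": "MONGUARD",
        "id": log_id,
        "ctx": ctx,
        "msg": msg,
        "attr": attr,
    }
    return json.dumps(entry)


def missing_envelope_keys(line: str) -> list:
    """List the envelope keys that a JSON log line does not contain."""
    entry = json.loads(line)
    return [key for key in ENVELOPE_KEYS if key not in entry]


line = make_log_line(9000001, "Fault detected",  # hypothetical id
                     {"fault_name": "example_fault", "shouldIngest": True})
problems = missing_envelope_keys(line)
```

A check like `missing_envelope_keys` could run in unit tests for every ingestible event, so that a malformed line is caught before a rule silently fails to match it.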
- Write and publish a Monguard log format spec
  - Create a short “Monguard Logs Proposal” style doc (like mongotune’s) that lists:
    - Each event type, its id, message, and full attr schema.
    - Which ones are shouldIngest and which are shouldAlert by default.
  - This is explicitly called out as a dependency in the mongotune scope (“finalize log format specification with stable log IDs and consistent JSON structure”).
    https://docs.google.com/document/d/1ro2k_pUFco5gO_eQ568mmgHeEBmXq8Uz4_ksm5dglB4 https://docs.google.com/document/d/15hR4w8OX9U3gqjOKm2zhwgkkGp2t4CPtcWfJcIw5SEQ
- Identify the initial set of log ingestion rules you want
  - Start with a small, high-value set (e.g. all faults with shouldIngest=true, maybe one or two “monguard crash”/“unrecoverable error” rules).
  - For each rule, define:
    - Query using only equality predicates on id, c: "MONGUARD", and/or attr.shouldIngest, attr.shouldAlert, attr.fault_name, etc.
      https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion https://docs.google.com/document/d/1h_EKOO8C9eVwnI11bdT4_TjrhnjIV_gKG-XJjeRYAUQ
    - Projection limited to non-PII fields needed downstream (e.g. id, msg, attr.fault_name, attr.source_ip).
      https://docs.google.com/document/d/1skIr4-DlMUSScruQhM7m0HTrFj9HfEhZK1etK7EkIaw https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion
    - Rate limit (resultsPerHour) so you don’t blow up the per-rule cap; mongotune rules typically used values like 100–1000/hr depending on expected volume.
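To make the query and projection parts concrete, here is a small local simulation of an equality-only rule. It is a sketch under assumptions: the rule dictionary layout, the helper names, and the sample entry are hypothetical and are not the real Log Ingestion Rule schema or the spooler's matching code. It can still be useful for sanity-checking a draft rule against sample log lines before filing the request form.

```python
_MISSING = object()  # sentinel for "path not present"


def get_path(doc: dict, dotted: str):
    """Look up a dotted path such as 'attr.fault_name' in a nested dict."""
    cur = doc
    for part in dotted.split("."):
        if not isinstance(cur, dict) or part not in cur:
            return _MISSING
        cur = cur[part]
    return cur


def matches(entry: dict, query: dict) -> bool:
    """Equality-only matching: every queried path must equal its value."""
    return all(get_path(entry, path) == want for path, want in query.items())


def project(entry: dict, fields: list) -> dict:
    """Keep only the listed dotted paths, rebuilding the nested shape."""
    out = {}
    for path in fields:
        value = get_path(entry, path)
        if value is _MISSING:
            continue
        parts = path.split(".")
        cur = out
        for part in parts[:-1]:
            cur = cur.setdefault(part, {})
        cur[parts[-1]] = value
    return out


# Hypothetical draft rule: all MONGUARD lines flagged for ingestion.
rule = {
    "query": {"c": "MONGUARD", "attr.shouldIngest": True},
    "projection": ["id", "msg", "attr.fault_name"],
    "resultsPerHour": 100,
}

# Hypothetical log entry; username stands in for a field to keep out.
entry = {
    "c": "MONGUARD", "id": 9000001, "msg": "Fault detected",
    "attr": {"fault_name": "example_fault", "username": "alice",
             "shouldIngest": True},
}

projected = project(entry, rule["projection"]) if matches(entry, rule["query"]) else None
```

With this entry, `projected` keeps id, msg, and attr.fault_name and drops attr.username, which is the behavior the projection bullet above asks for.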
- File Log Ingestion Rule Request Forms + HELP tickets
  - For each rule:
    - Fill out a copy of the Log Ingestion Rule Request Form with query, projection, examples, rate limit, and useful life.
      https://docs.google.com/document/d/1skIr4-DlMUSScruQhM7m0HTrFj9HfEhZK1etK7EkIaw https://docs.google.com/document/d/1h_EKOO8C9eVwnI11bdT4_TjrhnjIV_gKG-XJjeRYAUQ
    - File an Atlas Help ticket “Atlas Log Ingestion Rule Request: [Rule Name]” with a link to the form; this is exactly how mongotune got its rules (e.g. HELP-89709, HELP-76368).
  - Ask ACAD/Fleet Rollout to:
    - Create rules in DEV, QA, and PROD consistently (same pattern as “Mongotune observability (dry run)/(wet run)” rules).
- Test end-to-end in lower environments
  - In DEV/QA:
    - Trigger Monguard events that should be ingested (e.g. synthetic faults; config errors).
    - Query nds.logIngestion.logs / the DW view by ruleId to confirm the projection is correct and CSI/PII is stripped.
      https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion
    - Validate:
      - No mis-matches (rules don’t over-capture).
      - Rate limits are not saturating under realistic load (using log ingestion metrics and AC Log Ingestion Playbook guidance).
        https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion https://wiki.corp.mongodb.com/spaces/MMS/pages/385853446/AC+Log+Ingestion+Playbook
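One way to check that a projection strips everything it should is to compare the fields of an ingested entry against the rule's allow-list. The sketch below assumes the ingested entry has already been fetched (for example from nds.logIngestion.logs) and is available as a plain dictionary; the function names and the sample data are hypothetical.

```python
def leaf_paths(doc: dict, prefix: str = "") -> set:
    """Collect dotted paths of all non-dict values in a nested dict."""
    paths = set()
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            paths |= leaf_paths(value, path + ".")
        else:
            paths.add(path)
    return paths


def unexpected_fields(ingested: dict, allowed: list) -> list:
    """Return fields present in the ingested entry but not in the allow-list."""
    return sorted(leaf_paths(ingested) - set(allowed))


# Hypothetical ingested entry that leaked a username past the projection.
ingested = {"id": 9000001, "msg": "Fault detected",
            "attr": {"fault_name": "example_fault", "username": "alice"}}
leaks = unexpected_fields(ingested, ["id", "msg", "attr.fault_name"])
```

An empty result means the entry contains only projected fields; any path in the result (here `attr.username`) points to a projection that needs tightening before the rule goes to PROD.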
- Integrate with AFM and downstream alerting
  - Coordinate with AFM owners to:
    - Add the new rule IDs into AFM’s monitoring config so AF tickets are generated from Monguard events, the same way they are for mongotune crash rules.
      https://docs.google.com/document/d/1Jq2hw-m1KE_iSzfcq8do87m6IT1PCEnf1E32n4cksZk
  - On the Atlas side, if there are customer-visible alerts:
    - Define alert types and CAP events, like the mongotune epic did (one shared alert type for mongotune actions).
      https://docs.google.com/document/d/1ro2k_pUFco5gO_eQ568mmgHeEBmXq8Uz4_ksm5dglB4
- Capacity / safety work
  - With ACAD:
    - Confirm log volume from Monguard rules is within the “low-volume, high-value” expectation for log ingestion.
      https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion https://docs.google.com/document/d/15hR4w8OX9U3gqjOKm2zhwgkkGp2t4CPtcWfJcIw5SEQ
    - Set sane rate limits; create Grafana alerts on 50/90/100% of rate-limit usage for your rules (pattern from the log ingestion playbook).
      https://wiki.corp.mongodb.com/spaces/MMS/pages/385853446/AC+Log+Ingestion+Playbook https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion
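The 50/90/100% alerting thresholds amount to simple arithmetic on a rule's hourly match count against its resultsPerHour. The sketch below only illustrates that arithmetic; it is not a Grafana alert definition, and the function name and inputs are hypothetical.

```python
# Percent-of-limit thresholds named in this ticket.
THRESHOLDS = (50, 90, 100)


def crossed_thresholds(matched_this_hour: int, results_per_hour: int) -> list:
    """Return the usage thresholds (in percent) that the rule has reached."""
    if results_per_hour <= 0:
        raise ValueError("results_per_hour must be positive")
    # Integer comparison avoids floating-point edge cases at exact boundaries.
    return [t for t in THRESHOLDS
            if matched_this_hour * 100 >= t * results_per_hour]


# Example: a rule capped at 100 results/hour that matched 95 lines this hour.
levels = crossed_thresholds(95, 100)
```

For the example, `levels` is `[50, 90]`: the rule has passed the warning thresholds but is not yet saturated. A rule that regularly reaches 100 is either mis-scoped or needs a higher resultsPerHour.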
- Runbook & docs
  - Extend the Monguard runbook / observability docs to cover:
    - Which Monguard events are captured via Atlas Log Ingestion vs via Fluentbit→S3.
      https://docs.google.com/document/d/1MFuf69FT-HRIl84wYGPjP063diCZfxot__XbftjN6n0 https://docs.google.com/document/d/1sejZy15R7aMs9D070g0FamdE0NKSSey4U4dXofAIQD8
    - How to query ingested entries (Trino / Data Lake / Private API) and interpret them in incidents.
      https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion https://docs.google.com/document/d/1Jq2hw-m1KE_iSzfcq8do87m6IT1PCEnf1E32n4cksZk