
Type: Task
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
None

Assigned Teams:

Monguard
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Atlas MMS uses certain logs from the Monguard process to provide alerting and monitoring capabilities. They feed AFM, analytics, DW queries, and alerts based on specific patterns across Atlas (e.g., “all Fatal Assertion X across M70+ in us-east-1”).

These logs are monitored by the Agent-embedded Filebeat spooler, which applies a rule’s query and projection locally and sends only matching lines to MMS via dedicated log ingestion APIs. MMS persists matches in nds.logIngestion.logs (TTL 7 days), exports them nightly to DW, and exposes them via a Private API (keyset-paginated).

Atlas Log Ingestion

Log ingestion is centered on Log Ingestion Rules: each rule defines query, projection, resultsPerHour, logType, sourceClusters constraints, etc.
https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion
Rules are created by the Atlas Clusters team via HELP tickets; requesters fill out the “Log Ingestion Rule Request Form”.

Tasks

Based on the mongotune pattern (scope + design + rule forms + AFM integration), these are the concrete steps Monguard should own or co-own:

Define a Monguard log schema for ingestible events

- Enumerate the event types you want AFM / alerts / activity feed to see (at minimum: faults, serious proxy failures, maybe rollout transitions).
  https://docs.google.com/document/d/1MFuf69FT-HRIl84wYGPjP063diCZfxot__XbftjN6n0
- For each, pick a stable numeric id and define the attr payload (e.g., fault_name, source_ip, username, policy name). This is exactly what mongotune did before asking for rules.
  https://docs.google.com/document/d/1ro2k_pUFco5gO_eQ568mmgHeEBmXq8Uz4_ksm5dglB4

Add ingestion/alert flags to Monguard logs (mirroring mongotune)

- For candidate events, emit:
  - attr.shouldIngest: true — “this line should go through log ingestion”.
  - attr.shouldAlert: true|false — whether it should eventually raise an alert vs just an informational event.
  - attr.isDryRun: true|false — if you ever have dry-run behavior you don’t want exposed.
    https://docs.google.com/document/d/1bivkuEv4Dx71SNA4lkRy2hjQIY3lbtY49bW1NMFebM4 https://docs.google.com/document/d/1ro2k_pUFco5gO_eQ568mmgHeEBmXq8Uz4_ksm5dglB4
- Default semantics: if the field is missing, treat it as false (same as mongotune).
  https://docs.google.com/document/d/1bivkuEv4Dx71SNA4lkRy2hjQIY3lbtY49bW1NMFebM4
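
The default semantics above can be sketched as follows. This is only an illustration of "missing means false", not Monguard or mongotune code; the helper names `flag` and `resolved_flags` are hypothetical, and treating non-boolean values (such as the string "true") as false is an assumption.

```python
FLAG_NAMES = ("shouldIngest", "shouldAlert", "isDryRun")


def flag(event, name):
    """Return True only if attr.<name> is the boolean true.

    A missing attr object, a missing field, or a non-boolean value
    all resolve to False (the last case is an assumption).
    """
    attr = event.get("attr")
    if not isinstance(attr, dict):
        return False
    return attr.get(name) is True


def resolved_flags(event):
    """Resolve all three flags for one parsed log event."""
    return {name: flag(event, name) for name in FLAG_NAMES}


# Example: only shouldIngest is set, so the other two default to False.
example = {"attr": {"fault_name": "example_fault", "shouldIngest": True}}
print(resolved_flags(example))
```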

Ensure all Monguard logs used for ingestion follow server JSON structure

- Keep the standard top-level envelope (t/s/c/id/ctx/msg/attr) for all ingestible Monguard events, as mongotune and mongod do.
  https://docs.google.com/document/d/1bivkuEv4Dx71SNA4lkRy2hjQIY3lbtY49bW1NMFebM4 https://docs.google.com/document/d/1skIr4-DlMUSScruQhM7m0HTrFj9HfEhZK1etK7EkIaw
- Confirm the log file path Monguard writes to (e.g. /srv/mongodb/monguard/monguard.log) is the one the agent’s log-ingestion Filebeat spooler will tail, not just Fluentbit.
  https://docs.google.com/document/d/1sejZy15R7aMs9D070g0FamdE0NKSSey4U4dXofAIQD8 https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion
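
A quick way to check the envelope requirement in tests is a small validator like the sketch below. It only checks that the seven top-level keys are present and that id is an integer and attr is an object; those two type checks, the function name `envelope_problems`, and the sample event (id 9100001, "example_fault") are assumptions for illustration, not taken from the Monguard or server specs.

```python
import json

ENVELOPE_KEYS = ("t", "s", "c", "id", "ctx", "msg", "attr")


def envelope_problems(line):
    """Return a list of problems found in one JSON log line (empty if none)."""
    try:
        doc = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(doc, dict):
        return ["top-level value is not an object"]

    problems = [f"missing key: {key}" for key in ENVELOPE_KEYS if key not in doc]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if "id" in doc and (not isinstance(doc["id"], int) or isinstance(doc["id"], bool)):
        problems.append("id is not an integer")
    if "attr" in doc and not isinstance(doc["attr"], dict):
        problems.append("attr is not an object")
    return problems


# Hypothetical Monguard line used only to exercise the validator.
good_line = json.dumps({
    "t": {"$date": "2024-01-01T00:00:00.000+00:00"},
    "s": "E",
    "c": "MONGUARD",
    "id": 9100001,
    "ctx": "main",
    "msg": "Fault triggered",
    "attr": {"fault_name": "example_fault", "shouldIngest": True},
})
print(envelope_problems(good_line))
```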

Write and publish a Monguard log format spec

- Create a short “Monguard Logs Proposal” style doc (like mongotune’s) that lists:
  - Each event type, its id, message, and full attr schema.
  - Which ones are shouldIngest and which are shouldAlert by default.
- This is explicitly called out as a dependency in the mongotune scope (“finalize log format specification with stable log IDs and consistent JSON structure”).
  https://docs.google.com/document/d/1ro2k_pUFco5gO_eQ568mmgHeEBmXq8Uz4_ksm5dglB4 https://docs.google.com/document/d/15hR4w8OX9U3gqjOKm2zhwgkkGp2t4CPtcWfJcIw5SEQ

Identify the initial set of log ingestion rules you want

- Start with a small, high-value set (e.g. all faults with shouldIngest=true, maybe one or two “monguard crash”/“unrecoverable error” rules).

- For each rule, define:
  - Query using only equality predicates on id, c: "MONGUARD", and/or attr.shouldIngest, attr.shouldAlert, attr.fault_name, etc.
    https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion https://docs.google.com/document/d/1h_EKOO8C9eVwnI11bdT4_TjrhnjIV_gKG-XJjeRYAUQ
  - Projection limited to non-PII fields needed downstream (e.g. id, msg, attr.fault_name, attr.source_ip).
    https://docs.google.com/document/d/1skIr4-DlMUSScruQhM7m0HTrFj9HfEhZK1etK7EkIaw https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion
  - Rate limit (resultsPerHour) so you don’t blow up the per-rule cap; mongotune rules typically used values like 100–1000/hr depending on expected volume.
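
To sanity-check a draft rule before filing it, the query and projection can be simulated locally against sample lines. The sketch below is not the agent's matcher: it assumes queries are flat maps of dotted paths to expected values (equality only) and that a projection is a list of dotted paths to keep. The rule contents, the id 9100001, and the helper names are hypothetical.

```python
_MISSING = object()  # sentinel so that a missing path never equals a query value


def get_path(doc, path):
    """Follow a dotted path such as 'attr.fault_name' through nested dicts."""
    cur = doc
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return _MISSING
        cur = cur[key]
    return cur


def matches(doc, query):
    """True if every dotted path in the query equals the expected value."""
    return all(get_path(doc, path) == value for path, value in query.items())


def project(doc, fields):
    """Keep only the listed dotted paths, preserving nesting."""
    out = {}
    for path in fields:
        value = get_path(doc, path)
        if value is _MISSING:
            continue
        keys = path.split(".")
        cur = out
        for key in keys[:-1]:
            cur = cur.setdefault(key, {})
        cur[keys[-1]] = value
    return out


# Hypothetical draft rule using the field names from the rule definition.
rule = {
    "query": {"c": "MONGUARD", "attr.shouldIngest": True},
    "projection": ["id", "msg", "attr.fault_name", "attr.source_ip"],
    "resultsPerHour": 100,
}

sample = {
    "c": "MONGUARD", "id": 9100001, "msg": "Fault triggered",
    "attr": {"fault_name": "example_fault", "source_ip": "10.0.0.1",
             "username": "alice", "shouldIngest": True},
}

if matches(sample, rule["query"]):
    print(project(sample, rule["projection"]))
```

Running sample lines through a check like this makes it easy to see which fields (here, attr.username) would be dropped by the projection before the rule reaches review.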

File Log Ingestion Rule Request Forms + HELP tickets

- For each rule:
  - Fill out a copy of the Log Ingestion Rule Request Form with query, projection, examples, rate limit, and useful life.
    https://docs.google.com/document/d/1skIr4-DlMUSScruQhM7m0HTrFj9HfEhZK1etK7EkIaw https://docs.google.com/document/d/1h_EKOO8C9eVwnI11bdT4_TjrhnjIV_gKG-XJjeRYAUQ
  - File an Atlas Help ticket “Atlas Log Ingestion Rule Request: [Rule Name]” with a link to the form; this is exactly how mongotune got its rules (e.g. HELP-89709, HELP-76368).

- Ask ACAD/Fleet Rollout to:
  - Create rules in DEV, QA, and PROD consistently (same pattern as “Mongotune observability (dry run)/(wet run)” rules).

Test end-to-end in lower environments

- In DEV/QA:
  - Trigger Monguard events that should be ingested (e.g. synthetic faults; config errors).
  - Query nds.logIngestion.logs / the DW view by ruleId to confirm the projection is correct and CSI/PII is stripped.
    https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion
- Validate:
  - No mismatches (rules don’t over-capture).
  - Rate limits are not saturating under realistic load (using log ingestion metrics and AC Log Ingestion Playbook guidance).
    https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion https://wiki.corp.mongodb.com/spaces/MMS/pages/385853446/AC+Log+Ingestion+Playbook

Integrate with AFM and downstream alerting

- Coordinate with AFM owners to:
  - Add the new rule IDs into AFM’s monitoring config so AF tickets are generated from Monguard events, same way they do for mongotune crash rules.
    https://docs.google.com/document/d/1Jq2hw-m1KE_iSzfcq8do87m6IT1PCEnf1E32n4cksZk
- On the Atlas side, if there are customer-visible alerts:
  - Define alert types and CAP events, like the mongotune epic did (one shared alert type for mongotune actions).
    https://docs.google.com/document/d/1ro2k_pUFco5gO_eQ568mmgHeEBmXq8Uz4_ksm5dglB4

Capacity / safety work

- With ACAD:
  - Confirm log volume from Monguard rules is within the “low-volume, high-value” expectation for log ingestion.
    https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion https://docs.google.com/document/d/15hR4w8OX9U3gqjOKm2zhwgkkGp2t4CPtcWfJcIw5SEQ
  - Set sane rate limits; create Grafana alerts on 50/90/100% of rate-limit usage for your rules (pattern from the log ingestion playbook).
    https://wiki.corp.mongodb.com/spaces/MMS/pages/385853446/AC+Log+Ingestion+Playbook https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion
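
The 50/90/100% thresholds amount to simple arithmetic on a rule's hourly match count versus its resultsPerHour. The sketch below shows that calculation only; the real alerts would be defined in Grafana against the log ingestion metrics, and the function name and inputs here are hypothetical.

```python
THRESHOLDS = (50, 90, 100)  # percent of resultsPerHour, in ascending order


def crossed_threshold(matches_this_hour, results_per_hour):
    """Return the highest threshold (in percent) reached, or None if below 50%."""
    if results_per_hour <= 0:
        raise ValueError("results_per_hour must be positive")
    used_pct = 100 * matches_this_hour / results_per_hour
    crossed = [t for t in THRESHOLDS if used_pct >= t]
    return crossed[-1] if crossed else None


# Example: 950 matches against a 1000/hr limit is past the 90% threshold.
print(crossed_threshold(950, 1000))
```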

Runbook & docs

- Extend the Monguard runbook / observability docs to cover:
  - Which Monguard events are captured via Atlas Log Ingestion vs via Fluentbit→S3.
    https://docs.google.com/document/d/1MFuf69FT-HRIl84wYGPjP063diCZfxot__XbftjN6n0 https://docs.google.com/document/d/1sejZy15R7aMs9D070g0FamdE0NKSSey4U4dXofAIQD8
  - How to query ingested entries (Trino / Data Lake / Private API) and interpret them in incidents.
    https://wiki.corp.mongodb.com/spaces/MMS/pages/499651141/Atlas+Log+Ingestion https://docs.google.com/document/d/1Jq2hw-m1KE_iSzfcq8do87m6IT1PCEnf1E32n4cksZk
