Investigate changes in SPM-4498: Extend ChangeStream Metrics


    • Type: Investigation
    • Resolution: Declined
    • Priority: Major - P3
    • No version
    • Affects Version/s: None
    • Component/s: None
    • None
    • Not Needed
    • Developer Tools

      Original Downstream Change Summary

      This project makes the following changes:

      • adds new metrics to the serverStatus command response
      • extends the currentOp command response
      • adds a new option to the aggregate and getMore commands
      • extends the aggregate and getMore command responses (opt-in, only when the new option is used)
      • extends query stats

        Description of Linked Ticket

        Epic Summary

        Summary

      This project will provide additional metrics for Change Streams, giving our customers and support engineers the observability they need to troubleshoot change stream issues, monitor resource usage, and predict possible errors and outages. Current change stream observability is very limited and has been the subject of numerous discussions during support ticket processing and on the #change-streams channel. The necessary metrics will be added to the serverStatus and currentOp command outputs and possibly to the aggregate and getMore command outputs. We'll also evaluate exposing them through OpenTelemetry.

      This is a critical correctness and enterprise-readiness initiative required to support high-value internal customers (Atlas Search, Atlas Stream Processing) and external enterprise users, as mentioned in the approved project idea document.

      Motivation

      Many enterprise customers depend on the Change Streams feature for real-time change event processing. However, the current lack of observability creates significant business risks:

      • Strategic Alignment: Enabling AI Transformation (Atlas Search): Atlas Search is the cornerstone of our AI strategy (powering Vector Search and RAG applications). Atlas Search relies on change streams to keep the search and vector indexes in sync with MongoDB's data set. To stay competitive, search must ingest data at massive scale, yet today it is difficult and labor-intensive to investigate why Atlas Search indexing is not performing as expected.
      • High Support Costs & TTR: When a change stream lags or disconnects, our technical support team has no immediate way to determine whether the cause is a slow client, network latency, or server-side resource contention. This extends Time-To-Resolution (TTR) and wastes engineering hours on deep-dive debugging.
      • Capacity Planning Guesswork: Customers (like Atlas Search) cannot effectively scale their usage because they cannot measure the resource cost (CPU/IO) of opening additional streams. This limits their adoption of the feature.
      • Silent Failures: Critical applications (like Atlas Stream Processing) risk 'falling off the oplog' (becoming unable to catch up) without any warning. Currently, there is no metric that can alert a user before this catastrophic data-loss state occurs.

      Please refer to the supporting customer cases in the linked project idea document and the related feature request in Aha!.

      Documentation

      Project Proposal

      Docs Update
      Syntax
      Technical Design
      Scope

            Assignee:
            Unassigned
            Reporter:
            Backlog - Core Eng Program Management Team
            Votes:
            0
            Watchers:
            3

              Created:
              Updated:
              Resolved: