Investigate and implement non-retryable error handling


    • Type: Story
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Developer Tools, Compass

      Use Case

      As a Data Explorer Engineer
      I want the connection attempt to fail fast when encountering non-retryable errors during the initial connection process
      So that I do not spend resources on automatic reconnect attempts by the client

      User Experience

      • When a connection encounters a non-retryable error (like "Cluster is not in a valid state" with close code 1008) during the initial connection attempt, the connection should fail immediately and close the MongoClient to prevent the driver's continuous monitoring and retry attempts.

      Risks/Unknowns

      • The non-retryable error handling added in compass-web may have stopped working or never fully covered the initial connection phase
      • The heartbeat failed listener isn't set up until after successful connection, so errors during initial connection aren't caught
      • Need to ensure we don't break other connection scenarios while implementing fail-fast behavior
      • We are going to hard-code a list of close codes in the client, and that list must stay in sync with the server implementation
      • Performance impact should be negligible
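      The hard-coded list of close codes mentioned above could live in a single constant so that the dependency on the server implementation is easy to audit. The following is a minimal TypeScript sketch. It assumes the TLS polyfill's error carries the WebSocket close code as a numeric `code` property, which this ticket does not confirm. Only 1008 (the "Policy Violation" status code defined in RFC 6455) comes from this ticket. `NON_RETRYABLE_CLOSE_CODES` and `isNonRetryableError` are hypothetical names.

```typescript
// Hypothetical: close codes the server uses to signal that retrying cannot help.
// Only 1008 ("Policy Violation" in RFC 6455) is named in the ticket; any other
// entries must be kept in sync with the server implementation by hand.
const NON_RETRYABLE_CLOSE_CODES: ReadonlySet<number> = new Set([1008]);

// Assumption: the TLS polyfill rejects with an error that exposes the WebSocket
// close code as a numeric `code` property. Adjust if the real error shape differs.
function isNonRetryableError(err: unknown): boolean {
  if (typeof err !== 'object' || err === null || !('code' in err)) {
    return false;
  }
  const code = (err as { code?: unknown }).code;
  return typeof code === 'number' && NON_RETRYABLE_CLOSE_CODES.has(code);
}
```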

      Acceptance Criteria

      Implementation Requirements

      • Add error handling for non-retryable errors (close code 1008 and others) during the initial connection process, and close the MongoClient when one occurs
        • Follow the patterns in the devtools-shared connect package for failing fast during the connection process
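      A minimal sketch of the fail-fast pattern, written against a hypothetical `FailFastClient` interface so it runs without the driver. The interface mirrors the parts of the Node.js driver's `MongoClient` the sketch needs: `connect()`, `close()`, and the `serverHeartbeatFailed` event, whose payload exposes the underlying error as `failure`. It assumes, rather than confirms, that the polyfill's close-code 1008 error reaches that event during the initial connection. The key difference from the behavior described under Risks/Unknowns is that the listener is registered before `connect()` is called. `connectWithFailFast` and the `isNonRetryable` predicate are hypothetical names, not part of devtools-connect.

```typescript
// Hypothetical minimal shape of the MongoClient surface this sketch relies on.
interface HeartbeatFailedEvent {
  failure: unknown;
}

type HeartbeatFailedListener = (ev: HeartbeatFailedEvent) => void;

interface FailFastClient {
  connect(): Promise<unknown>;
  close(): Promise<void>;
  on(event: 'serverHeartbeatFailed', listener: HeartbeatFailedListener): unknown;
  off(event: 'serverHeartbeatFailed', listener: HeartbeatFailedListener): unknown;
}

// Hypothetical helper: resolves when connect() resolves, but rejects as soon as a
// heartbeat failure matches `isNonRetryable`. On any failure it closes the client
// so the driver stops monitoring and retrying in the background.
async function connectWithFailFast(
  client: FailFastClient,
  isNonRetryable: (err: unknown) => boolean
): Promise<void> {
  let rejectFailFast: (err: unknown) => void = () => undefined;
  const failFast = new Promise<never>((_resolve, reject) => {
    rejectFailFast = reject;
  });

  const onHeartbeatFailed: HeartbeatFailedListener = (ev) => {
    if (isNonRetryable(ev.failure)) rejectFailFast(ev.failure);
  };

  // Register before connect() so failures during the initial connection are seen.
  client.on('serverHeartbeatFailed', onHeartbeatFailed);
  try {
    await Promise.race([client.connect(), failFast]);
  } catch (err) {
    // Ignore errors from close() so the original connection error is what surfaces.
    await client.close().catch(() => undefined);
    throw err;
  } finally {
    client.off('serverHeartbeatFailed', onHeartbeatFailed);
  }
}
```

      In this sketch, racing `connect()` against the listener leaves the normal path unchanged: if no non-retryable failure arrives, the helper behaves like a plain `connect()`, which is one way to address the risk of breaking other connection scenarios. Closing the client on every failure is a design choice of the sketch; a real implementation may prefer to close only on the fail-fast path.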

      Testing Requirements

      • Add an integration or unit test to devtools-connect that mocks the error we expect the TLS polyfill to throw in the scenarios this ticket targets
      • Add a new synthetic test that covers the end-to-end error toast behavior we expect from this improvement. The error should appear well before 30 seconds (serverSelectionTimeoutMS)

      References

        1. Screenshot 2025-09-03 at 10.33.08 AM.png (Simon Zhu)
        2. Screenshot 2025-09-03 at 10.36.41 AM.png (Simon Zhu)

            Assignee: Neal Beeken
            Reporter: Simon Zhu
            Votes: 0
            Watchers: 5
