Investigate and fix non-retryable error handling


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical - P2
    • Affects Version/s: None
    • Component/s: None
    • Environment:
      OS:
      node.js / npm versions:
      Additional info:
    • Developer Tools, Compass

      The non-retryable error handling we added for compass-web (https://github.com/mongodb-js/compass/blob/a9ea81e89378d6205ae664f858a23c7135fe3a46/packages/compass-connections/src/stores/connections-store-redux.ts#L1410-L1419) seems to have stopped working.

      Look at this Sentry trace: https://mongodb-org.sentry.io/dashboards/trace/d240e2adba4a499bb67791de3fad6629/?dashboardId=142094&environment=prod&fov=17642.443549603224%2C60.02953439950943&node=span-3627092fbd124f13&project=4505240668733440&project=4509401391628288&source=dashboards&statsPeriod=3d&timestamp=1756854234&widgetId=1150500

      The connection attempt times out after 30 seconds, yet every websocket that gets opened is closed with a "Cluster is not in a valid state" error (see screenshots).

      This maps to close code 1008 (Policy Violation), which we also see in the trace (again, see screenshot).

      This seems to be one of the close codes that we handle in the compass code, so it should cause the Data Service to disconnect. Instead, the driver keeps retrying, so something is wrong.
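The expected behaviour can be sketched as a simple classification of websocket close codes. This is a minimal illustration, not the code from connections-store-redux.ts: the names `NON_RETRYABLE_CLOSE_CODES`, `CloseInfo` and `isNonRetryableClose` are hypothetical, and the set contains only 1008 because that is the only code this ticket mentions.

```typescript
// Hypothetical set of close codes that should stop the retry loop.
// Only 1008 (Policy Violation) is taken from this ticket; the set that
// compass actually handles may contain other codes.
const NON_RETRYABLE_CLOSE_CODES: ReadonlySet<number> = new Set([1008]);

// Minimal shape of the information carried by a websocket close event.
interface CloseInfo {
  code: number;
  reason: string;
}

// Returns true when the close should lead to a disconnect instead of
// another retry by the driver.
function isNonRetryableClose(info: CloseInfo): boolean {
  return NON_RETRYABLE_CLOSE_CODES.has(info.code);
}
```

With this classification, the close seen in the trace (`code: 1008`, reason "Cluster is not in a valid state") would be treated as non-retryable, while an abnormal closure such as 1006 would still be retried.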

      We don't have e2e tests for this code path; let's add them as part of this ticket.

      The connection in the stack trace never successfully connected, so our listener for the heartbeat failed event isn't set up yet. As a result, this error isn't caught and we don't disconnect. We should account for this error while we're fetching the instance information.
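One possible shape for the fix is to register the failure listener before the connection attempt starts and race it against the attempt, so a non-retryable error aborts the connect instead of waiting for the 30 second timeout. The sketch below is an assumption-laden illustration: `HeartbeatSource`, the `"heartbeatFailed"` event name and `connectWithEarlyAbort` are hypothetical stand-ins, not the driver's or compass's actual API.

```typescript
type FailureError = Error & { code?: number };
type HeartbeatFailure = { failure: FailureError };
type Listener = (ev: HeartbeatFailure) => void;

// Hypothetical stand-in for whatever emits heartbeat failures.
interface HeartbeatSource {
  on(event: "heartbeatFailed", listener: Listener): void;
  off(event: "heartbeatFailed", listener: Listener): void;
}

// Runs connect() but rejects early if a non-retryable heartbeat failure
// is observed while the attempt is still in flight.
async function connectWithEarlyAbort<T>(
  source: HeartbeatSource,
  connect: () => Promise<T>,
  isNonRetryable: (err: FailureError) => boolean
): Promise<T> {
  let rejectAborted: (err: Error) => void = () => {};
  const aborted = new Promise<never>((_resolve, reject) => {
    rejectAborted = reject;
  });
  const listener: Listener = (ev) => {
    if (isNonRetryable(ev.failure)) {
      rejectAborted(ev.failure);
    }
  };
  // Listen before connecting so failures during the first attempt are seen.
  source.on("heartbeatFailed", listener);
  try {
    return await Promise.race([connect(), aborted]);
  } finally {
    // Always detach, whether the attempt succeeded, failed, or was aborted.
    source.off("heartbeatFailed", listener);
  }
}
```

The caller would still need to tear down the underlying connection attempt on rejection; the sketch only shows how the error can surface before the first successful connect.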

      The listener was added in https://github.com/mongodb-js/compass/pull/6598/files#diff-018a2f2c6c0aff913162a6568095ab43fb5d67249a66a9f30ae26c4ea61d3daaR1729

        1. Screenshot 2025-09-03 at 10.33.08 AM.png
          708 kB
          Simon Zhu
        2. Screenshot 2025-09-03 at 10.36.41 AM.png
          275 kB
          Simon Zhu

            Assignee:
            Jack Weir
            Reporter:
            Simon Zhu
            Votes:
            0
            Watchers:
            3