  1. Core Server
  2. SERVER-100191

Fix recent bad kafka error messages

    • Atlas Streams
    • Fully Compatible
    • 3

      We still see quite a few production Kafka error messages that do not carry enough information to diagnose the failure. They typically look like this:

      • $emit to Kafka topic encountered error while connecting: Local: Broker transport failure (-195)
      • KafkaConsumerOperator failed to get committed offsets: Local: Timed out (-185)

      We should investigate these errors and determine how to include more detail in the error response returned to customers.
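
      One possible direction, sketched below, is to translate the trailing numeric code into librdkafka's symbolic error name and attach a short explanation. This is only an illustration: `describe_kafka_error` and the explanation strings are hypothetical, not existing server code, and only the two codes quoted above are covered.

```python
import re

# Symbolic librdkafka names for the two local error codes quoted in this ticket.
# The explanation strings are illustrative wording, not existing server messages.
KNOWN_CODES = {
    -195: ("RD_KAFKA_RESP_ERR__TRANSPORT",
           "the connection to the broker failed or was closed"),
    -185: ("RD_KAFKA_RESP_ERR__TIMED_OUT",
           "the operation did not complete within its timeout"),
}

# Matches a trailing "(-NNN)" code, allowing trailing whitespace.
_CODE_RE = re.compile(r"\((-\d+)\)\s*$")


def describe_kafka_error(message: str) -> str:
    """Append the symbolic name and an explanation if the trailing code is known.

    Messages without a recognized trailing code are returned unchanged.
    """
    match = _CODE_RE.search(message)
    if match is None:
        return message
    code = int(match.group(1))
    if code not in KNOWN_CODES:
        return message
    name, explanation = KNOWN_CODES[code]
    return f"{message.rstrip()} [{name}: {explanation}]"


if __name__ == "__main__":
    print(describe_kafka_error(
        "KafkaConsumerOperator failed to get committed offsets: Local: Timed out (-185)"
    ))
```

      A lookup table keeps the original message intact and only adds context, so existing log searches on the message text would keep working.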

      Some of these are likely caused by auth issues: Ryan was able to reproduce the error by setting SASL_PLAINTEXT instead of SASL_SSL. In this ticket, let's focus on this auth scenario and make it produce a good error message.
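
      As a rough sketch of what a better message for this scenario could look like, the snippet below appends a hint when a broker transport failure (-195) occurs on a connection configured with SASL_PLAINTEXT. The function name `add_auth_hint`, the hint wording, and the assumption that the configured `security.protocol` value is available where the error is reported are all hypothetical.

```python
TRANSPORT_FAILURE = -195  # librdkafka RD_KAFKA_RESP_ERR__TRANSPORT


def add_auth_hint(message: str, error_code: int, security_protocol: str) -> str:
    """Append a hint when a transport failure coincides with SASL_PLAINTEXT.

    Hypothetical helper: only the scenario described in this ticket is handled,
    and the hint is phrased as a possibility because a transport failure can
    have other causes (network, DNS, broker down).
    """
    if error_code != TRANSPORT_FAILURE:
        return message
    if security_protocol.strip().upper() != "SASL_PLAINTEXT":
        return message
    return (
        f"{message}. The connection uses security.protocol=SASL_PLAINTEXT; "
        "if the broker requires TLS, set security.protocol to SASL_SSL"
    )


if __name__ == "__main__":
    print(add_auth_hint(
        "Could not connect to the Kafka topic: Local: Broker transport failure (-195)",
        -195,
        "SASL_PLAINTEXT",
    ))
```

      The hint is deliberately conditional, since the client cannot tell from a transport failure alone that the protocol setting is the cause.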

      Examples seen in April 2025:

      • Could not connect to the Kafka topic: Local: Broker transport failure (-195), [VPC Peering: false] processStreamProcessor 419 false 6675d26c7c88b838d5c71dfe 67a4c549a0d7753e3b467fb5 67de8877620fc93a8b504f56
      • 08:24:48.536 Could not connect to the Kafka topic: Local: Broker transport failure (-195), [VPC Peering: false] startStreamProcessor 419 false 678db0919f33bf06ef82a24b 67c66113858bd30f92627b51 67e116688547f7e9fe844973

      Other examples:

      2025-03-16 17:08:39.728 KafkaConsumerOperator failed to get committed offsets: Local: Timed out (-185) processStreamProcessor 419 false 66ea7bbc3895ee30fbe765b0 67d062c0bf263a1be3586054 67d705767f507ad321523363 1
      2025-03-16 17:09:09.065 KafkaConsumerOperator failed to get committed offsets: Local: Timed out (-185) processStreamProcessor 419 false 66ea7bbc3895ee30fbe765b0 67d062c0bf263a1be3586054 67d70593e33bbdf533c50602 1

       

      https://splunk.corp.mongodb.com/en-US/app/streams/search?earliest=1737651600&latest=1738258301&q=search%20index%3Dmhouse%20source%3Dstreams-spm%20((NOT%20%2265094f059e0776665611598b%22)%20AND%20(NOT%20%226695222fc5479449e1e8d2be%22))%20%22TransitionToFailed%22%20%0A%60%60%60%0AThe%20below%20clause%20filters%20error%20failures%20during%20create%20validation%2C%20which%20are%20frequent%2C%20noisy%2C%20and%20most%20likely%20the%0Auser%27s%20fault.%0A%60%60%60%0AAND%20streamProcessorID!%3D%22000000000000000000000000%22%0A%7C%20stats%20count%20by%20errorMsg%2C%20errorCode%2C%20isInternalError%2C%20tenantID%0A%7C%20sort%20-isInternalError%2C%20count&display.page.search.mode=smart&dispatch.sample_ratio=1&display.page.search.tab=statistics&display.general.type=statistics&sid=1738258313.1348446

            Assignee: Calvin Nix (calvin.nix@mongodb.com)
            Reporter: Matthew Normyle (matthew.normyle@mongodb.com)