Change Stream stops working after failover



      Problem

      Change Streams stop working after a failover, i.e. when a replica set's primary member becomes unavailable.

      Setup

      1. Node.js 22
      2. npm install mongodb@6.14.2
      3. MongoDB replica set with three members: A (primary), B (secondary), C (secondary)

      Reproduction

      1. Start the attached script with node main.mjs
      2. Observe how change stream events are printed to the console every second
      3. Stop the process of member A (an rs.stepDown() alone does not trigger the error, but it can precede stopping the process)
      4. Almost immediately, change stream events stop printing
      5. The application crashes after 60 seconds with a "MongoServerSelectionError" and "ECONNREFUSED 127.0.0.1:27017" (member A)

       

      import { MongoClient } from 'mongodb';
      
      // Seed list with all three replica set members
      const uri = 'mongodb://127.0.0.1:27017,127.0.0.1:27018,127.0.0.1:27019/test?replicaSet=rs0';
      const client = await MongoClient.connect(uri);
      const testCollection = client.db().collection('Test');
      
      // Insert one document per second so the change stream always has events to deliver
      let iteration = 0;
      setInterval(() => testCollection.insertOne({ i: iteration++ }), 1000);
      
      // Print every change event; this loop goes silent once member A is stopped
      for await (const change of testCollection.watch()) {
        console.log(`${change.operationType}: ${change.fullDocument.i}`);
      }
      

      Expectation

      • In the case of a failover, node-mongodb-native should keep the change stream going without interruption.
      • There should not be a 60-second blackout before the failover is noticed.
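
      The second expectation can be made measurable with a small watchdog that records when the last change event arrived. The following is only a minimal sketch: createStallDetector, its method names, and the 5-second threshold are hypothetical and not part of the driver. A fake clock stands in for real time so the example runs without a database.

```javascript
// Hypothetical helper (not a driver API): tracks how long it has been since the
// last change event was seen. `now` is injectable so the logic can be exercised
// without waiting in real time.
function createStallDetector(thresholdMs, now = Date.now) {
  let lastEventAt = now();
  return {
    // Call this for every change event received.
    markEvent() {
      lastEventAt = now();
    },
    // Milliseconds since the last event (or since creation if none arrived yet).
    silentForMs() {
      return now() - lastEventAt;
    },
    // True once the silence has lasted at least thresholdMs.
    isStalled() {
      return now() - lastEventAt >= thresholdMs;
    },
  };
}

// Usage with a fake clock (assumed threshold: 5 seconds).
let fakeTime = 0;
const detector = createStallDetector(5000, () => fakeTime);
fakeTime = 1000;
detector.markEvent();               // an event arrives at t = 1 s
fakeTime = 4000;
console.log(detector.isStalled());  // false: only 3 s of silence
fakeTime = 61000;
console.log(detector.isStalled());  // true: 60 s of silence
```

      In the reproduction script, markEvent() would be called inside the for await loop and isStalled() polled from the existing setInterval callback.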

      Who is impacted

      • This affects all customers using Change Streams.
      • It disrupts their users in case of a failover, e.g. when upgrading MongoDB.

      Ruling out other problem sources

      This is a problem with the node-mongodb-native driver because:

      1. MongoDB itself correctly elects a new primary node, as can be observed with rs.status() in mongosh.
      2. A reproduction with PyMongo in Python does not show this problem: the change events keep being printed even after member A has been stopped, and even after 60 seconds.
      3. Furthermore, with PyMongo we can restart member A and then stop member B and the events keep being printed.

      Discussion

      Our application keeps running normally for 60 seconds, except that no change stream events are published during that time, until the server crashes and restarts.

      We were thinking about using resumeAfter, but the 60-second blackout doesn't really make this a viable option. Neither the maxAwaitTimeMS nor the serverSelectionTimeoutMS option had any effect on this timeout.
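
      For reference, the resumeAfter approach mentioned above could look roughly like the sketch below. watchWithResume and maxRestarts are hypothetical names, not driver APIs. The loop can react only once the driver surfaces an error, so it would not remove the 60-second blackout described in this ticket.

```javascript
// Hypothetical helper (not a driver API). `collection` is anything with a
// watch(pipeline, options) method that returns an async-iterable change stream,
// which is how the reproduction script uses the driver's Collection.
async function watchWithResume(collection, onChange, maxRestarts = 5) {
  let resumeToken; // _id of the last change event that was fully handled
  let restarts = 0;
  for (;;) {
    // Without a token (no event handled yet) the new stream starts from "now",
    // so events that occurred in the meantime can be missed.
    const options = resumeToken === undefined ? {} : { resumeAfter: resumeToken };
    const stream = collection.watch([], options);
    try {
      for await (const change of stream) {
        await onChange(change);
        resumeToken = change._id; // the event's _id serves as its resume token
        restarts = 0;             // progress was made, so reset the restart budget
      }
      return; // the stream ended without an error
    } catch (error) {
      // Errors thrown by onChange also end up here; that event is then
      // delivered again after the restart.
      restarts += 1;
      if (restarts > maxRestarts) throw error;
      console.error(`change stream failed, restarting (${restarts}/${maxRestarts}):`, error.message);
    } finally {
      // Release the old cursor if the object offers close().
      if (typeof stream.close === 'function') await stream.close().catch(() => {});
    }
  }
}
```

      With the reproduction script, the for await loop would be replaced by a call such as watchWithResume(testCollection, (change) => console.log(change.operationType)).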

      -----

      Use Case

      As a... user of the ChangeStream
      I want... network errors to be handled automatically
      So that... I don't notice when the primary goes down and a new primary is elected

      User Experience

      • See the snippet above

      Dependencies

      • N/A

      Risks/Unknowns

      • N/A

      Acceptance Criteria

      Implementation Requirements

      • Handle MongoServerSelectionError in ChangeStream's _processErrorIteratorMode and
        _processErrorStreamMode, and treat it as a resumable error
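
      The classification asked for above might be expressed as a predicate along these lines. This is an illustrative sketch, not the driver's internal logic: isResumableClientError and the name set are hypothetical, and listing MongoNetworkError next to MongoServerSelectionError is an assumption made for the example. Errors are matched by name so the snippet runs without importing the driver.

```javascript
// Hypothetical predicate (not the driver's internal implementation): decides
// whether an error raised while iterating a change stream should trigger a
// resume attempt instead of being thrown to the caller.
const RESUMABLE_CLIENT_ERROR_NAMES = new Set([
  'MongoServerSelectionError', // no suitable server found within the timeout
  'MongoNetworkError',         // connection-level failure (assumed for this sketch)
]);

function isResumableClientError(error) {
  if (error == null || typeof error !== 'object') return false;
  return RESUMABLE_CLIENT_ERROR_NAMES.has(error.name);
}

// Stand-in for the error from the reproduction; the real driver class is not
// imported here so that the example runs on its own.
const selectionError = Object.assign(
  new Error('connect ECONNREFUSED 127.0.0.1:27017'),
  { name: 'MongoServerSelectionError' },
);
console.log(isResumableClientError(selectionError));        // true
console.log(isResumableClientError(new TypeError('oops'))); // false
```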

      Testing Requirements

      • Integration test to ensure that this error is handled and that the resume process works with the new primary

      Documentation Requirements

      • N/A

      Follow Up Requirements

      • All "non-server" errors must be resumable by all drivers in prose tests, see DRIVERS-3138

        1. error-log.txt
          7 kB
        2. main.mjs
          0.5 kB
        3. main.py
          0.8 kB

            Assignee:
            Sergey Zelenov
            Reporter:
            Peter Gassner