- Type: Bug
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: 6.14.2
- Component/s: Change Streams
Problem
Change Streams stop working after a failover, i.e. when a replica set's primary member becomes unavailable.
Setup
- Node.js 22
- npm install mongodb@6.14.2
- MongoDB replica set with three members: A (primary), B (secondary), C (secondary)
Reproduction
- Start the attached script with node main.mjs
- Observe how change stream events are printed to the console every second
- Stop the process of member A (an rs.stepDown() alone does not trigger the error, but it can precede stopping the process)
- Almost immediately, change stream events stop printing
- The application crashes after 60 seconds with a "MongoServerSelectionError" and "ECONNREFUSED 127.0.0.1:27017" (member A)
import { MongoClient } from 'mongodb';

const uri = 'mongodb://127.0.0.1:27017,127.0.0.1:27018,127.0.0.1:27019/test?replicaSet=rs0';
const client = await MongoClient.connect(uri);
const testCollection = client.db().collection('Test');

let iteration = 0;
setInterval(() => testCollection.insertOne({ i: iteration++ }), 1000);

for await (const change of testCollection.watch()) {
  console.log(`${change.operationType}: ${change.fullDocument.i}`);
}
Expectation
- In the case of a failover, node-mongodb-native should keep the change stream going without interruption.
- There should not be a 60-second blackout before the failover is noticed.
Who is impacted
- This affects all customers using Change Streams.
- It disrupts their users in case of a failover, e.g. when upgrading MongoDB.
Ruling out other problem sources
This is a problem with the node-mongodb-native driver because:
- MongoDB itself correctly elects a new primary, as can be observed with rs.status() in mongosh.
- A reproduction with PyMongo in Python does not show this problem: the change events keep being printed even after member A has been stopped, and even after 60 seconds.
- Furthermore, with PyMongo we can restart member A and then stop member B and the events keep being printed.
Discussion
Our application keeps running normally for 60 seconds, except that no change stream events are delivered during that time; the server then crashes and restarts.
We considered using resumeAfter, but the 60-second blackout makes this unviable. Neither the maxAwaitTimeMS nor the serverSelectionTimeoutMS option had any effect on this timeout.
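For reference, a minimal sketch of what the resumeAfter approach could look like: reopen the change stream with the last seen resume token whenever iteration fails. The helper name watchWithResume and its maxRestarts and retryDelayMs parameters are hypothetical and not part of the driver. It does not remove the blackout described above, since in the reproduction the error only surfaces after about 60 seconds.

```javascript
// Hypothetical helper (not a driver API): re-open a change stream with the
// last seen resume token after an error. `collection` is assumed to be a
// Collection from the mongodb driver; `onChange` is the caller's handler.
async function watchWithResume(
  collection,
  onChange,
  { pipeline = [], maxRestarts = 5, retryDelayMs = 1000 } = {}
) {
  let resumeToken; // _id of the last successfully processed change event
  let restarts = 0; // consecutive failed attempts

  for (;;) {
    // Pass resumeAfter only once a token has been recorded.
    const options = resumeToken ? { resumeAfter: resumeToken } : {};
    const stream = collection.watch(pipeline, options);
    try {
      for await (const change of stream) {
        await onChange(change);
        resumeToken = change._id; // the event's _id is its resume token
        restarts = 0; // progress was made, so reset the counter
      }
      return; // the stream was closed normally
    } catch (err) {
      restarts += 1;
      if (restarts > maxRestarts) throw err;
      console.warn(`change stream failed (${err.message}), restart ${restarts}`);
      await new Promise((resolve) => setTimeout(resolve, retryDelayMs));
    } finally {
      // Best-effort cleanup; ignore errors from an already broken stream.
      await stream.close().catch(() => {});
    }
  }
}
```

In the reproduction script, the for await loop would be replaced by something like `await watchWithResume(testCollection, (change) => console.log(change.operationType))`. Events that occur before the first token is recorded can still be missed, because the first attempt opens the stream without a starting point.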
- is related to: DRIVERS-3138 Test more resumable non-server error cases for change streams
- Needs Triage