Type: Bug
Resolution: Unresolved
Priority: Critical - P2
Affects Version/s: 6.17.0
Component/s: Change Streams
Description
When iterating through a change stream purely using `changeStream.tryNext()`, the stream's resumeToken is not updated correctly.
In many cases this is not an issue, since you can use `event._id` to get the resume token. However, if the stream is retried due to a "ResumableChangeStreamError", it restarts from the outdated resumeToken. This can be very far behind (we've had cases in production where it goes back multiple hours) and can lead to data consistency issues, because events are then no longer delivered in the correct order.
The issue does not occur in these cases:
- If the stream is idle for more than the maxAwaitTimeMS period, then `tryNext()` returns null and the resumeToken is updated. The test case below inserts documents often enough to avoid this case.
- When using `changeStream.next()` instead of `changeStream.tryNext()`, the resumeToken is updated correctly and the issue does not occur. In our case we specifically want the `tryNext()` functionality, to trigger .
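For illustration, here is a minimal sketch of tracking the resume token on the application side while polling with `tryNext()`. The interfaces and the helpers `pickResumeToken` and `pollOnce` are hypothetical and not part of the driver; they only assume that each event carries its token in `_id` and that the stream exposes `resumeToken`, as described above.

```typescript
// Structural stand-ins for the driver's types (assumed shapes, so the sketch
// compiles without the `mongodb` package installed).
interface ChangeEventLike {
  _id: unknown; // the event's own resume token
}

interface TryNextStreamLike {
  tryNext(): Promise<ChangeEventLike | null>;
  readonly resumeToken: unknown;
}

// Hypothetical helper: choose the token to persist after one tryNext() call.
// - If an event was returned, its _id is the most recent token.
// - If tryNext() returned null, fall back to the stream's own token, which
//   (per the description above) is updated in that case.
function pickResumeToken(
  previous: unknown,
  event: ChangeEventLike | null,
  streamToken: unknown
): unknown {
  if (event !== null) {
    return event._id;
  }
  return streamToken ?? previous;
}

// Hypothetical polling step: call tryNext() once, hand any event to the
// caller, and return the token the application should remember.
async function pollOnce(
  stream: TryNextStreamLike,
  previous: unknown,
  handle: (event: ChangeEventLike) => void
): Promise<unknown> {
  const event = await stream.tryNext();
  if (event !== null) {
    handle(event);
  }
  return pickResumeToken(previous, event, stream.resumeToken);
}
```

Note that this only keeps the application's copy of the token current. An internal retry after a "ResumableChangeStreamError" still uses the stream's own, possibly outdated, resumeToken.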
Reproducing
To reproduce the issue, see this script:
https://gist.github.com/rkistner/fef2013ca9eb86aee883fc80b8267382
First, run the script and notice that the "Stream resumeToken" never updates, while the "Change resumeToken" does update.
Then, to trigger the out-of-order issue, we need to introduce a ResumableChangeStreamError. On my machine I can reproduce this by testing against a single MongoDB 8.0 node, then restarting it while the script is running. In some cases this results in an InterruptedAtShutdown error, in which case we just restart the test. But in many cases it results in a CursorNotFound error (not directly visible), which leads to the stream being restarted with the old stream resumeToken and then hitting the "Resume token out of order" check.
I've included logs in the gist:
- Plain logs, just showing the resumeToken sequence.
- Logs with MONGODB_LOG_ALL=debug. This shows the change stream restarting with the incorrect resumeAfter token.
Workaround
The only workaround we have at the moment is to compare resumeTokens on the client, and re-create the stream when we notice the invalid ordering.
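A rough sketch of such a client-side check follows. It assumes resume tokens have the shape `{ _data: <hex string> }` and that comparing `_data` lexicographically reflects stream order. The token format is officially opaque, so treat this as an assumption rather than a guarantee. `compareResumeTokens` and `ResumeOrderGuard` are hypothetical names, not driver APIs.

```typescript
// Assumed token shape: { _data: "<hex string>" }.
interface ResumeTokenLike {
  _data: string;
}

function isResumeTokenLike(token: unknown): token is ResumeTokenLike {
  return (
    typeof token === "object" &&
    token !== null &&
    typeof (token as { _data?: unknown })._data === "string"
  );
}

// Returns -1, 0 or 1 depending on whether `a` sorts before, equal to,
// or after `b`, using plain string comparison of the hex payloads.
function compareResumeTokens(a: unknown, b: unknown): number {
  if (!isResumeTokenLike(a) || !isResumeTokenLike(b)) {
    throw new TypeError("unexpected resume token shape");
  }
  if (a._data < b._data) return -1;
  if (a._data > b._data) return 1;
  return 0;
}

// Remembers the newest token seen and rejects anything that is not
// strictly newer, which is the symptom of a restart from a stale token.
class ResumeOrderGuard {
  private last: unknown = null;

  get lastToken(): unknown {
    return this.last;
  }

  // Returns true if the token is in order and records it; returns false
  // (without recording) if the caller should re-create the stream.
  accept(token: unknown): boolean {
    if (this.last !== null && compareResumeTokens(token, this.last) <= 0) {
      return false;
    }
    this.last = token;
    return true;
  }
}
```

When `accept(event._id)` returns false, the application can close the stream and open a new one with `resumeAfter` set to `guard.lastToken`, instead of processing the out-of-order event.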
Depends on: NODE-4763 changestream tryNext should update the resume token when returning a change
Backlog