[JAVA-4155] Investigate intermittent failures of reactive `change-streams-resume-whitelist.json:change stream resumes after CursorNotFound` Created: 13/May/21 Updated: 17/May/21 Resolved: 17/May/21 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Test Failure |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Unknown |
| Reporter: | Valentin Kavalenka | Assignee: | Valentin Kavalenka |
| Resolution: | Done | Votes: | 0 |
| Labels: | flaky-tests | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Epic Link: | Investigate flaky tests | ||||||||||||||||
| Documentation Changes: | Not Needed | ||||||||||||||||
| Description |
|
change-streams-resume-whitelist.json:change stream resumes after CursorNotFound sometimes hangs when run via com.mongodb.reactivestreams.client.ChangeStreamsTest. Here are a couple of examples: |
| Comments |
| Comment by Valentin Kavalenka [ 17/May/21 ] | ||
|
Split to | ||
| Comment by Valentin Kavalenka [ 13/May/21 ] | ||
|
Found the root cause (explained below). This bug exists at least since eadbc5c8d85fd3efc21271a2155a068adf4346bf and allows change streams to miss change events each time AsyncQueryBatchCursor.next fails with a retryable error. Reproducing
These changes make the test change-streams-resume-whitelist.json:change stream resumes after CursorNotFound almost always hang when run via com.mongodb.reactivestreams.client.ChangeStreamsTest. ExplanationThe first change increases the probability that the insertOne operation is executed before the very first getMore (the one that fails with 43 CursorNotFound as a result of the configureFailPoint command). The second change increases the probability of AsyncChangeStreamBatchCursor not setting its resumeToken via cachePostBatchResumeToken before it retries the AsyncQueryBatchCursor.next (fails as a result of the very first getMore failing) and calls ChangeStreamOperation.setChangeStreamOptionsForResume. With resumeToken not set on retry, the second cursor created does continue listening where the first cursor (created before executing insertOne because of this waiting) stopped listening, and misses the change notification corresponding to the insertOne operation. FixingI need to make sense of the juggling around resumeToken in AsyncChangeStreamBatchCursor before I can come up with a solution. | ||
| Comment by Valentin Kavalenka [ 13/May/21 ] | ||
|
I don't think this is due to the logging level, because I can see the following killCursors command in the server logs (mongod-vq8d49p9/mongod.log), that was clearly sent by a test run:
However, there is no cursor ID 4715714928219243784 in the corresponding [client logs](https://evergreen.mongodb.com/lobster/evergreen/task/mongo_java_driver_tests_jdk8_unsecure__version~4.4_os~linux_topology~sharded_cluster_auth~noauth_ssl~nossl_jdk~jdk8_test_patch_ddefdb0cc29541f82d507318ef7f94ec03efcd39_609c76c7c9ec444f5206a57c_21_05_13_00_45_59/0/task#bookmarks=0%2C2305%2C2505%2C2519%2C3846&shareLine=2269). So the server does log what I would expect, but the attached server and client logs do not appear to be produced by the same run on Evergreen. The timestamps, on the other hand, match (the following is a client log entry corresponding to the only killCursors command issued by the test):
| ||
| Comment by Jeffrey Yemin [ 13/May/21 ] | ||
|
It's possible that the log verbosity configured for the server is not high enough to show what you're looking for. | ||
| Comment by Valentin Kavalenka [ 13/May/21 ] | ||
|
My best reproduction with TRACE logging enabled and with additional VAKO logs is on a sharded cluster (logs). As far as I can tell, in situations when the test hangs, it hangs by infinitely requesting getMore because the change stream never receives a change document corresponding to the insertOne operation of the test. So the test runner is blocked in the cursor.next until Evergreen terminates the test run. I wanted to see if server logs could clarify whether or not the server ever replied to getMore with a change stream document, but the logs attached by the Evergreen do not seem to correspond to the test run: neither the cursor ID 3414173270393235723 nor the configureFailPoint command are present in the logs. I will try to reproduce on a replica set, maybe with a different topology Evergreen attaches relevant logs, or maybe I am misinterpreting the lack of cursor ID / configureFailPoint in the sharded cluster logs. |