[SERVER-44112] The ChangeStream test fails on the latest sharded cluster Created: 18/Oct/19 Updated: 27/Oct/23 Resolved: 20/Nov/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dmitry Lukyanov (Inactive) | Assignee: | Justin Seyster |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Query 2019-11-04, Query 2019-11-18, Query 2019-12-02 | ||||||||
| Participants: | |||||||||
| Description |
|
There is one change stream JSON test that fails on the latest sharded cluster. Reproducible at least on c# and Java drivers. It looks like the reason of it that the latest server doesn't throw `280` error for the test "Change Stream should error when _id is projected out" anymore: |
| Comments |
| Comment by Justin Seyster [ 20/Nov/19 ] | |||||||||||
|
I'm closing this out with "Works as Designed," but dmitry.lukyanov, jmikola, I filed SPEC-1506 with my suggestion for how to update the tests in a way that would also stop this failure. The best case solution would be to implement both SPEC-1506 and | |||||||||||
| Comment by Jeremy Mikola [ 15/Nov/19 ] | |||||||||||
IIRC, SPEC-1308 was added specifically for the benefit of async drivers. The reporter originally opened this in regard to the Node driver, wherein I believe the iteration method may block until an error is returned or a document becomes available. If neither occurs, it may be possible for the tests to block indefinitely (or for a very long time) and negatively impact the test suite.
This was brought up in several conversations I've had with server engineers in the past year, and I understand the reasoning behind it. I don't see a universal solution that addresses the limitations of drivers with blocking iteration methods. Short of those drivers implementing some workaround to give up on blocking iteration (perhaps with some time out behavior), I think those driver authors should continue to amend the driver test plan with any advice as needed (akin to what was done in SPEC-1308). | |||||||||||
| Comment by Jeffrey Yemin [ 14/Nov/19 ] | |||||||||||
|
dmitry.lukyanov your question is best addressed by the driver team, not the server team. | |||||||||||
| Comment by Dmitry Lukyanov (Inactive) [ 14/Nov/19 ] | |||||||||||
|
Can we open a spec ticket to change the tests description? | |||||||||||
| Comment by Bernard Gorman [ 13/Nov/19 ] | |||||||||||
|
justin.seyster, yep: More broadly, though, tests should essentially never be written such that they expect an event or a failure to occur in a specific change stream batch, or on a particular getMore. $changeStream does not and cannot make any such guarantees; it only promises that events will arrive as quickly as possible and in the correct order. We've run into this problem numerous times in our own tests and the solution has invariably been, as Dave posted above, to assert that the event arrives at some point in the near future rather than on batch N:
| |||||||||||
| Comment by Justin Seyster [ 13/Nov/19 ] | |||||||||||
|
dmitry.lukyanov Sorry for the radio silence on this! After I was investigating on Thursday, I got busy with the Barony, as well as my own Evergreen failure which went critical. I spoke with charlie.swanson about Change Stream behavior, and he confirmed that the behavior we're observing is the intended behavior: a getMore on a tailable, await-data cursor should have a default timeout of 1sec, after which it returns an empty batch. I spent some time going through the spec, but I couldn't find any language describing this behavior, but I also couldn't find any language contradicting it. This is how we want change streams to behave though, because it gives the driver and client an updated post-batch resume token once a second. My opinion is that we should update the test specification to add that, if the iterator function can return an empty batch, the test should continue iterating until it gets either a non-empty batch or an error. I'm thinking of the line Dmitry quoted in the above comment. jmikola, it looks like you added that language in SPEC-1308. I'd be interested in your opinion of the proposed change. bernard.gorman Do you have any idea why EDIT: I just saw that there's already a perf ticket attached to | |||||||||||
| Comment by Justin Seyster [ 08/Nov/19 ] | |||||||||||
|
Quick note for myself: I checked the Evergreen logs to help narrow down what changed. Update: When I run the Java driver version of this test against the version of the server previous to 97cc7b, the first getMore immediately returns the expected error. When I run the same test against 97cc7b, it takes multiple getMore commands to get a response that is not an empty batch. I also wanted to rule out commit f80979 of the C# driver as a possible cause; that's the first commit in Evergreen where this failure occurs, and it seems to involve tailable cursors. I ran f80979 on Evergreen against an older version of the server, and it succeeds, meaning the failure can't be caused by a change in f80979. tl;dr: The failing test used to pass because it got its response in the first getMore. | |||||||||||
| Comment by Justin Seyster [ 02/Nov/19 ] | |||||||||||
|
dmitry.lukyanov Thanks for bearing with me! Good question about the spec. I'm going to have to get back to you on that next week. I'll need to ask the team to figure out which spec doc is relevant here. I was basing my initial answer off my read of the source code. It's clear that it was intentionally designed for the getMore to time out with no results but also no error when when the mongos doesn't see any events from the shards for a period of time. The duration that the mongos waits is controlled by the "awaitDataTimeout" parameter to the getMore command (which is 1,000ms by default). My expectation is that setting a very high "awaitDataTimeout" value would get the test passing again. That's probably not the route we want to take, though. My reading of the linked test specification is that "iterate changeStream once" refers to calling the driver's next() function exactly once without necessarily specifying how many times the next() function issues getNext commands to the server. I'm not familiar with these documents, though, so it's definitely possible that I'm misunderstanding. It sounds like the next step is to get on the same page about what the spec says for tailable+awaitData cursors timing out. I'll see what I can find next week. Query should also make sure to investigate if there's a performance regression in change streams here. Also, I misspoke in my previous message. I'm doing my testing on the "master" branch, which is 4.3, not 4.4. Sorry about that! If I do more testing, I'll try to do it on the specific commits you're using. Thanks again! | |||||||||||
| Comment by Dmitry Lukyanov (Inactive) [ 01/Nov/19 ] | |||||||||||
|
>What manual code changes did you add? the manual change to be able to call a server `getMore` command more than once. >are you referring to a driver "getMore" function or the server "getMore" command? to the server "getMore" command >The client needs to keep calling "getMore" until it sees results. I'm not familiar with this statement, can you point me out on the spec with it? >In the latter case, a "getMore" command that returns 0 results is considered correct behavior from the server yes, the successful `getMore` response contains no documents.
NOTE: this is the description of the test for changeStreams: https://github.com/mongodb/specifications/blob/master/source/change-streams/tests/README.rst#spec-test-runner In particular, this line tells us to iterate changeStream only once:
| |||||||||||
| Comment by Justin Seyster [ 01/Nov/19 ] | |||||||||||
|
dmitry.lukyanov Thanks for the additional info. Sorry, I'm not sure what you mean. What manual code changes did you add? When you say that you call "getMore" twice, are you referring to a driver "getMore" function or the server "getMore" command? In the former case, a driver "getMore" function that returns 0 results may be a driver bug, depending on the driver specification. (I'm not familiar with the C# driver specification.) In the latter case, a "getMore" command that returns 0 results is considered correct behavior from the server. The client needs to keep calling "getMore" until it sees results. When I ran the test locally using the Java driver, I saw that it would keep sending "getMore" commands forever if the server kept sending back empty responses. That said, if mongos used to reliably return the change in the first "getMore" batch but is now taking several seconds to return the result, that may be a real performance issue. By the way, my local testing was against 4.4 top-of-tree. I could try 4.3 next week to see if maybe that will reproduce the issue. Thanks for helping me investigate this! | |||||||||||
| Comment by Dmitry Lukyanov (Inactive) [ 01/Nov/19 ] | |||||||||||
|
>Are you able to reproduce it locally? yes, checked now on the server versions `4.3.0-1818-g6ca9be7` and `4.3.0-2014-g8010c51`.
As I understand the problem is related to this line: https://github.com/mongodb/mongo-csharp-driver/blob/master/tests/MongoDB.Driver.Tests/Specifications/change-streams/tests/change-streams-errors.json#L105 When I call `getMore` command, I don't see any errors. But then when I call `getMore` a second time (after manual code changes), the error with code 280 and message:
has been thrown. NOTE: the error is not reproducible on the server version `4.3.0-762-gf13641a`. UPDATE: The error can be thrown not necessarily after the second attempt. In my tests, I saw the third or the forth attempts when the error has been thrown | |||||||||||
| Comment by Justin Seyster [ 01/Nov/19 ] | |||||||||||
|
After I posted the above comment, I remembered that mongod/mongos only log a COMMAND if it exceeds the "slowms" threshold, which is 100ms by default. It is possible that there is a third getMore call in each test that isn't getting logged because it returns right away, perhaps with nreturned > 0. When I have some time I also plan to look at the logs for one of the successful runs to see how many getMore calls it has to make before it sees the error. | |||||||||||
| Comment by Justin Seyster [ 31/Oct/19 ] | |||||||||||
|
dmitry.lukyanov You mentioned that this reproduces on C# and Java. Are you able to reproduce it locally? Unfortunately, I was not able to make this happen using the Java test. (I didn't try to execute the C# test suite, because I don't have a Windows test machine readily available.) My initial thought was that maybe mongos was returning an empty batch in the getMore response (which is legal), causing the test to give up and fail. In my testing, though, the test keeps sending getMore requests when it gets empty batches from mongos, which is the correct behavior. In the logs from Evergreen, though, the test sends exactly two getMore commands and then gives up. (I'm looking at the mongos logs from this failure). (async = true)
(async = false)
The two mysteries are: | |||||||||||
| Comment by David Storch [ 23/Oct/19 ] | |||||||||||
|
I tried to reproduce this using the server's infrastructure for testing sharded clusters, but the change stream failed with the expected code (code:280, codeName:ChangeStreamFatalError).
What's the topology of the sharded cluster used by the drivers test? Is there any special configuration of the cluster we need in order to repro this? | |||||||||||
| Comment by Kaitlin Mahar [ 23/Oct/19 ] | |||||||||||
|
I believe the test expects the failure on the first getMore. The linked JSON test has drivers do the following: 1) create a change stream, which sends the initial aggregate command 2) do an insertOne to generate a new event 3) try to iterate the change stream to get the new event. this will trigger a getMore, since the initial batch returned by aggregate was empty. the getMore returns the error. | |||||||||||
| Comment by Ian Boros [ 23/Oct/19 ] | |||||||||||
|
Does this test expect the aggregation to fail immediately (as part of the 'aggregate' command being sent)? Does it attempt to run any getMores? I cannot tell from the logs provided. |