[SERVER-44112] The ChangeStream test fails on the latest sharded cluster Created: 18/Oct/19  Updated: 27/Oct/23  Resolved: 20/Nov/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dmitry Lukyanov (Inactive) Assignee: Justin Seyster
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-42723 New shard with new database can be ig... Closed
Operating System: ALL
Sprint: Query 2019-11-04, Query 2019-11-18, Query 2019-12-02
Participants:

 Description   

There is one change stream JSON test that fails on the latest sharded cluster. Reproducible at least with the C# and Java drivers.

Json test: https://github.com/mongodb/specifications/blob/master/source/change-streams/tests/change-streams-errors.json#L76

It looks like the reason is that the latest server no longer throws error `280` for the test "Change Stream should error when _id is projected out":
https://github.com/mongodb/specifications/blob/master/source/change-streams/tests/change-streams-errors.json#L103
As a result, our test runner doesn't catch the expected error and the test fails.
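
For reference, a minimal shell sketch of the scenario the JSON test exercises (collection name and document are illustrative): a change stream whose pipeline projects out _id, which the server is expected to reject with code 280 once an event arrives.

// Sketch only; assumes a sharded "test" collection as in the JSON test.
const cs = db.test.watch([{ $project: { _id: 0 } }]);
db.test.insertOne({ x: 1 });
// Iterating the stream is expected to fail with code 280 (ChangeStreamFatalError),
// because the pipeline removed the _id (resume token) field from the event.
cs.next();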

The evergreen task: https://evergreen.mongodb.com/task_log_raw/dot_net_driver_unsecure_tests__version~latest_os~windows_64_topology~sharded_cluster_auth~noauth_ssl~nossl_test_f809794cf52538bd3f6b5c5ba5b9a74b17257fb7_19_10_17_23_46_15/2?type=T#L2824



 Comments   
Comment by Justin Seyster [ 20/Nov/19 ]

I'm closing this out with "Works as Designed," but SERVER-44356 should fix the failing driver tests nonetheless. If that turns out not to be the case, please feel free to re-open or file a new ticket.

dmitry.lukyanov, jmikola, I filed SPEC-1506 with my suggestion for how to update the tests in a way that would also stop this failure. The best case solution would be to implement both SPEC-1506 and SERVER-44356.

Comment by Jeremy Mikola [ 15/Nov/19 ]

> My opinion is that we should update the test specification to add that, if the iterator function can return an empty batch, the test should continue iterating until it gets either a non-empty batch or an error. I'm thinking of the line Dmitry quoted in the above comment. Jeremy Mikola, it looks like you added that language in SPEC-1308. I'd be interested in your opinion of the proposed change.

IIRC, SPEC-1308 was added specifically for the benefit of async drivers. The reporter originally opened that ticket in regard to the Node driver, where I believe the iteration method may block until an error is returned or a document becomes available. If neither occurs, the tests could block indefinitely (or for a very long time) and negatively impact the test suite.

> More broadly, though, tests should essentially never be written such that they expect an event or a failure to occur in a specific change stream batch, or on a particular getMore.

This was brought up in several conversations I've had with server engineers over the past year, and I understand the reasoning behind it. I don't see a universal solution that addresses the limitations of drivers with blocking iteration methods. Short of those drivers implementing some workaround to give up on blocking iteration (perhaps with timeout behavior), I think those driver authors should continue to amend the driver test plan with advice as needed (akin to what was done in SPEC-1308).
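
As a rough illustration of the kind of workaround being described, a sketch using the shell's cursor API (the collection name and the 10-second budget are arbitrary assumptions): bound iteration with a client-side deadline instead of blocking indefinitely.

// Sketch: give up on iteration after a fixed budget rather than blocking forever.
const cs = db.test.watch([]);
const deadline = Date.now() + 10 * 1000;  // assumed 10-second budget
let event = null, capturedError = null;
while (event === null && Date.now() < deadline) {
    try {
        event = cs.tryNext();   // null when the server returned an empty batch
    } catch (e) {
        capturedError = e;      // stop on the first error instead of retrying
        break;
    }
}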

Comment by Jeffrey Yemin [ 14/Nov/19 ]

dmitry.lukyanov your question is best addressed by the driver team, not the server team.

Comment by Dmitry Lukyanov (Inactive) [ 14/Nov/19 ]

Can we open a spec ticket to change the test descriptions?

Comment by Bernard Gorman [ 13/Nov/19 ]

justin.seyster, yep: SERVER-44356. Briefly, we now open an additional cursor on the config servers to monitor for new shards, and so the latency of change streams may be limited by the write rate on the config servers. We're looking into ways to minimise or alleviate this latency.

More broadly, though, tests should essentially never be written such that they expect an event or a failure to occur in a specific change stream batch, or on a particular getMore. $changeStream does not and cannot make any such guarantees; it only promises that events will arrive as quickly as possible and in the correct order. We've run into this problem numerous times in our own tests and the solution has invariably been, as Dave posted above, to assert that the event arrives at some point in the near future rather than on batch N:

assert.soon(() => changeStream.hasNext());

Comment by Justin Seyster [ 13/Nov/19 ]

dmitry.lukyanov Sorry for the radio silence on this! After investigating on Thursday, I got busy with the Barony, as well as my own Evergreen failure, which went critical.

I spoke with charlie.swanson about change stream behavior, and he confirmed that the behavior we're observing is intended: a getMore on a tailable, await-data cursor should have a default timeout of 1 second, after which it returns an empty batch. I spent some time going through the spec; I couldn't find any language describing this behavior, but I also couldn't find any language contradicting it. This is how we want change streams to behave, though, because it gives the driver and client an updated post-batch resume token once a second.

My opinion is that we should update the test specification to add that, if the iterator function can return an empty batch, the test should continue iterating until it gets either a non-empty batch or an error. I'm thinking of the line Dmitry quoted in the above comment. jmikola, it looks like you added that language in SPEC-1308. I'd be interested in your opinion of the proposed change.
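
A minimal sketch of the proposed runner behavior, using the shell's tryNext() as a stand-in for an iterator that can return an empty result (names are illustrative):

// Keep iterating until we get either a non-empty batch (a document) or an error.
const cs = db.test.watch([{ $project: { _id: 0 } }]);
let doc = null, capturedError = null;
while (doc === null && capturedError === null) {
    try {
        doc = cs.tryNext();     // null means this getMore's batch was empty; try again
    } catch (e) {
        capturedError = e;      // e.g. the expected code-280 error
    }
}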

bernard.gorman Do you have any idea why SERVER-42723 might have changed the timing for this test? Even if we modify the test so that it passes again, I think we should address the performance change.

EDIT: I just saw that there's already a perf ticket attached to SERVER-42723: BF-15221. It's likely that this issue has the same cause.

Comment by Justin Seyster [ 08/Nov/19 ]

Quick note for myself: I checked the Evergreen logs to help narrow down what changed.
The first task that failed with this error was using revision dab2ae229660af56a786a084ebc36666c4dd6a91 of the server.
The task before that (which succeeded) was using revision 7264dd03a981ea9cc86dcdc073e7a6dedaa995af.

Update:
I did some more experimenting, and this test failure appears to be related to commit 97cc7b5838db4ef13ede3149c44bceca8f5c2977 of the server (SERVER-42723).

When I run the Java driver version of this test against the server version immediately preceding 97cc7b, the first getMore immediately returns the expected error. When I run the same test against 97cc7b, it takes multiple getMore commands to get a response that is not an empty batch.

I also wanted to rule out commit f80979 of the C# driver as a possible cause; that's the first commit in Evergreen where this failure occurs, and it seems to involve tailable cursors. I ran f80979 on Evergreen against an older version of the server, and it succeeds, meaning the failure can't be caused by a change in f80979.

tl;dr: The failing test used to pass because it got its response in the first getMore. SERVER-42723 changed mongos/mongod behavior so that it now takes multiple getMore commands to get the expected error, and the C# driver is not designed to retry the getMore, which is the cause of this failure.

Comment by Justin Seyster [ 02/Nov/19 ]

dmitry.lukyanov Thanks for bearing with me! Good question about the spec. I'm going to have to get back to you on that next week. I'll need to ask the team to figure out which spec doc is relevant here.

I was basing my initial answer on my reading of the source code. It's clear that it was intentionally designed for the getMore to time out with no results but also no error when the mongos doesn't see any events from the shards for a period of time. The duration the mongos waits is controlled by the "awaitDataTimeout" parameter to the getMore command (which is 1,000ms by default). My expectation is that setting a very high "awaitDataTimeout" value would get the test passing again. That's probably not the route we want to take, though.
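
As a sketch of what that would look like on the wire, assuming the wait is surfaced to clients as the getMore command's maxTimeMS option (the cursor id and collection name below are placeholders):

// Sketch: ask the tailable, awaitData cursor to wait longer before returning an empty batch.
db.runCommand({
    getMore: NumberLong("123456789"),  // placeholder cursor id
    collection: "test",
    maxTimeMS: 60000                   // wait up to 60 seconds instead of the ~1s default
});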

My reading of the linked test specification is that "iterate changeStream once" refers to calling the driver's next() function exactly once, without necessarily specifying how many times the next() function issues getMore commands to the server. I'm not familiar with these documents, though, so it's definitely possible that I'm misunderstanding.

It sounds like the next step is to get on the same page about what the spec says for tailable+awaitData cursors timing out. I'll see what I can find next week. The Query team should also make sure to investigate whether there's a performance regression in change streams here.

Also, I misspoke in my previous message. I'm doing my testing on the "master" branch, which is 4.3, not 4.4. Sorry about that! If I do more testing, I'll try to do it on the specific commits you're using. Thanks again!

Comment by Dmitry Lukyanov (Inactive) [ 01/Nov/19 ]

>What manual code changes did you add? 

The manual change was to allow calling the server `getMore` command more than once.

>are you referring to a driver "getMore" function or the server "getMore" command?

to the server "getMore" command

>The client needs to keep calling "getMore" until it sees results.

I'm not familiar with this statement; can you point me to where the spec states it?

>In the latter case, a "getMore" command that returns 0 results is considered correct behavior from the server

yes, the successful `getMore` response contains no documents.

 

NOTE: this is the description of the test for changeStreams: https://github.com/mongodb/specifications/blob/master/source/change-streams/tests/README.rst#spec-test-runner

In particular, this line tells us to iterate changeStream only once:

If there was no error and result.error is set, iterate changeStream once and capture any error.
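
For context, a minimal sketch of how a runner might implement that step (the driver API names are illustrative; the open question in this thread is whether that single next() call may issue more than one server getMore underneath):

// "Iterate changeStream once and capture any error."
let capturedError = null;
try {
    changeStream.next();    // a single driver-level iteration
} catch (e) {
    capturedError = e;      // compared against the expected result.error afterwards
}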

Comment by Justin Seyster [ 01/Nov/19 ]

dmitry.lukyanov Thanks for the additional info. Sorry, I'm not sure what you mean. What manual code changes did you add? When you say that you call "getMore" twice, are you referring to a driver "getMore" function or the server "getMore" command?

In the former case, a driver "getMore" function that returns 0 results may be a driver bug, depending on the driver specification. (I'm not familiar with the C# driver specification.) In the latter case, a "getMore" command that returns 0 results is considered correct behavior from the server. The client needs to keep calling "getMore" until it sees results. When I ran the test locally using the Java driver, I saw that it would keep sending "getMore" commands forever if the server kept sending back empty responses.

That said, if mongos used to reliably return the change in the first "getMore" batch but is now taking several seconds to return the result, that may be a real performance issue.

By the way, my local testing was against 4.4 top-of-tree. I could try 4.3 next week to see if maybe that will reproduce the issue. Thanks for helping me investigate this!

Comment by Dmitry Lukyanov (Inactive) [ 01/Nov/19 ]

>Are you able to reproduce it locally?

Yes, I checked just now on server versions `4.3.0-1818-g6ca9be7` and `4.3.0-2014-g8010c51`.

 

As I understand it, the problem is related to this line: https://github.com/mongodb/mongo-csharp-driver/blob/master/tests/MongoDB.Driver.Tests/Specifications/change-streams/tests/change-streams-errors.json#L105

When I call the `getMore` command, I don't see any errors. But when I call `getMore` a second time (after manual code changes), an error with code 280 and the following message is thrown:

Command getMore failed: Encountered an event whose _id field, which contains the resume token, was modified by the pipeline. Modifying the _id field of an event makes it impossible to resume the stream from that point. Only transformations that retain the unmodified _id field are allowed.

NOTE: the error is not reproducible on the server version `4.3.0-762-gf13641a`.

UPDATE: The error is not necessarily thrown on the second attempt. In my tests, I also saw it thrown on the third or fourth attempt.

Comment by Justin Seyster [ 01/Nov/19 ]

After I posted the above comment, I remembered that mongod/mongos only log a COMMAND entry if the operation exceeds the "slowms" threshold, which is 100ms by default. It is possible that there is a third getMore call in each test that isn't getting logged because it returns right away, perhaps with nreturned > 0.
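
One way to rule that out would be to lower the slow-op threshold so every command shows up in the log. A sketch, assuming the profile/setProfilingLevel mechanism can be used to set slowms on the mongos as well:

// Sketch: log every operation regardless of duration by setting slowms to -1.
db.setProfilingLevel(0, { slowms: -1 });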

When I have some time I also plan to look at the logs for one of the successful runs to see how many getMore calls it has to make before it sees the error.

Comment by Justin Seyster [ 31/Oct/19 ]

dmitry.lukyanov You mentioned that this reproduces on C# and Java. Are you able to reproduce it locally? Unfortunately, I was not able to make this happen using the Java test. (I didn't try to execute the C# test suite, because I don't have a Windows test machine readily available.)

My initial thought was that maybe mongos was returning an empty batch in the getMore response (which is legal), causing the test to give up and fail. In my testing, though, the test keeps sending getMore requests when it gets empty batches from mongos, which is the correct behavior.

In the logs from Evergreen, though, the test sends exactly two getMore commands and then gives up. (I'm looking at the mongos logs from this failure).

(async = true)

./mci/8f82fe232dda6f5a77d481ac8871feff/drivers-tools/.evergreen/orchestration/db/mongo-65jwqy/mongos.log:2019-10-29T19:14:58.371+0000 I  COMMAND  [conn2171] command change-stream-tests.test command: getMore { getMore: 6199935151688990082, collection: "test", $db: "change-stream-tests", lsid: { id: UUID("328c2bd5-eb89-41b4-9bb3-8ea505c3059b") }, $clusterTime: { clusterTime: Timestamp(1572376497, 24), signature: { hash: BinData(0, 79F3B098B291669A30AE2504FD09FB07BE31D898), keyId: 6753301787618312232 } } } originatingCommand: { aggregate: "test", pipeline: [ { $changeStream: {} }, { $project: { _id: 0 } } ], cursor: {}, $db: "change-stream-tests", lsid: { id: UUID("328c2bd5-eb89-41b4-9bb3-8ea505c3059b") } } nShards:2 cursorid:6199935151688990082 numYields:0 nreturned:0 reslen:306 protocol:op_msg 1000ms
./mci/8f82fe232dda6f5a77d481ac8871feff/drivers-tools/.evergreen/orchestration/db/mongo-65jwqy/mongos.log:2019-10-29T19:14:59.632+0000 I  COMMAND  [conn2173] command change-stream-tests.test command: getMore { getMore: 2518998828756750331, collection: "test", $db: "change-stream-tests", lsid: { id: UUID("07b8c00b-6bbe-4ea9-aa0b-b018bb41b937") }, $clusterTime: { clusterTime: Timestamp(1572376498, 34), signature: { hash: BinData(0, B297D66CED85ED2C9889BE8CDF2B93B35F696D85), keyId: 6753301787618312232 } } } originatingCommand: { aggregate: "test", pipeline: [ { $changeStream: {} }, { $project: { _id: 0 } } ], cursor: {}, $db: "change-stream-tests", lsid: { id: UUID("07b8c00b-6bbe-4ea9-aa0b-b018bb41b937") } } nShards:2 cursorid:2518998828756750331 numYields:0 nreturned:0 reslen:306 protocol:op_msg 1001ms

(async = false)

./mci/8f82fe232dda6f5a77d481ac8871feff/drivers-tools/.evergreen/orchestration/db/mongo-65jwqy/mongos.log:2019-10-29T19:25:43.681+0000 I  COMMAND  [conn7919] command change-stream-tests.test command: getMore { getMore: 9096175620065252820, collection: "test", $db: "change-stream-tests", lsid: { id: UUID("2fd89215-7d08-4067-90c1-27912443afbe") }, $clusterTime: { clusterTime: Timestamp(1572377142, 60), signature: { hash: BinData(0, 19DB81E36C8F74DECE6442DC1F232F516E1F06C5), keyId: 6753301787618312232 } } } originatingCommand: { aggregate: "test", pipeline: [ { $changeStream: {} }, { $project: { _id: 0 } } ], cursor: {}, $db: "change-stream-tests", lsid: { id: UUID("2fd89215-7d08-4067-90c1-27912443afbe") } } nShards:2 cursorid:9096175620065252820 numYields:0 nreturned:0 reslen:306 protocol:op_msg 1001ms
./mci/8f82fe232dda6f5a77d481ac8871feff/drivers-tools/.evergreen/orchestration/db/mongo-65jwqy/mongos.log:2019-10-29T19:25:44.979+0000 I  COMMAND  [conn7921] command change-stream-tests.test command: getMore { getMore: 3119740060742483673, collection: "test", $db: "change-stream-tests", lsid: { id: UUID("08c1b04c-7ed5-4fb1-83e5-18a94f52c9b8") }, $clusterTime: { clusterTime: Timestamp(1572377143, 34), signature: { hash: BinData(0, 4D757549276B6B6BA660E1E7F8E74BB2F4307AA2), keyId: 6753301787618312232 } } } originatingCommand: { aggregate: "test", pipeline: [ { $changeStream: {} }, { $project: { _id: 0 } } ], cursor: {}, $db: "change-stream-tests", lsid: { id: UUID("08c1b04c-7ed5-4fb1-83e5-18a94f52c9b8") } } nShards:2 cursorid:3119740060742483673 numYields:0 nreturned:0 reslen:306 protocol:op_msg 1001ms

The two mysteries are:
1) Why is mongos taking so long to respond with the change? The "await data" timeout is 1 second, meaning that the mongos has no events for two full seconds. Not impossible, but longer than I would expect.
2) Why is the driver giving up? Dmitry, if you can reproduce this locally, would it be possible for you to use the debugger or logging to determine why the driver stops retrying? That would be my next step if I could reproduce this and/or if I had a better understanding of the C# driver.

Comment by David Storch [ 23/Oct/19 ]

I tried to reproduce this using the server's infrastructure for testing sharded clusters, but the change stream failed with the expected code (code:280, codeName:ChangeStreamFatalError).

(function() {
"use strict";

// Two-shard cluster; shard the collection on _id and split at {_id: 0}.
let st = new ShardingTest({shards: 2});
const testDb = st.s.getDB("test");
st.shardColl(testDb.c, {_id: 1}, {_id: 0}, {_id: 0});

// Project _id out of the change stream events, then generate an event.
let changeStream = testDb.c.watch([{$project: {_id: 0}}]);
assert.writeOK(testDb.c.insert({_id: 1}));
// Iterating should eventually fail with code 280 (ChangeStreamFatalError).
assert.soon(() => changeStream.hasNext());
}());

What's the topology of the sharded cluster used by the drivers test? Is there any special configuration of the cluster we need in order to repro this?

Comment by Kaitlin Mahar [ 23/Oct/19 ]

I believe the test expects the failure on the first getMore. The linked JSON test has drivers do the following:

1) create a change stream, which sends the initial aggregate command

2) do an insertOne to generate a new event

3) try to iterate the change stream to get the new event. This triggers a getMore, since the initial batch returned by aggregate was empty, and that getMore returns the error (see the sketch after this list).
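
A rough command-level sketch of that sequence (the collection name and inserted document are illustrative, and the shell helpers stand in for what a driver would send):

// 1) Open the change stream; the aggregate's firstBatch is empty.
const agg = db.runCommand({
    aggregate: "test",
    pipeline: [{ $changeStream: {} }, { $project: { _id: 0 } }],
    cursor: {}
});
// 2) Generate an event.
db.test.insertOne({ x: 1 });
// 3) Iterate: the driver issues a getMore, and the test expects it to fail with code 280.
const more = db.runCommand({ getMore: agg.cursor.id, collection: "test" });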

Comment by Ian Boros [ 23/Oct/19 ]

Does this test expect the aggregation to fail immediately (as part of the 'aggregate' command being sent)? Does it attempt to run any getMores? I cannot tell from the logs provided.
