[SERVER-37786] Reject change streams which modify or project-out the resume token Created: 26/Oct/18 Updated: 29/Oct/23 Resolved: 03/Feb/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.8 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Charlie Swanson | Assignee: | Bernard Gorman |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||
| Backwards Compatibility: | Minor Change | ||||||||||||||||||||||||||||||||||||||||||||||||
| Backport Requested: |
v4.0
|
||||||||||||||||||||||||||||||||||||||||||||||||
| Sprint: | Query 2018-12-17, Query 2018-12-31, Query 2019-01-14, Query 2019-01-28, Query 2019-02-11 | ||||||||||||||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
For example, this pipeline:
Such a pipeline is incompatible with the drivers protocol, and will fail to resume in the face of a network outage. It is not intended to be supported, so we should ban it to provide a more predictable failure scenario. Now that we have completed |
| Comments |
| Comment by Bernard Gorman [ 22/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
shane.harvey, ross.lawley: I've pushed a followup patch which adds a new error code, 'ChangeStreamFatalError', to signify that the stream cannot be resumed. getMore commands may throw this exception in cases where the user's pipeline modifies or discards the resume token. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 22/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Bernard Gorman', 'username': 'gormanb', 'email': 'bernard.gorman@gmail.com'}Message: | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Bernard Gorman [ 06/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
david.storch: returning CursorKilled would be convenient here, since it would mean that no further work is necessary on the part of the drivers. My concern would be that CursorKilled may be confusing or not very explanatory to the user/application to which the exception is returned. As an alternative, we could add a more accurate code - something like ErrorCodes::InvalidChangeStreamPipeline? Edit: we'd only need a single name to replace both codes, since they are functionally equivalent. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by David Storch [ 06/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
bernard.gorman the server team does not promise that unnamed codes remain stable. If the solution here is indeed to have drivers add to the list of non-retryable errors for change streams, then we will need to provide a name (or names?) for codes 51059 and 51060. Alternatively, would it make sense for the server to return CursorKilled in this scenario? | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Bernard Gorman [ 05/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
shane.harvey: ah, that explains it. Drivers will need to add codes 51059 and 51060 to the list of non-retryable errors, since these indicate that the user must rework their pipeline to avoid modifying _id before the stream can succeed. Edit: marked this ticket as "Driver Changes Needed" to reflect the above. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Shane Harvey [ 05/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
bernard.gorman I am also seeing the behavior ross.lawley describes. Mongos does not return an error and does not ever return any documents (even waiting 10+ minutes). I can reproduce this on a cluster with only a single shard. Edit: I see the issue is in the driver's ChangeStream spec. Drivers run getMore and the server responds with the "Encountered an event whose _id field, which contains the resume token" error, however the driver spec requires that drivers recreate a new cursor after "any server error response from a getMore command excluding those containing the following error codes: 11601 (Interrupted), 136 (CappedPositionLost), and 237 (CursorKilled)." I think this retryable error definition is too broad because it can lead to infinite retry loops like in this ticket. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Bernard Gorman [ 05/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
ross.lawley: no, this is behaving as expected. What you're seeing is a consequence of the fact that, on a sharded cluster, the mongoS cannot return any result from any of the shards at time T until it has received notification from all other shards that their oplogs have progressed beyond T. This is because the stream is sorted by clusterTime, and so before we return an event at time T we need to have confirmation from all other shards that they will never return an event that occurred before T. So in this case, you inserted a document via mongoS to one of the shards in the cluster, thereby advancing that shard to oplog time T. But the other shard(s) that were NOT written to may still be at time (T-N), and so we cannot return the insert event. On a sharded cluster, each shard writes a no-op into its oplog every 10 seconds to ensure that its optime is always steadily increasing. If there is no other activity on those shards in the interim, then it may take up 10 seconds for the shards to report to mongoS that they have surpassed T, and that it is now safe to return the insert event. In our tests, we generally use something like assert.soon(() => <condition>) in order to account for the fact that a stream via mongoS may not immediately return a write on the next getMore. In the case of your most recent comment, if you either run the getMore command repeatedly or wait ~10 seconds before issuing it, you should find that it throws the "cannot modify _id" exception before you see the first actual event. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ross Lawley [ 05/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Interestingly bernard.gorman, I'm seeing a different result in the shell:
Tested on: mongodb-linux-x86_64-ubuntu1804-4.1.7-231-g95c5d57d8a Is this a regression or am I using an older version than you? | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ross Lawley [ 05/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Hi bernard.gorman, Thanks for the follow up. In our driver test we also still see nothing back from the getMore on Mongos even though new documents are added. However, when the same test is run against a replicaset it throws the server error. It looks like from your test case in the shell that it should work as expected, I'll try and investigate further and see if I can determine the cause in the driver. Ross | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Bernard Gorman [ 05/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Hi ross.lawley, No, it's not necessary for the driver to set awaitData on the command. The reason you're not seeing an exception is because the validation is performed as change stream events are pulled through the pipeline, not at parse-time. In your example above, the first batch that you received from mongoS is empty because there have not been any events yet, and therefore nothing to validate. Here's an example of the validation in action:
| ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ross Lawley [ 05/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
bernard.gorman I can see the assertion when using Java driver with a replicaset, however when using a Mongos its returning a document back to the driver rather than throwing the same assertion.
Looking at the git commit there is a check to see if the command has set tailable await.
The Java driver does not add awaitData: true to the aggregate command that is sent via the change stream operation. Its unclear that it is a requirement of the Change stream specification. Previously, the server returned documents with out the _id field, rather than an empty array and then the driver validation ensured that the _id field was present. Can you clarify that awaitData: true is required in the command for change streams? I believe other drivers may not set the value and then not trigger the error. | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 04/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Ross Lawley', 'email': 'ross.lawley@gmail.com', 'username': 'rozza'}Message: Fix changestream _id projection test As of | ||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 03/Feb/19 ] | ||||||||||||||||||||||||||||||||||||||||||||
|
Author: {'name': 'Bernard Gorman', 'email': 'bernard.gorman@gmail.com', 'username': 'gormanb'}Message: |