[SERVER-37786] Reject change streams which modify or project-out the resume token Created: 26/Oct/18  Updated: 29/Oct/23  Resolved: 03/Feb/19

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: 4.1.8

Type: Improvement Priority: Major - P3
Reporter: Charlie Swanson Assignee: Bernard Gorman
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by JAVA-2828 ChangeStreamIterable withDocumentClas... Closed
is depended on by SERVER-38414 Upgrade/Downgrade testing for change ... Closed
is depended on by SERVER-38408 Return postBatchResumeToken from all ... Closed
Documented
is documented by DOCS-12423 Docs for SERVER-37786: Reject change ... Closed
Problem/Incident
Related
related to PHPLIB-446 Skip tests that modify resume token f... Closed
related to SERVER-37669 Allow change streams to project out t... Backlog
is related to JAVA-3163 Re-enable the 'should throw if the _i... Closed
Backwards Compatibility: Minor Change
Backport Requested:
v4.0
Sprint: Query 2018-12-17, Query 2018-12-31, Query 2019-01-14, Query 2019-01-28, Query 2019-02-11
Participants:

 Description   

For example, this pipeline:

db.x.watch([{$replaceRoot: {newRoot: "$fullDocument"}}])

Such a pipeline is incompatible with the drivers' resume protocol, and will fail to resume in the face of a network outage. It was never intended to be supported, so we should reject it outright to provide a more predictable failure scenario. Now that we have completed SERVER-35904, it shouldn't be very difficult to verify that the _id field is preserved by the pipeline.
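Conceptually, the check this ticket adds compares each emitted event's _id against the resume token the server computed for that event. The following is a minimal sketch of that comparison, not the actual server code; the function name and token value are illustrative.

```python
def check_resume_token_preserved(event: dict, expected_token: dict) -> None:
    """Raise if the user's pipeline modified or projected out the event's _id."""
    if event.get("_id") != expected_token:
        raise ValueError(
            "Encountered an event whose _id field, which contains the resume "
            f"token, was modified by the pipeline. Expected: {expected_token} "
            f"but found: {event.get('_id')!r}"
        )

# A pipeline that leaves _id untouched passes the check:
token = {"_data": "825C59..."}  # illustrative resume-token data
check_resume_token_preserved({"_id": token, "fullDocument": {"a": 1}}, token)

# One that replaces _id (e.g. {$project: {_id: "replaced"}}) is rejected:
try:
    check_resume_token_preserved({"_id": "replaced"}, token)
except ValueError:
    pass  # the stream fails with an error rather than silently losing resumability
```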



 Comments   
Comment by Bernard Gorman [ 22/Feb/19 ]

shane.harvey, ross.lawley: I've pushed a followup patch which adds a new error code, 'ChangeStreamFatalError', to signify that the stream cannot be resumed. getMore commands may throw this exception in cases where the user's pipeline modifies or discards the resume token.

Comment by Githook User [ 22/Feb/19 ]

Author:

{'name': 'Bernard Gorman', 'username': 'gormanb', 'email': 'bernard.gorman@gmail.com'}

Message: SERVER-37786 Add new ChangeStreamFatalError code
Branch: master
https://github.com/mongodb/mongo/commit/20fc08af78154399615312db92952fdcd5c61088

Comment by Bernard Gorman [ 06/Feb/19 ]

david.storch: returning CursorKilled would be convenient here, since it would mean that no further work is necessary on the part of the drivers. My concern would be that CursorKilled may be confusing or not very explanatory to the user/application to which the exception is returned. As an alternative, we could add a more accurate code - something like ErrorCodes::InvalidChangeStreamPipeline?

Edit: we'd only need a single name to replace both codes, since they are functionally equivalent.

Comment by David Storch [ 06/Feb/19 ]

bernard.gorman the server team does not promise that unnamed codes remain stable. If the solution here is indeed to have drivers add to the list of non-retryable errors for change streams, then we will need to provide a name (or names?) for codes 51059 and 51060.

Alternatively, would it make sense for the server to return CursorKilled in this scenario?

Comment by Bernard Gorman [ 05/Feb/19 ]

shane.harvey: ah, that explains it. Drivers will need to add codes 51059 and 51060 to the list of non-retryable errors, since these indicate that the user must rework their pipeline to avoid modifying _id before the stream can succeed.

Edit: marked this ticket as "Driver Changes Needed" to reflect the above.

Comment by Shane Harvey [ 05/Feb/19 ]

bernard.gorman I am also seeing the behavior ross.lawley describes. Mongos does not return an error and does not ever return any documents (even waiting 10+ minutes). I can reproduce this on a cluster with only a single shard.

Edit: I see the issue is in the drivers' ChangeStream spec. Drivers run getMore and the server responds with the "Encountered an event whose _id field, which contains the resume token..." error; however, the driver spec requires that drivers recreate a new cursor after "any server error response from a getMore command excluding those containing the following error codes: 11601 (Interrupted), 136 (CappedPositionLost), and 237 (CursorKilled)."

I think this retryable error definition is too broad because it can lead to infinite retry loops like in this ticket.
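The driver-side resume decision described above can be sketched as follows. The error-code numbers are the real ones quoted in this thread; the function name and structure are illustrative, not taken from any actual driver.

```python
# Codes the drivers' ChangeStream spec already treats as non-resumable:
INTERRUPTED = 11601
CAPPED_POSITION_LOST = 136
CURSOR_KILLED = 237
# Codes added by this ticket (later named ChangeStreamFatalError): the
# pipeline modified or discarded the resume token, so resuming cannot help.
RESUME_TOKEN_MODIFIED = (51059, 51060)

NON_RESUMABLE = {INTERRUPTED, CAPPED_POSITION_LOST, CURSOR_KILLED,
                 *RESUME_TOKEN_MODIFIED}

def should_resume(error_code: int) -> bool:
    """True if the driver should recreate the cursor after a getMore error."""
    return error_code not in NON_RESUMABLE
```

With only the original three exclusions, a stream whose pipeline projects out _id would be resumed forever, hitting the same error on every getMore; adding 51059/51060 to the set surfaces the error to the user instead.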

Comment by Bernard Gorman [ 05/Feb/19 ]

ross.lawley: no, this is behaving as expected. What you're seeing is a consequence of the fact that, on a sharded cluster, the mongoS cannot return any result from any of the shards at time T until it has received notification from all other shards that their oplogs have progressed beyond T. This is because the stream is sorted by clusterTime, and so before we return an event at time T we need to have confirmation from all other shards that they will never return an event that occurred before T.

So in this case, you inserted a document via mongoS to one of the shards in the cluster, thereby advancing that shard to oplog time T. But the other shard(s) that were NOT written to may still be at time (T-N), and so we cannot return the insert event. On a sharded cluster, each shard writes a no-op into its oplog every 10 seconds to ensure that its optime is always steadily increasing. If there is no other activity on those shards in the interim, then it may take up to 10 seconds for the shards to report to mongoS that they have surpassed T, and that it is now safe to return the insert event.

In our tests, we generally use something like assert.soon(() => <condition>) in order to account for the fact that a stream via mongoS may not immediately return a write on the next getMore. In the case of your most recent comment, if you either run the getMore command repeatedly or wait ~10 seconds before issuing it, you should find that it throws the "cannot modify _id" exception before you see the first actual event.
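The merge rule described above amounts to a low-water-mark check: mongoS may only release a buffered event with clusterTime T once every shard has reported an optime of at least T. A minimal sketch (illustrative structure, not actual mongos code):

```python
def returnable_events(buffered_events, shard_optimes):
    """Return the buffered events that are safe to emit, given the latest
    optime reported by each shard in the cluster."""
    # No event may be returned until every shard has advanced past its time.
    low_water_mark = min(shard_optimes.values())
    return [e for e in sorted(buffered_events, key=lambda e: e["clusterTime"])
            if e["clusterTime"] <= low_water_mark]

# Shard A saw the insert at T=100, but shard B has only reported T=95,
# so the event is held back:
events = [{"clusterTime": 100, "op": "insert"}]
assert returnable_events(events, {"A": 100, "B": 95}) == []

# Once B's periodic no-op advances it past T, the event is released:
assert returnable_events(events, {"A": 100, "B": 105}) == events
```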

Comment by Ross Lawley [ 05/Feb/19 ]

Interestingly bernard.gorman, I'm seeing a different result in the shell:

mongos> let cmdRes = db.runCommand({aggregate: "testing", pipeline: [{$changeStream: {}}, {$project: {_id: 0}}], cursor: {}})
mongos> db.runCommand({getMore: cmdRes.cursor.id, collection: "testing"})
{
	"cursor" : {
		"id" : NumberLong("1082300309264392010"),
		"ns" : "test.testing",
		"nextBatch" : [ ],
		"postBatchResumeToken" : {
			"_data" : "825C59869A0000000C2B0229296E5A100477D471CB7C534B468B7ED6D9235AB46704"
		}
	},
	"ok" : 1,
	"operationTime" : Timestamp(1549371034, 12),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1549371034, 12),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
}
mongos> db.testing.insert({a: 1})
WriteResult({ "nInserted" : 1 })
mongos> db.runCommand({getMore: cmdRes.cursor.id, collection: "testing"})
{
	"cursor" : {
		"id" : NumberLong("1082300309264392010"),
		"ns" : "test.testing",
		"nextBatch" : [ ],
		"postBatchResumeToken" : {
			"_data" : "825C59869A0000000C2B0229296E5A100477D471CB7C534B468B7ED6D9235AB46704"
		}
	},
	"ok" : 1,
	"operationTime" : Timestamp(1549371038, 1),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1549371038, 1),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
}
mongos> 

Tested on: mongodb-linux-x86_64-ubuntu1804-4.1.7-231-g95c5d57d8a

Is this a regression or am I using an older version than you?

Comment by Ross Lawley [ 05/Feb/19 ]

Hi bernard.gorman,

Thanks for the follow-up. In our driver test we also still see nothing back from the getMore on mongos even though new documents are added. However, when the same test is run against a replica set it throws the server error.

From your test case in the shell it looks like it should work as expected; I'll investigate further and see if I can determine the cause in the driver.

Ross

Comment by Bernard Gorman [ 05/Feb/19 ]

Hi ross.lawley,

No, it's not necessary for the driver to set awaitData on the command. The reason you're not seeing an exception is because the validation is performed as change stream events are pulled through the pipeline, not at parse-time. In your example above, the first batch that you received from mongoS is empty because there have not been any events yet, and therefore nothing to validate.

Here's an example of the validation in action:

mongos> let cmdRes = db.runCommand({aggregate: "testing", pipeline: [{$changeStream: {}}, {$project: {_id: "replaced"}}], cursor: {}})
mongos> db.runCommand({getMore: cmdRes.cursor.id, collection: "testing"})
{
	"cursor" : {
		"id" : NumberLong("8847843243988352523"),
		"ns" : "test.testing",
		"nextBatch" : [ ],
		"postBatchResumeToken" : {
			"_data" : "825C597315000000022B0229296E5A1004098E342548A14A5C8D85C3DF5491ABB604"
		}
	},
	"ok" : 1,
	"operationTime" : Timestamp(1549366041, 1),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1549366041, 1),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
}
 
mongos> db.testing.insert({a: 1})
WriteResult({ "nInserted" : 1 })
 
mongos> db.runCommand({getMore: cmdRes.cursor.id, collection: "testing"})
{
	"ok" : 0,
	"errmsg" : "Encountered an event whose _id field, which contains the resume token, was modified by the pipeline. Modifying the _id field of an event makes it impossible to resume the stream from that point. Only transformations that retain the unmodified _id field are allowed. Expected: { _id: { _data: \"825C597324000000012B022C0100296E5A1004098E342548A14A5C8D85C3DF5491ABB646645F696400645C597324F51E87EBD01701EE0004\" } } but found: { _id: \"replaced\" }",
	"code" : 51060,
	"codeName" : "Location51060",
	"operationTime" : Timestamp(1549366057, 1),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1549366057, 1),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
}

Comment by Ross Lawley [ 05/Feb/19 ]

bernard.gorman I can see the assertion when using the Java driver with a replica set; however, when using a mongos it returns a document back to the driver rather than throwing the same assertion.

{firstBatch: [], postBatchResumeToken: {"_data": "825C5965E6000000112B0229296E5A10047DB5A4C442D64D7D9929166334AB4FFD04"}, 
id: 1764347339596097463, ns: "JavaDriverTest.com.mongodb.operation.ChangeStreamOperationSpecification"}

Looking at the git commit there is a check to see if the command has set tailable await.

    // If this is not a change stream pipeline, we have nothing to do except return the BSONObj.
    if (!_mergePipeline->getContext()->isTailableAwaitData()) {
        return event.toBson();
    }

The Java driver does not add awaitData: true to the aggregate command that is sent via the change stream operation. It's unclear whether this is a requirement of the Change Stream specification.

Previously, the server returned documents without the _id field, rather than an empty batch, and the driver's validation ensured that the _id field was present.

Can you clarify whether awaitData: true is required in the command for change streams? I believe other drivers may not set the value and would then not trigger the error.

Comment by Githook User [ 04/Feb/19 ]

Author:

{'name': 'Ross Lawley', 'email': 'ross.lawley@gmail.com', 'username': 'rozza'}

Message: Fix changestream _id projection test

As of SERVER-37786 the server now rejects the removal of the _id
Branch: master
https://github.com/mongodb/mongo-java-driver/commit/7a8c4580e6bde45beabe0e3e308658f275e38e05

Comment by Githook User [ 03/Feb/19 ]

Author:

{'name': 'Bernard Gorman', 'email': 'bernard.gorman@gmail.com', 'username': 'gormanb'}

Message: SERVER-37786 Reject change stream pipelines which modify or project-out the resume token
Branch: master
https://github.com/mongodb/mongo/commit/065f3ef77de57609d92fce482c1e4e36b15cf29c

Generated at Thu Feb 08 04:47:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.