[SERVER-34818] changeStream option startAtClusterTime should be a raw BSON Timestamp Created: 02/May/18  Updated: 29/Oct/23  Resolved: 17/May/18

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: 4.0.0-rc0

Type: Task Priority: Major - P3
Reporter: Daniel Aprahamian (Inactive) Assignee: Charlie Swanson
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
is documented by DOCS-11715 Docs for SERVER-34818: changeStream o... Closed
Related
Backwards Compatibility: Fully Compatible
Participants:

 Description   

In 4.0, we will be allowing users to specify a startAtClusterTime option. Currently, the server expects the option to be a BSON document of the form:

{ ts: Timestamp }

This naming not intuitive, as a clusterTime object (as defined in server response $clusterTime.clusterTime) does not conform to this format.

We should either modify the startAtClusterTime field to expect a raw BSON timestamp, or choose a more intuitive name for the field.



 Comments   
Comment by Githook User [ 17/May/18 ]

Author:

{'email': 'charlie.swanson@mongodb.com', 'username': 'cswanson310', 'name': 'Charlie Swanson'}

Message: SERVER-34818 Rename startAtClusterTime to startAtOperationTime

Also changes the type of the argument from an object with a single 'ts'
field to just a Timestamp.
Branch: master
https://github.com/mongodb/mongo/commit/664e1d9b01dccddeec072b7746d7d4c62931716d

Comment by Spencer Brody (Inactive) [ 07/May/18 ]

We need to enable Cloud Automation to upgrade from 3.7.5 to 4.0.0 without needing to re-run setFCV.  We don't need to guarantee that applications written against 3.7.5 can upgrade to 4.0.0 without a code change - that is, we're still allowed to change our public APIs up until at least RC0 (and probably could for GA if it was important enough).

Comment by Jeremy Mikola [ 07/May/18 ]

I'm not sure if this is too late - it'd certainly take some additional server effort to change the parser, and we'd likely have to support both formats because we want to support upgrade from 3.7.5 to 4.0.0 (I've been told). My hunch is yes unless you feel strongly about this.

This just stood out as a minor API inconsistency worth addressing while we still can. I understand if there are insufficient resources to address this for the 4.0 release.

I'm not sure how this relates to supporting an upgrade path from 3.7.5 to 4.0.0, since we don't document the $changeStream operator or expect users to run it directly in their own aggregation pipelines – unless you referring to beta drivers that have already implemented a watch() helper (or perhaps ensuring the 3.7.5 shell can be used against a 4.0.0 server).

Comment by Misha Tyulenev [ 04/May/18 ]

As long as this value will not cause  the clusterTime to advance validation is not needed. In the case of afterClusterTime the validation is needed because otherwise it will hang waiting for read concern if the value is chosen far in the future.

Comment by Charlie Swanson [ 04/May/18 ]

Oh, also - jmikola's guess of where this time will come from seems good to me, though asya might know more - I seem to recall some customers were asking for this, independent of our need for it for resuming a stream which hasn't seen anything.

Comment by Charlie Swanson [ 04/May/18 ]

jmikola I'm not sure if this is too late - it'd certainly take some additional server effort to change the parser, and we'd likely have to support both formats because we want to support upgrade from 3.7.5 to 4.0.0 (I've been told). My hunch is yes unless you feel strongly about this.

misha.tyulenev why should we validate that? We chose not to, since the only thing that would happen is that your stream would get no notifications. We don't use this value to modify the system's clock at all. We thought about adding an assert though.

Comment by Misha Tyulenev [ 04/May/18 ]

Note that the server should validate that the user provided time is not greater than the clusterTime.

Comment by Jeremy Mikola [ 04/May/18 ]

To Jeremy's point, the "ts" field is absolutely superfluous, but we decided to use an object before because we weren't sure if a Timestamp would always be sufficient as a way of representing a cluster time (that project was in-flight at the time of the original '$_resumeWithClusterTime' option which we renamed here to 'startAtClusterTime').

Given what we now know, is there any reason to keep the "ts" field in 4.0 or does this fall under "too late to change it now"? Looking at the current implementation of mongodb/specifications#311 for SPEC-1057, the drivers will accept a BSON timestamp as an option. I suppose drivers can simply wrap the BSON timestamp (be it from the session or a user-provided value) in a "ts" field, but I do think it'd be preferable to clean up the $changeStream API if possible.

Finally, One point in response to Jeremy's concern is that we might expect a user to provide a timestamp gathered from the oplog (e.g. maybe they use 'the oldest thing in the oplog' or 'the time from the operation right before I created that collection').

This leads me to believe that any user-supplied value would come from one of three places: Session.getOperationTime(), an op log document (if users are iterating manually), or be a manually constructed BSON timestamp (likely a timestamp integer with an increment of zero). We can rule out Session.getClusterTime() because we tell users those are opaque documents and they generally shouldn't extract its "clusterTime" field on their own.

Comment by Charlie Swanson [ 03/May/18 ]

Hello all! I hope I can provide some clarification here. It's my understanding that an 'operationTime' is one particular cluster time. That is, out of all the values of Timestamps that have been used as a "cluster time" (returned with $clusterTime), an operationTime has to be one of them. I would have proposed 'startAtTime' for the name of this option, but I wanted to make it clear that it was not an ISODate or something like that. 'startAtOperationTime' seems fine, but logically you should be able to start at any of the cluster times, so 'startAtClusterTime' seemed broader to me.

I think that using either the 'operationTime' or the 'clusterTime' for this is fine from a correctness point of view, so if Misha has objections to looking within the '$clusterTime' field then using operationTime is good.

To Jeremy's point, the "ts" field is absolutely superfluous, but we decided to use an object before because we weren't sure if a Timestamp would always be sufficient as a way of representing a cluster time (that project was in-flight at the time of the original '$_resumeWithClusterTime' option which we renamed here to 'startAtClusterTime').

To Jeff's question, the semantics are to include any operations that happened exactly at the provided time. If I recall the decision making processes, it didn't seem to matter much so we went with 'include it'.

Finally, One point in response to Jeremy's concern is that we might expect a user to provide a timestamp gathered from the oplog (e.g. maybe they use 'the oldest thing in the oplog' or 'the time from the operation right before I created that collection').

Let me know if I missed anything.

Comment by Misha Tyulenev [ 03/May/18 ]

the server would then expect to access the "clusterTime" property of this opaque document and disregard any "signature" field (just as advanceClusterTime() does)

Just to clarify: the purpose of the signature is to validate that the clusterTime is in fact generated by the server to prevent malicious attacks where server's oplog optime is driven to the max and then server cant take any writes. Hence its not ignored but can be bypassed in some cases for performance.

Comment by Jeremy Mikola [ 03/May/18 ]

I think what the startAtClusterTime field should accept is an operationTime:

coll.aggregate([{'$changeStream': {'startAtClusterTime': {'ts': session.operation_time}}}])

I'd argue that the "ts" field is superfluous. As a user-facing API, it makes sense that the $changeStream option should take one of the following two types:

  • An opaque cluster time document ("blackbox" per Misha's last comment), which is accessible as the clusterTime property of a session and also accepted by the Session::advanceClustertime() method. In this case, the server would then expect to access the "clusterTime" property of this opaque document and disregard any "signature" field (just as advanceClusterTime() does).
  • An operation time BSON timestamp, which is accessible as the operationTime property of a session and also accepted by the Session::advanceOperationTime() method.

This all depends on how we expect users to provide this value. If we expect users to craft their own BSON timestamps or supply one from Session.operationTime, accepting a BSON timestamp type directly make sense. If we only expect users to feed in values from a Session's clusterTime property, than the opaque cluster time document makes sense.

I'd also like clarification on the semantics of this field. Is it similar to resumeAfter in that it includes only changes that occurred after the given timestamp? If so, use the word after in the field name rather than at, as at implies that it includes changes that occurred at a time equal to the given timestamp.

I concur. When a few of us talked this through yesterday I understood that startAtClusterTime was basically a stand-in for a resume token when a change stream has not yet seen any change documents (and thus hasn't yet captured a resume token). I think "after" makes more sense in that context.

Comment by Misha Tyulenev [ 03/May/18 ]

I stick to my suggestion that drivers should treat $clusterTime as a blackbox object and only pass it so we will have more flexibility to make server changes without backwards compatibility dependencies. I think charlie.swanson can address change stream semantics. My take on all the afterClusterTime field variants (startAtClusterTime, atClusterTime etc) - use operationTime derived values as it was added to indicate when the last command has been performed so in the case of afterClusterTime the secondary or primary on other shard wait only as long as needed. 

Comment by Jeffrey Yemin [ 03/May/18 ]

I'd also like clarification on the semantics of this field. Is it similar to resumeAfter in that it includes only changes that occurred after the given timestamp? If so, use the word after in the field name rather than at, as at implies that it includes changes that occurred at a time equal to the given timestamp.

Comment by Shane Harvey [ 03/May/18 ]

Copying my suggestions from SPEC-1057.

In SERVER-31916, misha.tyulenev said that he does not want drivers peeking inside the contents of the $clusterTime document. Has that changed?

I think what the startAtClusterTime field should accept is an operationTime:

>>> coll.aggregate([{'$changeStream': {'startAtClusterTime': {'ts': session.operation_time}}}])
<pymongo.command_cursor.CommandCursor object at 0x1035e1310>

Instead of a $clusterTime:

>>> coll.aggregate([{'$changeStream': {'startAtClusterTime': session.cluster_time}}])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/shane/Library/Python/2.7/lib/python/site-packages/pymongo/collection.py", line 2181, in aggregate
    **kwargs)
  File "/Users/shane/Library/Python/2.7/lib/python/site-packages/pymongo/collection.py", line 2088, in _aggregate
    client=self.__database.client)
  File "/Users/shane/Library/Python/2.7/lib/python/site-packages/pymongo/pool.py", line 496, in command
    collation=collation)
  File "/Users/shane/Library/Python/2.7/lib/python/site-packages/pymongo/network.py", line 125, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/Users/shane/Library/Python/2.7/lib/python/site-packages/pymongo/helpers.py", line 146, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: BSON field '$changeStream.startAtClusterTime.clusterTime' is an unknown field.

I propose we rename the field to startAtOperationTime since it does not accept a cluster time. I think this makes sense because the operationTime field is already documented as corresponding to the timestamp of an oplog entry: https://docs.mongodb.com/manual/reference/method/db.runCommand/#command-response

Generated at Thu Feb 08 04:37:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.