-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: 3.6.0
-
Component/s: Aggregation Framework
-
None
-
Fully Compatible
-
ALL
-
v3.6
-
Query 2018-01-15, Query 2018-01-29
When resuming a change stream, we first need to make sure that the oplog has enough history to allow a resume. To do this, we make sure that the first thing in the stream is the resume token we expect. This works correctly in an unsharded environment, because we will start the oplog query with a ts: {$gte: <resume ts>} query, and each change's timestamp will be unique, so the first thing that comes back should be the change we're looking to resume after (if it's still possible to resume).
Things are a bit different in a sharded scenario. To start, we don't know which shard is going to have the resume token, so we need to perform the check against the first document after merging the results from each shard. Secondly, the timestamp of each change is not guaranteed to be unique in a sharded cluster, two changes can happen 'simultaneously' (with the same timestamp) on two different shards. In situations like this, it's possible and legal that the first thing in the change stream will not be the resume token we're looking for, but rather a change that preceded the change we're resuming after and happened to have the same timestamp.
To account for this, the logic that checks if we are able to resume (inside DocumentSourceEnsureResumeTokenPresent::getNext()) should allow arbitrarily many changes to occur before the resume token iff they have the same timestamp as the resume token. As soon as we see the resume token itself we know it's possible to resume, or if we see a change with a higher timestamp we know it's not possible to resume.