[SERVER-58473] Change stream startup with resumeToken uses COLLSCAN and is slow Created: 08/Jul/21 Updated: 27/Oct/23 Resolved: 08/Aug/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Question | Priority: | Minor - P4 |
| Reporter: | David Zhu | Assignee: | Bernard Gorman |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Sprint: | QE 2021-08-09, QE 2021-08-23 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
When resuming a change stream, expensive queries are issued against the oplog using a `COLLSCAN`. Given that the oplog is naturally sorted, and given that the resume token encodes time information, why does Mongo need to start from the top of the oplog to find the change event document? Shouldn't a search on this sorted list be `O(log(n))`? My understanding is that the oplog exists mainly for replication, and there are no indexes on it. However, it does follow a sort order, so it seems Mongo could take advantage of this for faster lookups of the resume token. Is this just a limitation of the existing oplog architecture? Could this be supported in the future? Perhaps what would help me understand better is: how are change stream `_id`s derived? Is the main reason for this limitation that we don't have `_id`s on the oplog? Even if that's the case, could we first identify the timestamp and then use that to find the respective op? |
| Comments |
| Comment by Bernard Gorman [ 08/Aug/21 ] | |||
|
Hi david@vanta.com, What you have described above is, in fact, exactly what change streams do when resuming from either a resume token or via startAtOperationTime - they will jump directly to that starting point in the oplog and begin scanning forward in time from there. The COLLSCAN simply indicates that the query is not using an index; as you rightly point out, the oplog does not have any indexes, in order to avoid write amplification during replication, but it is implicitly sorted by ts. The COLLSCAN does not, however, mean that the scan starts from the beginning of the oplog. You can confirm this for yourself by issuing an explain on a change stream:
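A minimal sketch of such an explain invocation in mongosh (the database name, collection name, and token value here are illustrative placeholders, not taken from the original ticket):

```javascript
// Hypothetical mongosh example: explain a change stream resumed from a token.
// "test.orders" and "<RESUME_TOKEN>" are placeholders for illustration only.
db.getSiblingDB("test").orders.explain().aggregate([
  { $changeStream: { resumeAfter: { _data: "<RESUME_TOKEN>" } } }
]);
```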
If you examine the output, you will see that the first stage of the pipeline is a $cursor stage with namespace local.oplog.rs. You will further observe a field in this stage which looks something like the following:
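As an illustration, the relevant portion of the $cursor stage might look roughly like the following (a hypothetical excerpt; the exact field layout varies by server version, and the Timestamp value is invented):

```javascript
// Hypothetical explain excerpt showing the oplog scan bound.
"$cursor": {
  "queryPlanner": {
    "namespace": "local.oplog.rs",
    "winningPlan": {
      "stage": "COLLSCAN",
      "minTs": Timestamp(1625700000, 1),
      "direction": "forward"
    }
  }
}
```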
The minTs field indicates that the time-based optimization is being used for this query, and that we skipped directly to this timestamp before continuing our forward scan. Any query on the oplog which includes a $gt, $gte, or equality predicate on ts will take advantage of this optimization. There is also a similar maxTs optimization for $lt and $lte predicates. Of course, if you are starting a change stream from a point in the very distant past of the oplog, then the stream can only skip a relatively small percentage of the total documents before scanning through the remainder. To guard against this, you should ensure that your application retrieves the latest resume token from the driver on every batch. Even if no new results are returned and the batch is empty, the resume token will continue to advance to reflect the latest point in the oplog, meaning that your resume point should always be near the end. Hope this helps! Best regards, | |||
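As a sketch of the advice above, a Node.js driver loop might persist the newest token on each iteration so a later resume starts near the end of the oplog. Here `handleChange` and `saveCheckpoint` are hypothetical application helpers, not driver APIs:

```javascript
// Hypothetical Node.js driver sketch: checkpoint the resume token as the
// stream advances, so a restart resumes from a recent point in the oplog.
const { MongoClient } = require("mongodb");

async function watchOrders() {
  const client = await MongoClient.connect("mongodb://localhost:27017");
  const stream = client.db("test").collection("orders").watch();
  try {
    while (await stream.hasNext()) {
      const change = await stream.next();
      await handleChange(change);               // hypothetical app handler
      await saveCheckpoint(stream.resumeToken); // hypothetical persistence
    }
  } finally {
    await stream.close();
    await client.close();
  }
}
```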
| Comment by Joseph Dani [ 13/Jul/21 ] | |||
|
Hi david@vanta.com - The Data Architecture (DARCH) team handles questions for our internal data platform. I believe this question is meant for the Product team, so I am moving it over to the SERVER Jira project. |