[SERVER-55576] Optimize queries on time-series collections which request the most recent value
Created: 26/Mar/21  Updated: 09/Jun/21  Resolved: 09/Jun/21

Status: Closed
Project: Core Server
Component/s: Query Planning
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Charlie Swanson Assignee: Ruslan Abdulkhalikov (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File g1-1.png    
Issue Links:
Related
is related to SERVER-55106 Map predicates on max time to a porti... Closed
is related to SERVER-4507 aggregation: optimize $group to take... Backlog
is related to SERVER-9507 Optimize $sort+$group+$first pipeline... Closed
is related to SERVER-37304 Extend $sort+$group+$first pipeline o... Closed
Sprint: Query Optimization 2021-04-05, Query Optimization 2021-04-19, Query Optimization 2021-05-03, Query Optimization 2021-05-17, Query Optimization 2021-05-31, Query Optimization 2021-06-14
Participants:

 Description   

There are a couple of ways to write a similar query by flipping the sort direction and swapping $first/$last, but consider this example:

db.my_timeseries.aggregate([
  {$sort: {ts: 1}},
  {$group: {_id: "$meta.x", most_recent_foo: {$last: "$foo"}}}
])
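
To illustrate the "flipped" formulation mentioned above, here is a minimal sketch that writes both pipelines out and checks on a few made-up documents that they pick the same value per group. The runPipeline helper and sampleDocs are hypothetical, in-memory stand-ins that only understand these two stages; they are not part of the server or the shell. The sample uses distinct ts values; with ties on ts the two forms could pick different documents.

```javascript
// The pipeline from the example above, and the same query with the sort
// direction flipped and $last swapped for $first.
const sortAscThenLast = [
  {$sort: {ts: 1}},
  {$group: {_id: "$meta.x", most_recent_foo: {$last: "$foo"}}},
];
const sortDescThenFirst = [
  {$sort: {ts: -1}},
  {$group: {_id: "$meta.x", most_recent_foo: {$first: "$foo"}}},
];

// Hypothetical in-memory evaluator for exactly these two stages.
// It sorts on ts, then keeps the first or last foo seen per meta.x.
function runPipeline(docs, [sortStage, groupStage]) {
  const direction = sortStage.$sort.ts; // 1 (ascending) or -1 (descending)
  const useLast = "$last" in groupStage.$group.most_recent_foo;
  const sorted = [...docs].sort((a, b) => direction * (a.ts - b.ts));
  const groups = new Map();
  for (const doc of sorted) {
    const key = doc.meta.x;
    // $last overwrites on every document; $first only sets the key once.
    if (useLast || !groups.has(key)) {
      groups.set(key, doc.foo);
    }
  }
  return Object.fromEntries(groups);
}

// Made-up sample data with distinct timestamps.
const sampleDocs = [
  {ts: 1, meta: {x: "a"}, foo: 10},
  {ts: 2, meta: {x: "b"}, foo: 20},
  {ts: 3, meta: {x: "a"}, foo: 30},
  {ts: 4, meta: {x: "b"}, foo: 40},
];

const viaLast = runPipeline(sampleDocs, sortAscThenLast);
const viaFirst = runPipeline(sampleDocs, sortDescThenFirst);
```

In the shell, either array could be passed directly to db.my_timeseries.aggregate(...).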

This is the same pattern as described in SERVER-9507, and this query should be able to use something like a DISTINCT_SCAN if one of the following indexes exists:

{ts: +/-1, meta: +/-1}
{meta: +/-1, ts: +/-1}
{_id: +/-1, meta: +/-1}
{meta: +/-1, _id: +/-1}
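
For readers unfamiliar with the +/-1 shorthand, the following sketch expands each of the four shapes above into concrete key patterns (one per combination of directions). The names candidateShapes and expandDirections are made up for illustration; this only spells out the list above and does not reflect how the planner decides eligibility.

```javascript
// The four index shapes listed above, as ordered field lists.
const candidateShapes = [
  ["ts", "meta"],
  ["meta", "ts"],
  ["_id", "meta"],
  ["meta", "_id"],
];

// Expand one shape into every combination of ascending (1) and
// descending (-1) directions, preserving field order.
function expandDirections(fields) {
  let patterns = [{}];
  for (const field of fields) {
    patterns = patterns.flatMap((p) =>
      [1, -1].map((dir) => ({...p, [field]: dir}))
    );
  }
  return patterns;
}

// 4 shapes x 4 direction combinations = 16 concrete key patterns.
const candidateKeyPatterns = candidateShapes.flatMap((shape) =>
  expandDirections(shape)
);
```

Any one of these patterns, for example {meta: 1, ts: -1}, could be created in the shell with db.my_timeseries.createIndex(...).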

The last two would probably be challenging to implement and would require analysis similar to SERVER-55106 in order to translate the "ts" predicate/scan into something on _id. It may not even be possible.

Alternatively, we could consider transforming the query not into a distinct scan but into a reverse _id scan, which ensures that whatever we find first is the most recent, combined with a streaming $group implementation (SERVER-4507). This is generally a hard optimization to perform, but within the context of a time-series collection it might be easier to prove that it is correct.
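
As a rough illustration of the reverse-scan idea, the sketch below consumes documents that are assumed to arrive newest first and emits a group result the first time each key is seen. The function name streamMostRecent and the sample data are hypothetical, and the assumption that the input really is in descending ts order is exactly the part that would have to be proven for a time-series collection; this is not the server's implementation.

```javascript
// Emit {_id, most_recent_foo} as soon as a group key is first seen.
// ASSUMPTION: docsNewestFirst is already ordered by ts descending, as a
// reverse scan is hoped to provide. The function does not verify this.
function* streamMostRecent(docsNewestFirst) {
  const seen = new Set();
  for (const doc of docsNewestFirst) {
    const key = doc.meta.x;
    if (seen.has(key)) {
      continue; // an older document for a group that was already emitted
    }
    seen.add(key);
    yield {_id: key, most_recent_foo: doc.foo};
  }
}

// Made-up documents, newest first.
const newestFirst = [
  {ts: 4, meta: {x: "b"}, foo: 40},
  {ts: 3, meta: {x: "a"}, foo: 30},
  {ts: 2, meta: {x: "b"}, foo: 20},
  {ts: 1, meta: {x: "a"}, foo: 10},
];

const streamed = [...streamMostRecent(newestFirst)];
```

Because each result is produced on first sight of its key, a consumer that knows how many groups exist could stop the scan early instead of reading the whole collection.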

