[SERVER-32220] Allow index use for all aggregation pipeline $match and possibly $filter stages instead of just the first stage. Created: 08/Dec/17  Updated: 11/Jan/18  Resolved: 11/Dec/17

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Andrew Harris Assignee: Mark Agarunov
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

I'm not sure if this is possible, but as I understand it, an index is only used for the first $match stage of a pipeline, and every $match stage after that falls back to a collection scan.

If possible, it might be beneficial to use indexes for all $match stages, and possibly for $filter expressions too, as they can be quite powerful in their own right.



 Comments   
Comment by Andrew Harris [ 20/Dec/17 ]

Thanks for that, Mark. I guess it is really a question about performance, and more specifically about handling bucketed data. In our application, the first $match is needed to find the relevant buckets. After some projection (including a $filter), a $unwind splits the remaining array of embedded documents into distinct documents, which are then filtered again using $match. Depending on the parameters, this could very well result in a secondary $match against many thousands of "stage" documents (the output of the $unwind), so I had concerns about its performance and was looking for ways to optimise it.
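
For readers following the thread, the pipeline shape described above might look roughly like the sketch below. It is written as a plain Python list of stage documents, the form a driver such as PyMongo accepts in `collection.aggregate(pipeline)`. The field names (`device_id`, `bucket_start`, `bucket_end`, `readings`, `ts`, `value`) and the helper `build_bucket_pipeline` are assumptions made for illustration; the actual schema is not given in this ticket.

```python
from datetime import datetime


def build_bucket_pipeline(device_id, start, end):
    """Build a bucket-pattern aggregation pipeline as a list of stage dicts.

    All field names here are hypothetical; adapt them to the real schema.
    """
    # Condition applied to each array element "r" inside a bucket document.
    in_range = {
        "$and": [
            {"$gte": ["$$r.ts", start]},
            {"$lt": ["$$r.ts", end]},
        ]
    }
    return [
        # First $match: runs against the collection, so it is the stage
        # that can be served by an index.
        {"$match": {
            "device_id": device_id,
            "bucket_start": {"$lt": end},
            "bucket_end": {"$gt": start},
        }},
        # Projection with $filter: trims each bucket's array before unwinding.
        {"$project": {
            "device_id": 1,
            "readings": {
                "$filter": {"input": "$readings", "as": "r", "cond": in_range}
            },
        }},
        # $unwind: one output document per remaining array element.
        {"$unwind": "$readings"},
        # Second $match: evaluated against the $unwind output, not the collection.
        {"$match": {"readings.value": {"$gt": 0}}},
    ]


pipeline = build_bucket_pipeline("d1", datetime(2017, 12, 1), datetime(2017, 12, 2))
print([next(iter(stage)) for stage in pipeline])
```

In a shape like this, the more the first $match and the $filter narrow things down, the fewer documents the second $match has to examine.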

Comment by Mark Agarunov [ 11/Dec/17 ]

Hello aharris,

Thank you for the report. You are correct that two $match stages with a different stage between them would only use the index on the first $match. However, a $match immediately followed by another $match would use the index for both stages. The reason for this behavior is that the structure of the data cannot be known ahead of time, because other stages in the pipeline could modify it, so there would be no way to keep an index of that data. Additionally, the collection scan you see is not performed on the collection itself, but on the output of the previous pipeline stage. Only the initial $match is run against the collection itself, and it should use an index scan if one is available.

Thanks,
Mark
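
One way to picture the difference between adjacent and separated $match stages is the simplified model below. The helper `coalesce_matches` is hypothetical and is not the server's implementation. It only illustrates the idea that back-to-back $match stages can be treated as a single combined predicate, while a $match that follows another stage stays separate and sees that stage's output.

```python
def coalesce_matches(pipeline):
    """Merge runs of adjacent $match stages into one $match using $and.

    Illustrative model only; the real optimizer has its own rules.
    """
    merged = []
    for stage in pipeline:
        prev = merged[-1] if merged else None
        if "$match" in stage and prev is not None and "$match" in prev:
            prev_cond = prev["$match"]
            if set(prev_cond) == {"$and"}:
                # Previous stage is already an $and: extend a copy of its list.
                conds = prev_cond["$and"] + [stage["$match"]]
            else:
                conds = [prev_cond, stage["$match"]]
            # Replace rather than mutate, so the input pipeline is untouched.
            merged[-1] = {"$match": {"$and": conds}}
        else:
            merged.append(stage)
    return merged


adjacent = [{"$match": {"a": 1}}, {"$match": {"b": 2}}]
separated = [{"$match": {"a": 1}}, {"$unwind": "$items"}, {"$match": {"items.b": 2}}]

print(coalesce_matches(adjacent))   # [{'$match': {'$and': [{'a': 1}, {'b': 2}]}}]
print(len(coalesce_matches(separated)))  # 3
```

In the second pipeline the trailing $match remains its own stage, which matches the behavior described above: it filters the documents produced by $unwind rather than the collection.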
