[SERVER-84974] Investigate $geoNear blocking sort Created: 29/Jun/21 Updated: 12/Jan/24 Resolved: 22/Jul/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | David Percy | Assignee: | David Percy |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Sprint: | Query Optimization 2021-07-12, Query Optimization 2021-07-26 | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
We think $geoNear already has to do some in-memory sorting. (It scans progressively larger annuli from the index, but it must sort the points within each annulus.) Is this something we can expose easily, to allow $geoNear to run anywhere in a pipeline? |
| Comments |
| Comment by David Storch [ 19/Jul/21 ] |
Ah yeah, I was talking about the match expressions. You are right that the agg stage does not have the same problem. One side note: I'm pretty sure that $geoNear is also a MatchExpression and is a synonym for $near/$nearSphere. It shows up here in the MatchExpression parser.
That makes sense to me. Internally geoNear is all about the actual runtime algorithm. But from the user's perspective, we should be able to convert it into a regular geo match expression followed by a sort. |
| Comment by David Percy [ 09/Jul/21 ] |
|
Thanks for the feedback! It sounds like blocking or top-k sort is a plausible query plan, but the question is whether we should express it with a separate $sort stage or not.
I think you're talking about the $near and $nearSphere match expressions here? It's confusing because internally they're spelled GEO_NEAR. I agree it's weird for a MatchExpression to imply a sort. By $geoNear I meant the $geoNear aggregation stage, so it's not weird in the same way. Luckily $near and $nearSphere are not allowed in a $match stage. Your suggestion about $geoWithin + $sort is interesting. I think we still want the user to be able to write $geoNear (to make time-series collections act as much like a regular collection as we can), so maybe we would rewrite a $geoNear to $geoWithin + $sort if we can't push it down. |
| Comment by David Storch [ 09/Jul/21 ] |
|
david.percy I don't know the exact history around why $geoNear was implemented in such a special way. What I can say is that the $geoNear operator has always been super tied to the specific algorithm used to implement it. Making it more of a logical operation rather than a way for the user to say "search ever-larger annuli and sort by distance from the center within each annulus" would be nice, but I'm unsure if that's the easiest way for us to satisfy the use case. It's always been kinda weird that $geoNear is a match expression but implies a sort... I know that we have discussed something like SERVER-37043 in the past, where we extend $geoWithin to be able to express ring shapes. And then we would have to fully expose a way to explicitly sort in the query by the geo distance metadata. That might prove to be an easier solution to implement, and it would get us away from the awkward match+sort semantics of $geoNear, as well as having to change the current position requirements around $geoNear. |
| Comment by David Percy [ 07/Jul/21 ] |
|
david.storch do you have any thoughts on this? I'm assuming that the reason we never implemented this, is that sorting the whole collection would be a bad query plan. And maybe geo collections tend to be big. So we decided it's better to raise an error rather than pick such a bad plan. Is this right? Were there other reasons? If the pipeline has a limit then we could do a top-k sort. And if the $geoNear has a maxDistance then a blocking sort might not be so bad, if the number of documents is small. I'm interested in this because I think it would be much simpler than a progressive sort, for time-series measurements. I'd like $geoNear to be a normal stage, like $sort, that can appear anywhere in a pipeline. Then we can do rewrites like: The $geoIntersects would use an index to pick only a few buckets, then the $geoNear would do a (hopefully small) blocking sort. |