[SERVER-55612] Improve handling of $geonear+$or when all $or clauses involve the same $near key Created: 30/Mar/21  Updated: 03/Feb/22  Resolved: 03/Feb/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.0
Fix Version/s: None

Type: Improvement Priority: Trivial - P5
Reporter: Al Han Assignee: Svilen Mihaylov (Inactive)
Resolution: Won't Fix Votes: 0
Labels: query-director-triage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongodb-explain-1.txt    
Issue Links:
Duplicate
duplicates SERVER-13974 Support $near in $or Closed
Related
related to SERVER-48120 Optimize a $sort after a $unionWith Backlog
Participants:

 Description   

I'm using MongoDB 4.4 pipeline aggregate to retrieve documents sorted by distance from a collection of around 1.5M documents. While I'm only retrieving 12 documents at the time, I CANNOT limit the maxDistance used as I'm sorting and paging results.

The following query is very slow (unless the second argument in the $or sub-query is removed). The culprit is clearly the $or sub-query. Running the $or sub-query independently yields very fast results so I suspect it has to do with the combination of $geonear and $or.

I have created indices created for location, reference and _id of course. I also have composite indices for all three

db.mydocs.aggregate([{ "$geoNear": { "query": { "$or": [{ "_id": { "$in": ["2cf98ef9-419b-4e77-b556-631837d429dc"] } }, { "reference": "6013eb0ab006f82c6bf54003" } ] }, "maxDistance": 3.1357118857747963, "distanceMultiplier": 6378.137, "near": [-0.12775, 51.50735], "spherical": true, "distanceField": "distance", "key": "location" } }, { "$skip": 0 }, { "$limit": 12 } ])

I cannot understand why this query is slow when I have created the appropriate indexes. Using explain() shows that a composite index is being used but the query is still too slow.

This was also report on https://stackoverflow.com/questions/66576357/mongodb-aggregate-geonear-function-very-slow-when-using-with-or-sub-query

 



 Comments   
Comment by David Percy [ 03/Feb/22 ]

As Eric pointed out, theoretically we could create two index scans and merge sort them by geo distance. But I think this would be hard to do, because $near and $nearSphere are treated as such a special case. (And semantically they really are a special case, because they look like a filter, but are actually filter+sort.)

As a workaround, you could use $geoNear and $unionWith to get two index scans, and combine the results with $sort:

db.mydocs.aggregate([
  {$geoNear: { ... query: {_id: ...} ...}},
  {$unionWith: {
    coll: 'mydocs',
    pipeline: [
      {$geoNear: { ... query: {reference: ...} ...}},
    ]
  }},
  {$sort: {distance: 1}},
])

Today this will do a blocking sort, but if the predicates on '_id' and 'reference' are very selective then that may be fine. In the future we could optimize this to a merge sort: SERVER-48120. Also, if you have indexes on {_id: 1, location: '2dsphere'} and {reference: 1, location: '2dsphere'}, with the equality predicate before the geo range predicate, then each index scan should have good bounds.

Closing for now, but we could reconsider if this becomes a common request.

Comment by Eric Sedor [ 19/Aug/21 ]

To clarify my reasoning about SERVER-13974: I think one way performance could be improved for this query if we supported $near within $or and used a distance field returned by each $or stage to merge sort results. But, the query team will need to consider this and may have other ideas entirely.

Comment by Al Han [ 28/Apr/21 ]

Thanks Eric. I don't see how this is related to SERVER-13974 - 

Comment by Eric Sedor [ 27/Apr/21 ]

Thanks for your patience aphtechme@gmail.com. It looks like this issue is related to SERVER-13974. I'm going to pass this ticket to our query team for additional consideration.

Comment by Al Han [ 31/Mar/21 ]

Hi Eric,

I've attached a full output of explain("executionStats"). Let me know if there's anything else need. 

Thank you for your assistance,

Al

Comment by Eric Sedor [ 31/Mar/21 ]

Hi aphtechme@gmail.com, we can take a look. Could you please go ahead and provide the full output of explain("executionStats")?

Comment by Al Han [ 31/Mar/21 ]

Part of WinningPlan

 

 "winningPlan" : { "winningPlan" : { "stage" : "FETCH", "filter" : { "$or" : [ { "_id" : { "$eq" : "2cf98ef9-419b-4e77-b556-631837d429dc" } }, { "reference" : { "$eq" : "6013eb0ab006f82c6bf54003" } } ] }, "inputStage" : { "stage" : "GEO_NEAR_2DSPHERE", "keyPattern" : { "location" : "2dsphere", "_id" : 1, "reference" : 1 }, "indexName" : "location_id_reference", "indexVersion" : 2, "inputStages" : [ { "stage" : "FETCH", "inputStage" : { "stage" : "IXSCAN", "keyPattern" : { "location" : "2dsphere", "_id" : 1, "reference" : 1 }, "indexName" : "location_id_reference", "isMultiKey" : false, "multiKeyPaths" : { "location" : [ ], "_id" : [ ], "reference" : [ ] }, "isUnique" : false, "isSparse" : false, "isPartial" : false, "indexVersion" : 2, "direction" : "forward", "indexBounds" : { "location" : [ "[5206161169240293376, 5206161169240293376]", "[5219671968122404864, 5219671968122404864]", "[5220797868029247488, 5220797868029247488]", "[5221365216029179904, 5221365216029179904]",

Comment by Al Han [ 30/Mar/21 ]

Here are some output from explain():

"keysExamined":524854,"docsExamined":1037474,"cursorExhausted":true,"numYields":1048,"nreturned":9

Also seems like the correct index containing (location,_id,reference) is being used in 

winningPlan

Reducing the distanceMultiplier to 1 has not performance improvement impact whatsoever. 

Generated at Thu Feb 08 05:36:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.