[SERVER-49461] Slow performance when doing lookup in small sets Created: 11/Jul/20  Updated: 27/Oct/23  Resolved: 05/Aug/20

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 4.2.6, 4.2.8, 4.4.0-rc12
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Bernard Grymonpon Assignee: Asya Kamsky
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File dataset.tgz     File diagnostic.data.tgz     HTML File gdb.html     Text File gdb_sort.txt     PNG File tree.png    
Issue Links:
Related
related to SERVER-21284 $lookup should cache query results Backlog
related to SERVER-21312 $lookup should batch query requests Backlog
Participants:

 Description   

When doing an aggregation from a somehow larger collection (260k documents in my case), having one or more lookup stage(s) towards small collections (94 and 207 documents in my case), the performance further in the pipeline is slow (doing a sort on an embedded key from the lookup, or anything else (big skip, rare match,...) which forces a full pass to happen).

In my example (see below for exact pipeline and data), we start from 260k documents, lookup in two very small other collections (on an indexed field), project to collapse a embedded key from the lookup tables and then sort on this projected field. This runs for 53 seconds.

After analysing this for a while, and looking at mongotop, i notices that most time is spent reading from the lookup-tables. However, even when reducing the dataset to a couple of megabytes and making sure the WT cache is certainly big enough (1.5G) to hold everything (data, indexes and intermediate results), I cannot get my head around it why this is this slow ... I understand it has to do 260k lookups, twice (so, 500k-ish) due to the double lookup, but these happen on an indexed field in a dataset of minimal size.

I timed doing 1 million simple finds on the indexed field in the larger of the two joined collections, and this takes about 8 seconds (which is double the amount of what should be done).

I realise that doing lookups is expensive, but it feels like there is something going wrong in the lookup phase - this should be an all in-memory operation.

The indices on the lookup tables are used in the aggregation pipeline, as without them, it takes +140 seconds. The "delete" match at the beginning doesn't reduce the dataset a lot (approx 1%), leaving that out makes little difference.

I tried on 4.2.6, 4.2.8 and the 4.4.0-rc.12, all results are equal, give or take a few tenths of a second. All done in docker for mac (recent macbook pro), 2GB RAM, 1.5GB assigned as WT cache, nothing else running. We see this as well on 64GB Google-Cloud SSD-backed nodes, and equal Aliyun nodes, WT cache set to 70%).

Datamodel examples:

Alert:

 

{
    "_id" : ObjectId("5e154fc78dde24bb1eb2277d"),
    "item" : {
        "model" : {
            "urn" : "alertmodel:5e15375d0bb3bf0001c8454a:v3"
        }
    },
    "deleted" : false
}

Index on `deleted`, `item.model.urn`.

 

AlertModel:

 

{
    "_id" : ObjectId("5e1534178dde24bb1eac3bc4"),
    "item" : {
        "name" : {
            "text" : "Sample data"
        }
    },
    "urn" : "alertmodel:5e153417222cc60001b4eb3c:v18"
}

index on `urn`

 

AlertModelVersions:

 

{
    "_id" : ObjectId("5e1540578dde24bb1eacc0ed"),
    "item" : {
        "name" : {
            "text" : "Sample data"
        }
    },
    "urn" : "alertmodel:5e15375d0bb3bf0001c8454a:v1"
}

 index on `urn`.

Aggregation pipeline:

 

 

db.Alert.aggregate([db.Alert.aggregate([ 
{ "$match" : { "deleted" : false } }, 
{ "$lookup" : { 
  "localField" : "item.model.urn", 
  "as" : "joined.item.model", 
  "from" : "AlertModel", 
  "foreignField" : "urn" } }, 
{ "$lookup" : { 
  "localField" : "item.model.urn", 
  "as" : "versionJoined.item.model", 
  "from" : "AlertModelVersions", 
  "foreignField" : "urn" } }, 
{ "$project" : { 
  "joined.item.model" : { 
    "$arrayElemAt" : [ { "$map" : { "input" : { "$concatArrays" : [ "$joined.item.model", "$versionJoined.item.model" ] }, "as" : "object", "in" : "$$object" } },
    0 ] }, 
} },      
{ "$sort" : { "joined.item.model.item.name.text" : -1 } },   
{ "$limit" : 50 }, 
], 
{"allowDiskUse": true})

Explain:

{
    "stages" : [ 
        {
            "$cursor" : {
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "atl.Alert",
                    "indexFilterSet" : false,
                    "parsedQuery" : {
                        "deleted" : {
                            "$eq" : false
                        }
                    },
                    "queryHash" : "3BDEB6AE",
                    "planCacheKey" : "71819335",
                    "winningPlan" : {
                        "stage" : "PROJECTION_DEFAULT",
                        "transformBy" : {
                            "_id" : 1,
                            "item.model.urn" : 1,
                            "joined" : 1,
                            "versionJoined.item.model" : 1
                        },
                        "inputStage" : {
                            "stage" : "FETCH",
                            "inputStage" : {
                                "stage" : "IXSCAN",
                                "keyPattern" : {
                                    "deleted" : 1.0,
                                    "item.model.urn" : 1.0
                                },
                                "indexName" : "deleted_1_item.model.urn_1",
                                "isMultiKey" : false,
                                "multiKeyPaths" : {
                                    "deleted" : [],
                                    "item.model.urn" : []
                                },
                                "isUnique" : false,
                                "isSparse" : false,
                                "isPartial" : false,
                                "indexVersion" : 2,
                                "direction" : "forward",
                                "indexBounds" : {
                                    "deleted" : [ 
                                        "[false, false]"
                                    ],
                                    "item.model.urn" : [ 
                                        "[MinKey, MaxKey]"
                                    ]
                                }
                            }
                        }
                    },
                    "rejectedPlans" : []
                }
            }
        }, 
        {
            "$lookup" : {
                "from" : "AlertModel",
                "as" : "joined.item.model",
                "localField" : "item.model.urn",
                "foreignField" : "urn"
            }
        }, 
        {
            "$lookup" : {
                "from" : "AlertModelVersions",
                "as" : "versionJoined.item.model",
                "localField" : "item.model.urn",
                "foreignField" : "urn"
            }
        }, 
        {
            "$project" : {
                "_id" : true,
                "joined" : {
                    "item" : {
                        "model" : {
                            "$arrayElemAt" : [ 
                                {
                                    "$map" : {
                                        "input" : {
                                            "$concatArrays" : [ 
                                                "$joined.item.model", 
                                                "$versionJoined.item.model"
                                            ]
                                        },
                                        "as" : "object",
                                        "in" : "$$object"
                                    }
                                }, 
                                {
                                    "$const" : 0.0
                                }
                            ]
                        }
                    }
                }
            }
        }, 
        {
            "$sort" : {
                "sortKey" : {
                    "joined.item.model.item.name.text" : -1
                },
                "limit" : NumberLong(50)
            }
        }
    ],
    "serverInfo" : {
        "host" : "9344216bd23c",
        "port" : 27017,
        "version" : "4.4.0-rc12",
        "gitVersion" : "4b96db94125b5e0ca1da0c91f09a5fca8c94d9ab"
    },
    "ok" : 1.0
}

I can give a complete dataset if you want (1.7 MB compressed mongodump). Let me know if you need more info.

(I've left the priority field below at the default level)



 Comments   
Comment by Asya Kamsky [ 05/Aug/20 ]

We currently know there is a lot of ways that $lookup performance could be optimized. Your issue is similar to possible optimizations described in SERVER-21312 and SERVER-21284. At the moment $lookup is rather naive in its execution and repeats a lot of query planning for every single document in the pipeline therefore even though the actual lookup itself is very cheap, the extra overhead of doing it over and over again makes it very expensive when the collection being aggregated is large.

I wonder if there might be a better/faster approach to what you are trying to do since you only seem to need one match from (I guess `urn` is unique?)

I'm going to close this ticket as the work to improve this performance is tracked in the two tickets I mentioned. Meanwhile if you could figure out how to move your {{ "$sort" }} and { { "$limit" : 50 }} before the lookups it would significantly reduce the running time of this query. Another option would be to materialize the result into another collection querying which would not need any $lookup stages.

Comment by Jonathan Streets (Inactive) [ 29/Jul/20 ]

hi bernard.grymonpon@oqton.com,
Thank you for the detailed example. I have been able to reproduce the issue you are seeing. We're assigning this ticket to the appropriate team to be evaluated against our currently planned work. Updates will be posted on this ticket as they happen.
Jon

Comment by Bernard Grymonpon [ 17/Jul/20 ]

Jonathan,

I just restarted the docker (-rc.12), loaded the dataset (mongorestore), upped the WT cache (db.adminCommand( { "setParameter": 1, "wiredTigerEngineRuntimeConfig": "cache_size=1500M"})) ran the query once (took 45 seconds) and then copied the diagnostics. 

I'll attach it, together with the dataset.

Comment by Jonathan Streets (Inactive) [ 17/Jul/20 ]

hi Bernard,

thankyou for the info.   in order for us to investigate this further, please could you send us the FTDC data for one of the affected versions.
please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket?

Also the sample dataset will help immensely, so could you attach that too?

many thanks

jon

 

 

Generated at Thu Feb 08 05:19:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.