[SERVER-34687] Aggregation Index Performance Created: 26/Apr/18  Updated: 21/May/18  Resolved: 21/May/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Daniel Polistchuck Assignee: Asya Kamsky
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File explains.txt    
Issue Links:
Related
related to SERVER-34926 allow $expr with comparison bounded o... Closed
Operating System: ALL
Steps To Reproduce:

The following aggregation:

db.Collection.aggregate([
    {
        $match: {
            "SimulationDate": {$gte: ISODate("2018-03-21 00:00:00")},
            "ParentId": null
        }
    },
    {
        $project: {
            "CPF": "$User.CPF",
            "Data": "$SimulationDate",
            "Email": "$User.Email",
            "TelCelular": {$concat: ["$User.MobilePhone.AreaCode","-","$Borrower.MobilePhone.Number"]},
            "TelOutro": {$concat: ["$User.HomePhone.AreaCode","-","$Borrower.HomePhone.Number"]},
        }
    },
    {
        $lookup: {
            from: "Collection",
            let: {
                id: "$Email",
                data: "$Data"
            },
            pipeline: [
                {
                    $match: {
                        $expr: {$and: [
                            {$eq: ["$$id", "$User.Email"]},
                            {$gte: ["$SimulationDate", ISODate("2018-03-21 00:00:00")]},
                            {$lte: ["$SimulationDate", "$$data"]},
                        ]}
                    }
                },
                {
                    $project: {
                        "CPF_Email": "$User.CPF",
                        "Data_Email": "$SimulationDate"
                    }
                }
            ],
            as: "Email"
        }
    }
])

seems to run much slower (I couldn't even wait for the first line to return) than the (similar?) find+Javascript script:

var ds = new Date(2018,2,21)
db.Collection.find({"SimulationDate": {$gte: ds},"ParentId": null},{'SimulationDate':1,'User.Email':1}).forEach(p=>{
    db.Collection.find({'User.Email':p.User.Email,SimulationDate:{$gte:ds, $lte:p.SimulationDate}}).forEach(pb=>{
        print (pb.User.Email+","+pb.SimulationDate)
    })
})

Participants:

 Description   

It seems that Index usage during $lookup phases is not fully optimized for $pipeline+$expr lookups.



 Comments   
Comment by Asya Kamsky [ 21/May/18 ]

I'm closing this issue as incomplete because it appears that the exact same query plan is being used in both cases.

Feel free to reopen with additional details if there appears to be a problem here I've missed.

I've also linked a related ticket that would allow the query to use a compound index on the two fields if one was present SERVER-34926 which could potentially improve performance if this query in some cases.

Comment by Asya Kamsky [ 30/Apr/18 ]

So I see the external query in loop and lookup which uses $expr are both using the same index to lookup data:

			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"Borrower.Email" : 1
				},
				"indexName" : "Borrower.Email_1",
				"isMultiKey" : false,
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 1,
				"direction" : "forward",
				"indexBounds" : {
					"Borrower.Email" : [
						"[\"danielp@bompracredito.com.br\", \"danielp@bompracredito.com.br\"]"
					]
				}
			} 

It's not immediately clear to me why it would be significantly slower.

Could you run both to completion and then include total stats for how long it took?

Comment by Daniel Polistchuck [ 29/Apr/18 ]

This time, I didn't bother to 'obfuscate' collection and propery names. The query planner outputs were so large that I attached a file with all info to this issue (explains.txt)

Comment by Asya Kamsky [ 28/Apr/18 ]

In that case can you please provide the output of this query:

db.Collection.explain(true).find({'User.Email':<email>,SimulationDate:{$gte:<date1>, $lte:<date2>}})
// and 
db.Collection.explain(true).find({$expr:{$and:[
  {$eq: [<email>, "$User.Email"]},
  {$gte: ["$SimulationDate", <date1>]},
  {$lte: ["$SimulationDate",<date2>]}
]}})

In both cases please fill in some reasonable values for <email>, <date1> and <date2>

Comment by Daniel Polistchuck [ 28/Apr/18 ]

Sorry, I renamed FullProposals (our actual collection name) to "Collection"
(a generic name) in all instances but in the "find" example.

Comment by Asya Kamsky [ 28/Apr/18 ]

We have several tickets tracking improving performance of $lookup stages, however $expr introduced a difficulty I described (several comparison operators do not have exactly the same meaning in find vs. aggregation expression.

I'm trying to understand if there is a bug here (in the example you describe): your aggregation queries "Collection" and then does a $lookup from "Collection" - your JS on the other hand after query on "Collection" does a subsequent lookup (via find) on "FullProposals". Please clarify the intention of the aggregation so that I can ask you for correct "explain" output to diagnose this performance issue.

Comment by Daniel Polistchuck [ 27/Apr/18 ]

@Asya Kamsky, User.Email is a string property. From an end-user experience standpoint, aggregations + lookups should behave exactly as SQL INNER JOINS.

If indexes area available, they should be leveraged to achieve maximum performance.

More to the point, I am seeing several scenarios, as a MongoDB developer, where I am experiencing better performance using finds + JavaScript than with aggregations with lookups. The expectation is that aggregations should be much faster than finds+javascript, since there's no "chat" going on between the client and MongoDB.

That's especially frustrating for users coming from an SQL background (say, statisticians) that are used to pretty fast SQL joins. Not my case, but I am having to support those users.

Are there any plans to improve aggregation lookup performance?

Comment by Asya Kamsky [ 27/Apr/18 ]

danielp@bompracredito.com.br please note that the JS example is not 100% identical to $lookup with expressive pipeline due to greater-than semantics in find and aggregation being different. This is why find can use an index for inequality when aggregation $expr cannot. You can see SERVER-31804 for details about how the wrong results can be returned if an index is used for inequality.

In addition, even equality semantics are not the same in aggregation and find as find looks inside arrays and aggregation does not (on straight $eq), is "User.Email" field an array in your case? If not then then $lookup should still be relatively fast if there is an index to use for equality. You can see some of that discussion in SERVER-31760

Generated at Thu Feb 08 04:37:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.