[SERVER-41171] MongoDB $lookup using "$expr" is slow compared to simple $lookup Created: 16/May/19  Updated: 07/Apr/23

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 4.0.8
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Daniele Tassone Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 8
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-45326 Poor Aggregation Framework performanc... Closed
Related
related to SERVER-34927 allow localField and foreignField wit... Closed
Assigned Teams:
Query Execution
Participants:

 Description   

$lookup with let+pipeline performs slower than simple lookup.

In the MongoDB shell...

To build the data set:

db.User.deleteMany({});
db.Order.deleteMany({});
db.User.ensureIndex({anni: 1});
db.Order.ensureIndex({_userId: 1});for (var va2 = 0; va2 < 1000; va2++) {
        var myUserA = {};
        myUserA._id = new ObjectId();
        myUserA.nome = "DanieleA";
        myUserA.anni = va2;    var myUserB = {};
        myUserB._id = new ObjectId();
        myUserB.nome = "DanieleB";
        myUserB.anni = va2;
    
    db.User.insertOne(myUserA);
    db.User.insertOne(myUserB);    var myOrders = [];
    for (var va = 0; va < 100; va++) {
        var myOrder = {};
            myOrder.prezzo = Math.random() * 1 * 100;
            myOrder._userId = myUserB._id;
            myOrders.push (myOrder);
    }
    db.Order.insertMany (myOrders);
};

To test performance:

for (var tries = 0; tries < 10; tries++) {
  startDate = new Date();db.getCollection('User').aggregate([
      {
        $lookup:
        {
          from: "Order",
          localField: "_id",
          foreignField: "_userId",
          as: "users"
        }
      },
      { $count: "totalCount" }
  ]);print("Without pipeline " + (new Date() - startDate))
 
  // Lookup with pipeline
  startDate = new Date();db.getCollection('User').aggregate([
      {
        $lookup: {
          from: "Order",
          let: {
            idUtente: "$_id"
          },
          pipeline: [
            {
              $match: 
              {
                $expr: {
                  $eq: ["$_userId", "$$idUtente"]
                }
              }
            }
          ],
          as: "Orders"
        }
      },
      { $count: "totalCount" }
  ]);print("With pipeline " + (new Date() - startDate))
}

Output (a local reproduction):

Without pipeline 259
With pipeline 338
Without pipeline 234
With pipeline 342
Without pipeline 235
With pipeline 341
Without pipeline 231
With pipeline 368
Without pipeline 262
With pipeline 375
Without pipeline 270
With pipeline 414
Without pipeline 281
With pipeline 395
Without pipeline 266
With pipeline 406
Without pipeline 285
With pipeline 384
Without pipeline 262
With pipeline 383

Is this difference expected and/or can it be improved?

original description

Hi - i'm experiencing a slow performance issue  while using $lookup with an embedded "pipeline". There is no slow performance issue for $lookup that is not using "pipeline".

// SEED THE COLLECTIONS
// 1) Create collections
// 2) Create index

db.User.deleteMany({});
db.Order.deleteMany({});db.User.ensureIndex({anni: 1});
db.Order.ensureIndex({_userId: 1});for (var va2 = 0; va2 < 1000; va2++) {
    var myUserA = {};}
        myUserA._id = new ObjectId();
        myUserA.nome = "DanieleA";
        myUserA.anni = va2;    var myUserB = {};
        myUserB._id = new ObjectId();
        myUserB.nome = "DanieleB";
        myUserB.anni = va2;
    
    db.User.insertOne(myUserA);
    db.User.insertOne(myUserB);    var myOrders = [];
    for (var va = 0; va < 100; va++) {
        var myOrder = {};
            myOrder.prezzo = Math.random() * 1 * 100;
            myOrder._userId = myUserB._id;
            myOrders.push (myOrder);
    }
    db.Order.insertMany (myOrders);
}

// QUERY with $lookup is fast (0.06ms)

db.getCollection('User').aggregate([
    {
        $match: {
            anni: {$gt: 50}
        }
    },
    {
        $project: {
            anni: 1
        }
    },
    {
      $lookup:
      {
        from: "User",
        localField: "_id",
        foreignField: "_userId",
        as: "users"
      }
    },
    { $count: "totalCount" }
], {explain: false});

// QUERY with $lookup and pipeline is slow (0.54ms)

db.getCollection('User').aggregate([
    {
        $match: {
            anni: {$gt: 50}
        }
    },
    {
        $project: {
            anni: 1
        }
    },
    {
      $lookup: {
        from: "Order",
        let: {
          idUtente: "$_id"
        },
        pipeline: [
          {
            $match: 
            {
              $expr: {
                $eq: ["$_userId", "$$idUtente"]
              }
            }
          }
        ],
        as: "Orders"
      }
    },
    { $count: "totalCount" }
], {explain: false});

 // QUERY with $lookup and pipeline but without $count is fast (0.03ms)

db.getCollection('User').aggregate([
    {
        $match: {
            anni: {$gt: 50}
        }
    },
    {
        $project: {
            anni: 1
        }
    },
    {
      $lookup: {
        from: "Order",
        let: {
          idUtente: "$_id"
        },
        pipeline: [
          {
            $match: 
            {
              $expr: {
                $eq: ["$_userId", "$$idUtente"]
              }
            }
          }
        ],
        as: "Orders"
      }
    },
], {explain: false});

Replacing $count with $sort make the query still slow while using "lookup with pipeline".
No slow performance in the case of "lookup" with no-pipeline.

Basically this problem make the $lookup impossibile to be used for scenario where the lookup must use 2 fields to compute the join. In fact, in this scenario you must use lookup with "pipeline". But the poor performance described make the query really slow the not suitable for production.

 



 Comments   
Comment by Asya Kamsky [ 10/Sep/21 ]

I changed the name of the ticket since the issue is not additional pipeline stages but rather using $expr instead of localField/foreignField.

Note that as of 5.0 due to SERVER-34927 you can use regular $lookup with additional pipeline stages (if you need to add $project or $limit etc. Note that you have to have FCV set to "5.0" to use the new syntax if you upgraded to version 5.0.

Comment by Baskar K [ 05/Mar/21 ]

Facing the same issue in version 4.4.2.  Could you please share the status of this issue (whether its planned in upcoming release or not).

Comment by Annie Black [ 23/Feb/21 ]

Sorry!!

Comment by Dennis Hoefakker [ 18/Sep/20 ]

How can we give this more priority. I think it's very important.

The a big reason to use uncorrelated lookups is you can speedup the aggregation (a $projects results in lesser data out of the $lookup, lesser data that needs to go through the pipeline). Currently the "old" $lookup is way faster, but requires extra steps which takes time and results in "code" solutions.

 

Comment by Pablo Guerrero [ 13/Dec/19 ]

I have the same problem as @Daniele Tassone had. I'm trying to use a projection after making the relationship between two collections but It is really slower compare with normal $lookup. Could you please tell us about what is happening with $lookup with embedded pipeline?

 

Thanks so much for your work!

Comment by Eric Sedor [ 23/May/19 ]

daniele.tassone@tomadaproductions.it I've made some edits to the reproduction steps you provided, including removing extraneous pipeline stages and adding a timer. We can see the difference in performance you are pointing to.

While some performance change may be expected when using lookup with a pipeline, we are going to look into whether this magnitude of difference is expected given the data set you've assembled.

Importantly though, I believe removing the $count stage isn't a helpful signal for this test. Doing so causes a test to time how quickly the client receives the first page of cursor results, which is very fast compared to the time it takes for the server to consume all results for count purposes.

In the MongoDB shell...

To build the data set:

db.User.deleteMany({});
db.Order.deleteMany({});
db.User.ensureIndex({anni: 1});
db.Order.ensureIndex({_userId: 1});for (var va2 = 0; va2 < 1000; va2++) {
        var myUserA = {};
        myUserA._id = new ObjectId();
        myUserA.nome = "DanieleA";
        myUserA.anni = va2;    var myUserB = {};
        myUserB._id = new ObjectId();
        myUserB.nome = "DanieleB";
        myUserB.anni = va2;
    
    db.User.insertOne(myUserA);
    db.User.insertOne(myUserB);    var myOrders = [];
    for (var va = 0; va < 100; va++) {
        var myOrder = {};
            myOrder.prezzo = Math.random() * 1 * 100;
            myOrder._userId = myUserB._id;
            myOrders.push (myOrder);
    }
    db.Order.insertMany (myOrders);
};

To test performance:

for (var tries = 0; tries < 10; tries++) {
  startDate = new Date();db.getCollection('User').aggregate([
      {
        $lookup:
        {
          from: "Order",
          localField: "_id",
          foreignField: "_userId",
          as: "users"
        }
      },
      { $count: "totalCount" }
  ]);print("Without pipeline " + (new Date() - startDate))
 
  // Lookup with pipeline
  startDate = new Date();db.getCollection('User').aggregate([
      {
        $lookup: {
          from: "Order",
          let: {
            idUtente: "$_id"
          },
          pipeline: [
            {
              $match: 
              {
                $expr: {
                  $eq: ["$_userId", "$$idUtente"]
                }
              }
            }
          ],
          as: "Orders"
        }
      },
      { $count: "totalCount" }
  ]);print("With pipeline " + (new Date() - startDate))
}

Output (a local reproduction):

Without pipeline 259
With pipeline 338
Without pipeline 234
With pipeline 342
Without pipeline 235
With pipeline 341
Without pipeline 231
With pipeline 368
Without pipeline 262
With pipeline 375
Without pipeline 270
With pipeline 414
Without pipeline 281
With pipeline 395
Without pipeline 266
With pipeline 406
Without pipeline 285
With pipeline 384
Without pipeline 262
With pipeline 383

Comment by Eric Sedor [ 17/May/19 ]

Thanks daniele.tassone@tomadaproductions.it; We will review this report and let you know if we have any questions to help clarify.

Generated at Thu Feb 08 04:57:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.