[CSHARP-4744] Improve optimization of Count with predicate in Group Created: 03/Aug/23  Updated: 09/Aug/23  Resolved: 09/Aug/23

Status: Closed
Project: C# Driver
Component/s: LINQ3
Affects Version/s: 2.20.0
Fix Version/s: 2.21.0

Type: Improvement Priority: Unknown
Reporter: Alistair Steele Assignee: Robert Stam
Resolution: Done Votes: 0
Labels: LINQ3
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Documentation Changes: Not Needed
Documentation Changes Summary:

1. What would you like to communicate to the user about this feature?
2. Would you like the user to see examples of the syntax and/or executable code and its output?
3. Which versions of the driver/connector does this apply to?


 Description   

A conditional count within a GroupBy aggregation is not currently optimised by the AstOptimizer, resulting in a $group that causes all documents to be pushed and then a following $project stage.

This has a significant impact on grouping, particularly when the group is actually just leveraging an index. With the unoptimised post-$group $project stage, the entire document set has to be collated, whereas with the $project optimised into the group (and with an appropriate index on the collection) the $group becomes non-blocking, performing at basically the same speed regardless of collection size.

The following demonstrates the difference in queries
 

No filter expression: 
 
db.Foo.Aggregate([]).GroupBy(x => x.FooName, (x, y) => new Summary() {FooName = x, Count = y.Count()})
 
Optimised query:
 
 db.Foo.Aggregate([{ "$group" : { "_id" : "$FooName", "__agg0" : { "$sum" : 1 } } }, { "$project" : { "FooName" : "$_id", "Count" : "$__agg0", "_id" : 0 } }])
 
Filtered expression: 
 
db.Foo.Aggregate([]).GroupBy(x => x.FooName, (x, y) => new Summary() {FooName = x, Count = y.Count(x => (Convert(x.State, Int32) == 1))})
 
Unoptimised query:
 
 db.Foo.Aggregate([{ "$group" : { "_id" : "$FooName", "_elements" : { "$push" : "$$ROOT" } } }, { "$project" : { "FooName" : "$_id", "Count" : { "$size" : { "$filter" : { "input" : "$_elements", "as" : "x", "cond" : { "$eq" : ["$$x.State", 1] } } } }, "_id" : 0 } }])

As you can see, once we make the Count filtered it fails to optimise and falls back to a full $push: "$$ROOT".

I think this can be optimised by converting the expression from a $size to a:

$sum: {
    $cond: [
        { $eq: ["$State", 1]},
        1,
        0
    ]
}

I had a scan of the source and I think this would be done in AstGroupingPipelineOptimizer.AccumulatorMover.VisitUnaryExpression.TryOptimizeSizeOfElements. It looks like this currently only optimizes counts for the entire _element set.

Demo source code:
 

using MongoDB.Driver;
using MongoDB.Driver.Linq;
using static System.Linq.Queryable;
 
public class Program
{
    public enum State
    {
        Started,
        Running,
        Complete
    }    
 
    public class Foo
    {
        public string FooName;
        public State State;
    }    
 
    public class Summary
    {
        public string FooName;
        public int Count;
    }    
 
    public static void Main(string[] args)
    {
        // Setup 
        string uri = "mongodb://localhost:27017";
        MongoClientSettings settings = MongoClientSettings.FromConnectionString(uri);
        settings.LinqProvider = LinqProvider.V3;
        MongoClient client = new(settings: settings);
        IMongoDatabase database = client.GetDatabase("db");
        IMongoCollection<Foo> foos = database.GetCollection<Foo>("Foo");
        foos.DeleteMany(x => true);        
 
        // Test data
        foos.InsertOne(new() { FooName = "foo1", State = State.Started });
        foos.InsertOne(new() { FooName = "foo1", State = State.Running });
        foos.InsertOne(new() { FooName = "foo2", State = State.Running });
        foos.InsertOne(new() { FooName = "foo2", State = State.Running });
        foos.InsertOne(new() { FooName = "foo2", State = State.Complete });
        foos.InsertOne(new() { FooName = "foo3", State = State.Complete });        
        // Queryable
        IQueryable<Foo> foosQ = foos.AsQueryable();        
 
        // No filter, just count
        var countByName = foosQ.GroupBy(x => x.FooName, (x, y) => new Summary()
        {
            FooName = x,
            Count = y.Count()
        });        
 
        // Filter by running
        var runningByName = foosQ.GroupBy(x => x.FooName, (x, y) => new Summary()
        {
            FooName = x,
            Count = y.Count(x => x.State == State.Running)
        });        
 
        Console.WriteLine($"No filter expression: {countByName.Expression}");
        Console.WriteLine($"Optimised query = {countByName}");
        Console.WriteLine();
        Console.WriteLine($"Filtered expression: {runningByName.Expression}");
        Console.WriteLine($"Unoptimised query = {runningByName}");     }
} 

 

 



 Comments   
Comment by Githook User [ 09/Aug/23 ]

Author:

{'name': 'rstam', 'email': 'robert@robertstam.org', 'username': 'rstam'}

Message: CSHARP-4744: Improve optimization of Count with predicate in Group.
Branch: master
https://github.com/mongodb/mongo-csharp-driver/commit/208438fe842bc041ad652f23c5244d217773d7fd

Comment by Robert Stam [ 08/Aug/23 ]

Thanks for reporting this. I have a fix in code review.

Comment by Alistair Steele [ 03/Aug/23 ]

Apologies for the formatting mess, I can't seem to find the edit button either.

Edit: And the field in the $sum sample is meant to be $State, not $Change.

[robert@mongodb.com]: I edited the sample replacing $Change with $State

Comment by PM Bot [ 03/Aug/23 ]

Hi alistair.steele@trapdoorlabs.uk, thank you for reporting this issue! The team will look into it and get back to you soon.

Generated at Wed Feb 07 21:49:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.