[SERVER-66072] $match sampling and $group aggregation strange behavior Created: 29/Apr/22  Updated: 29/Oct/23  Resolved: 11/Jul/22

Status: Closed
Project: Core Server
Component/s: Query Execution, Query Planning
Affects Version/s: 5.0.8, 5.0.6
Fix Version/s: 6.0.1, 5.0.11, 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Johan Maupetit Assignee: Alya Berciu
Resolution: Fixed Votes: 0
Labels: aggregation, group, sample
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Problem/Incident
is caused by SERVER-39938 aggregation $match before $lookup opt... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0, v5.3, v5.0
Sprint: QE 2022-06-13, QO 2022-07-11, QO 2022-07-25
Participants:

 Description   

I'm using mongodb aggregation pipeline with $sampleRate in order to improve my query performances. I felt on a strange behavior i don't understand ...

Here is my aggregation pipeline running on a big collection (1M+ documents) :

 

 [
      {
        '$match': {
          publishedAt: {
            '$gt': new Date('2021-04-27T22:00:00.000Z'),
            '$lt': new Date('2022-04-28T21:59:59.999Z')
          },
          //... some other matching fields
        }
      },
      {
        '$group': {
          _id: {
            keyWords: '$keyWords', // This is an Array<String>
            //... some other fields
          },
          first: { '$first': '$$CURRENT' }
        }
      },
      { '$match': { '$sampleRate': 0.25 } }, // This is where i do my sampling
      { '$replaceRoot': { newRoot: '$first' } },
      {
        '$project': {
          _id: true,
          //... some other fields
        }
      }
    ] 

When i do this i get approximately two times more documents than when i inverse the $replaceRoot and $sampleRate steps =>

  

  [
      {
        '$match': {
          publishedAt: {
            '$gt': new Date('2021-04-27T22:00:00.000Z'),
            '$lt': new Date('2022-04-28T21:59:59.999Z')
          },
          //... some other matching fields
        }
      },
      {
        '$group': {
          _id: {
            keyWords: '$keyWords', // This is an Array<String>
            //... some other fields
          },
          first: { '$first': '$$CURRENT' }
        }
      },
      { '$replaceRoot': { newRoot: '$first' } },
      { '$match': { '$sampleRate': 0.25 } }, // This is where i do my sampling
      {
        '$project': {
          _id: true,
          //... some other fields
        }
      }
    ]

... I don't understand why oO They should give the same number of documents to me.

Do you know where i'm failing to understand ? Or is it a bug ?

PS : I created a question here : https://stackoverflow.com/questions/72048023/mongodb-aggregate-pipeline-sampling-fail



 Comments   
Comment by Githook User [ 29/Jul/22 ]

Author:

{'name': 'Alya Berciu', 'email': 'alya.berciu@mongodb.com', 'username': 'alyacb'}

Message: SERVER-66072 Fix dependency analysis for $match on $expr with $rand
Branch: v5.0
https://github.com/mongodb/mongo/commit/f5d28a6e731319f7ef6c2745df65e62fdce885f6

Comment by Githook User [ 28/Jul/22 ]

Author:

{'name': 'Alya Berciu', 'email': 'alya.berciu@mongodb.com', 'username': 'alyacb'}

Message: SERVER-66072 Fix dependency analysis for $match on $expr with $rand
Branch: v6.0
https://github.com/mongodb/mongo/commit/1021a18b98f5fdc6feb74080f11926efc4822471

Comment by Alya Berciu [ 11/Jul/22 ]

Closing as the fix has been merged to master.

Comment by Githook User [ 11/Jul/22 ]

Author:

{'name': 'Alya Berciu', 'email': 'alya.berciu@mongodb.com', 'username': 'alyacb'}

Message: SERVER-66072 Fix dependency analysis for $match on $expr with $rand
Branch: master
https://github.com/mongodb/mongo/commit/e8e47a527f3f991ee438f78d3c3cd4e581cae044

Comment by Johan Maupetit [ 19/Jun/22 ]

Thank you mongo team for the informations, have a good luck for fixing this one

Comment by Zixuan Zhuang [ 09/Jun/22 ]

Thank you cjbjohan.maupetit@laposte.net for filing this bug ticket. We've determined it was caused by SERVER-39938 and affects v5.0. The Query Optimization team will be working on the bug fix.

Comment by Zixuan Zhuang [ 01/Jun/22 ]

The root cause is that we generate an incorrect query plan, when $sample is before $replaceRoot, $sample is pushed down to query and performed before $group so that it samples based on original documents instead of grouped documents.

This is the plan for correct query (`$replaceRoot` then `$sample`)

db.SERVER66072.aggregate([{'$group':{_id:'$keyWords', first: {'$first':'$$CURRENT'}} }, { '$replaceRoot':{ newRoot: '$first' }}, {'$match': { '$sampleRate': 0.25 }} ] ,{explain:true}).stages
[
	{
		"$cursor" : {
			"queryPlanner" : {
				"namespace" : "test.SERVER66072",
				"indexFilterSet" : false,
				"parsedQuery" : {
 
				},
				"queryHash" : "5F5FC979",
				"planCacheKey" : "5F5FC979",
				"maxIndexedOrSolutionsReached" : false,
				"maxIndexedAndSolutionsReached" : false,
				"maxScansToExplodeReached" : false,
				"winningPlan" : {
					"stage" : "COLLSCAN",
					"direction" : "forward"
				},
				"rejectedPlans" : [ ]
			}
		}
	},
	{
		"$group" : {
			"_id" : "$keyWords",
			"first" : {
				"$first" : "$$CURRENT"
			}
		}
	},
	{
		"$replaceRoot" : {
			"newRoot" : "$first"
		}
	},
	{
		"$match" : {
			"$expr" : {
				"$lt" : [
					{
						"$rand" : {
 
						}
					},
					{
						"$const" : 0.25
					}
				]
			}
		}
	}
]

This is the plan for wrong query (`$sample` then `$replaceRoot`):

db.SERVER66072.aggregate([{'$group':{_id:'$keyWords', first: {'$first':'$$CURRENT'}} }, {'$match': { '$sampleRate': 0.25 }}, { '$replaceRoot':{ newRoot: '$first' }} ], {explain:true}).stages
 
[
    {
        "$cursor" : {
            "queryPlanner" : {
                "namespace" : "test.SERVER66072",
                "indexFilterSet" : false,
                "parsedQuery" : {
                    "$expr" : {
                        "$lt" : [
                            {
                                "$rand" : {
 
                                }
                            },
                            {
                                "$const" : 0.25
                            }
                        ]
                    }
                },
                "queryHash" : "9E0A49C9",
                "planCacheKey" : "9E0A49C9",
                "maxIndexedOrSolutionsReached" : false,
                "maxIndexedAndSolutionsReached" : false,
                "maxScansToExplodeReached" : false,
                "winningPlan" : {
                    "stage" : "COLLSCAN",
                    "filter" : {
                        "$expr" : {
                            "$lt" : [
                                {
                                    "$rand" : {
 
                                    }
                                },
                                {
                                    "$const" : 0.25
                                }
                            ]
                        }
                    },
                    "direction" : "forward"
                },
                "rejectedPlans" : [ ]
            }
        }
    },
    {
        "$group" : {
            "_id" : "$keyWords",
            "first" : {
                "$first" : "$$CURRENT"
            }
        }
    },
    {
        "$replaceRoot" : {
            "newRoot" : "$first"
        }
    }
]

Comment by Chris Kelly [ 23/May/22 ]

Hi Johan,

Thank you for your patience. I've taken a look at your issue and forwarded it to the Query Execution team to investigate further.

Christopher

Generated at Thu Feb 08 06:04:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.