[SERVER-60373] Duplicate predicates in query plan for time-series collection Created: 01/Oct/21  Updated: 27/Oct/23  Resolved: 11/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Rui Liu Assignee: Naama Bareket
Resolution: Gone away Votes: 0
Labels: greenerbuild, quick-tech-debt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-75998 Complete TODO listed in SERVER-60373 Closed
related to SERVER-22857 eliminate redundant conditions/clause... Closed
is related to SERVER-75079 Simplify boolean expressions before f... Closed
is related to SERVER-31360 MatchExpression::getOptimizer() for $... Closed
Assigned Teams:
Query Integration
Operating System: ALL
Steps To Reproduce:

db.createCollection("demo", {timeseries: {timeField: "time", granularity: "minutes"}})
db.adminCommand({enableSharding: "test"})
db.adminCommand({shardCollection: "test.demo", key: {time: 1}})
 
db.demo.insert({time: ISODate("2021-10-01T21:00:00Z")})
db.demo.find({time: ISODate("2021-10-01T21:00:00Z")}).explain()

Sprint: QO 2021-11-15, QO 2021-11-29, QO 2021-12-13, QO 2021-12-27, QO 2022-01-10, QO 2022-01-24, QO 2022-02-07, QO 2022-02-21, QO 2022-03-07, QO 2022-03-21, QO 2022-04-04, QO 2022-04-18, QO 2022-05-02, QO 2022-05-16, QO 2022-05-30, QO 2022-06-13, QO 2022-06-27, QO 2022-07-11, QO 2022-07-25, QO 2022-08-08, QO 2022-08-22, QO 2022-09-05, QO 2022-09-19, QO 2022-10-03, QE 2022-10-17
Participants:

 Description   

Querying a time-series collection produces a confusing query plan that has each $match predicate duplicated twice.

"winningPlan" : {
	"stage" : "SHARDING_FILTER",
	"inputStage" : {
		"stage" : "FETCH",
		"filter" : {
			"$and" : [
				{
					"_id" : {
						"$lte" : ObjectId("615776d0ffffffffffffffff")
					}
				},
				{
					"_id" : {
						"$lte" : ObjectId("615776d0ffffffffffffffff")
					}
				},
				{
					"_id" : {
						"$gte" : ObjectId("615625500000000000000000")
					}
				},
				{
					"_id" : {
						"$gte" : ObjectId("615625500000000000000000")
					}
				},
				{
					"control.max.time" : {
						"$_internalExprGte" : ISODate("2021-10-01T21:00:00Z")
					}
				},
				{
					"control.max.time" : {
						"$_internalExprGte" : ISODate("2021-10-01T21:00:00Z")
					}
				}
			]
		},
		"inputStage" : {
			"stage" : "IXSCAN",
			"keyPattern" : {
				"control.min.time" : 1
			},
			"indexName" : "control.min.time_1",
			"isMultiKey" : false,
			"multiKeyPaths" : {
				"control.min.time" : [ ]
			},
			"isUnique" : false,
			"isSparse" : false,
			"isPartial" : false,
			"indexVersion" : 2,
			"direction" : "forward",
			"indexBounds" : {
				"control.min.time" : [
					"[new Date(1633035600000), new Date(1633122000000)]"
				]
			}
		}
	}
}

 

 



 Comments   
Comment by Githook User [ 14/Apr/23 ]

Author:

{'name': 'Naama Bareket', 'email': 'naama.bareket@mongodb.com', 'username': 'naama-bareket'}

Message: SERVER-75998: Complete TODO listed in SERVER-60373
Branch: master
https://github.com/mongodb/mongo/commit/5e64b13ddd90dfa342559bbbd261f653c21e5453

Comment by Arun Banala [ 11/Apr/23 ]

It looks like rui.liu@mongodb.com change to serialize _eventFilter inside the $_internalUnpackBucket as part of SERVER-70269, addressed this problem for sharded time-series collections. The first time we call DocumentSourceInternalUnpackBucket::doOptimizeAt(), the suffix $match stage will always be absorbed into the DocumentSourceInternalUnpackBucket stage. So the next time DocumentSourceInternalUnpackBucket::doOptimizeAt() is called, there won't be a suffix $match stage, and the optimize call ends up becoming a no-op.

Comment by Matt Boros [ 07/Oct/22 ]

Unassigning, as there seems to be a few more tests that fail and possibly other changes to make this work. Here's this latest branch: SERVER-60373. Latest patch looks like the JS test written needs to be changed.

Comment by Matt Boros [ 22/Nov/21 ]

Unfortunately I didn't have the chance to complete this ticket before my rotation ended. Here is the branch with the progress I made. Deduplication was implemented in MatchExpression::sortTree where the predicates are sorted. Performance benchmarks seemed to indicate this wouldn't create a regression. It looks like some tests need to be edited to account for the deduplication.

Comment by Arun Banala [ 21/Oct/21 ]

That could be a reasonable solution. When merging $match stages with an $and with two (or more) children here, we could check if there are any duplicating predicates.

Comment by David Percy [ 20/Oct/21 ]

Thanks!

If it does turn out to be too expensive to combine redundant predicates in CanonicalQuery, I wonder if it would make sense to do when combining $match stages.

Comment by Arun Banala [ 20/Oct/21 ]

david.percy There are two different issue here.
1. The one you mentioned about $and expression not de-duplicating it's children during the optimization. I assume this might be a conscious decision, as the de-duplication might be an expensive operation for large queries.
2. For queries on sharded views, we do the optimization on the pipeline twice. The first time, when the query is sent on the view definition to the primary shard, and the second time when the query is send on the base collection (after the kickback to mongos). For normal views, this is ok, because the pipeline optimizations are generally idempotent. But the DocumentSourceInternalUnpackBucket::doOptimizeAt() is not idempotent. Every time we call this function, it can add a $match stage. So in this case, we end up generating duplicate $match stages, and I assume they get merged into a single stage because of an optimization in DocumentSourceMatch stage.

Comment by David Percy [ 19/Oct/21 ]

Seems related to SERVER-22857.

Comment by David Percy [ 19/Oct/21 ]

Should we do the same thing on a non-timeseries collection?

> db.c.find({$and: [ {_id: {$lt: 5}}, {_id: {$lt: 6}} ]}).explain()
{
        "explainVersion" : "2",
        "queryPlanner" : {
                "namespace" : "test.c",
                "indexFilterSet" : false,
                "parsedQuery" : {
                        "$and" : [
                                {
                                        "_id" : {
                                                "$lt" : 5
                                        }
                                },
                                {
                                        "_id" : {
                                                "$lt" : 6
                                        }
                                }
                        ]
                },

Generated at Thu Feb 08 05:49:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.