[SERVER-83757] [CQF] Remember chunk boundaries for the duration of the query Created: 30/Nov/23  Updated: 18/Jan/24  Resolved: 18/Jan/24

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.3.0-rc0

Type: Improvement Priority: Major - P3
Reporter: Svilen Mihaylov (Inactive) Assignee: David Percy
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Sprint: QO 2023-12-11, QO 2023-12-25, QO 2024-01-08, QO 2024-01-22
Participants:

 Description   

We can remember the chunk starting points and keep them consistent for the duration of the query, in the hope that this achieves a greater degree of stability and repeatability in the sampling process. Before issuing any sampling queries, we issue a “chunk collection” query. Using the example above, this query will return 5 record ids, which we then store in a vector. We’ll modify the sampling query above to effectively replace the PhysicalScan with Limit 5 with a ValueScan node that sequentially returns the stored record ids. The rest of the sampling plan remains unchanged.
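
The idea above can be sketched roughly as follows. This is only an illustration in Python, not the server's implementation: record ids are modeled as list indices, and the helper names (`collect_chunk_starts`, `read_chunks`, `estimate_selectivity`), the chunk size of 10, and the toy collection are all assumptions made for the example.

```python
import random


def collect_chunk_starts(num_records, num_chunks, rng):
    # Hypothetical stand-in for the "chunk collection" query:
    # pick one starting record id per chunk, once per query.
    return sorted(rng.sample(range(num_records), num_chunks))


def read_chunks(documents, chunk_starts, chunk_size):
    # Hypothetical stand-in for the ValueScan-driven sampling plan:
    # replay the remembered starts and read chunk_size records from each.
    sample = []
    for start in chunk_starts:
        sample.extend(documents[start:start + chunk_size])
    return sample


def estimate_selectivity(sample, predicate):
    # Fraction of sampled documents that satisfy the predicate.
    if not sample:
        return 0.0
    return sum(1 for doc in sample if predicate(doc)) / len(sample)


# Toy collection (assumed data, for illustration only).
documents = [
    {"c": "US" if i % 2 == 0 else "CA", "s": "NY" if i % 4 == 0 else "ON"}
    for i in range(1000)
]

rng = random.Random(0)
chunk_starts = collect_chunk_starts(len(documents), 5, rng)  # stored once

# Every sampling query for this query reuses the same stored starts,
# so each predicate is evaluated against the same documents.
sample_for_c = read_chunks(documents, chunk_starts, 10)
sample_for_cs = read_chunks(documents, chunk_starts, 10)

est_c = estimate_selectivity(sample_for_c, lambda d: d["c"] == "US")
est_cs = estimate_selectivity(
    sample_for_cs, lambda d: d["c"] == "US" and d["s"] == "NY"
)
```

Because the starting points are collected once and replayed, repeated sampling queries see identical documents, which is the stability and repeatability the description is after.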



 Comments   
Comment by Githook User [ 18/Jan/24 ]

Author:

{'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}

Message: SERVER-83757 [CQF] Draw one sample to estimate all predicates in a query

When a query has multiple predicates, and the sampling CE method is
used, we run a sampling query for each predicate separately. The
sampling query is responsible for both (1) choosing a random sample of
documents and (2) counting the fraction of sampled documents that match
the predicate.

Using a separate sample for each predicate can lead to "contradictory"
estimates. For example, you could estimate that {c: "US", s: "NY"}
matches more documents than {c: "US"}. This is a bad estimate because
adding the {s: "NY"} filter cannot possibly increase the number of
matching documents.

This commit prevents these contradictory estimates by (conceptually)
drawing a single sample and reusing it for all predicates in the query.
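
A small sketch of why a shared sample rules out the contradiction, using
the {c: "US"} and {c: "US", s: "NY"} predicates from the example. The two
hand-built samples and the helper names (`matches`, `fraction_matching`)
are assumptions for illustration; they are not taken from the commit.

```python
def matches(doc, query):
    # Simple equality match on every field named in the query.
    return all(doc.get(field) == value for field, value in query.items())


def fraction_matching(sample, query):
    # Fraction of sampled documents that match the query.
    return sum(1 for doc in sample if matches(doc, query)) / len(sample)


broad = {"c": "US"}
narrow = {"c": "US", "s": "NY"}

# Two different (hand-picked, hypothetical) samples of the same collection.
sample_a = [{"c": "CA", "s": "ON"}] * 3 + [{"c": "US", "s": "NY"}]
sample_b = [{"c": "US", "s": "NY"}] * 2 + [{"c": "CA", "s": "ON"}] * 2

# Separate samples per predicate: the narrower predicate can look larger.
separate_broad = fraction_matching(sample_a, broad)    # 0.25
separate_narrow = fraction_matching(sample_b, narrow)  # 0.5

# One shared sample: every document matching `narrow` also matches `broad`,
# so the narrow estimate can never exceed the broad one.
shared_broad = fraction_matching(sample_a, broad)    # 0.25
shared_narrow = fraction_matching(sample_a, narrow)  # 0.25
```

With a single sample, the set of documents matching the narrower
predicate is a subset of those matching the broader one, so the
estimates stay consistent regardless of which documents were drawn.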

Physically, we materialize only a handful of record IDs: one per chunk,
controlled by 'internalCascadesOptimizerSampleChunks'.

GitOrigin-RevId: 36b07d142f0cf65d4b554160d67e152c64964262
Branch: master
https://github.com/mongodb/mongo/commit/94862ff496e352fe1c15aced7a42a128fd01860b

Comment by Githook User [ 18/Jan/24 ]

Author:

{'name': 'David Percy', 'email': 'david.percy@mongodb.com', 'username': 'dpercy'}

Message: SERVER-83757 [CQF] Add missing properties in sampling ImplementationVisitor.

In a previous ticket, SERVER-81589, we introduced this
ImplementationVisitor to avoid using OptPhaseManager for sampling
queries. It's responsible for producing a physical plan and the
associated properties in _propsMap. However, we were missing a map
entry for the LimitSkipNode.

GitOrigin-RevId: bd0a5cdba6720cac56d60b4f5848ac13166789df
Branch: master
https://github.com/mongodb/mongo/commit/ba9c9d09055eb5a0b07b12c0755e9d6ce5dc43d9

Generated at Thu Feb 08 06:53:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.