[SERVER-42160] $group stages that use a DISTINCT_SCAN do not use a SHARDING_FILTER on sharded collections Created: 11/Jul/19  Updated: 24/Jul/23

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: George Wangensteen Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 0
Labels: qexec-team, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-5477 when sharded, no need to merge groups... Backlog
Related
related to SERVER-13116 distinct isn't sharding aware Backlog
is related to SERVER-55200 DISTINCT_SCAN not used for $sort+$mat... Backlog
Assigned Teams:
Query Optimization
Operating System: ALL
Steps To Reproduce:

Assuming d.data is a sharded collection with orphan documents 

db.data.explain().aggregate([{$group: {_id: '$_id', i: {$sum: '$i'}}}])

The query plan returned here uses a SHARDING_FILTER stage, as this $group cannot leverage a DISTINCT_SCAN.

db.data.explain().aggregate([{$group: {_id: '$_id', i: {$first: '$i'}}}])

The query plan returned here does not use a SHARDING_FILTER stage, as this $group can leverage a DISTINCT_SCAN.

This was determined mostly by code/query plan inspection as reproducing an error resulting from this is timing-dependent. 
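
As a hedged sketch, the comparison above can be automated in the shell. Since the exact explain shape varies by version, the helper below simply walks the whole explain document collecting stage names (the helper and the expectation comments are illustrative, not part of any server API):

// Recursively collect every "stage" value in an explain document so the check
// works regardless of the exact explain layout on a given version.
function collectStages(node, out) {
    out = out || [];
    if (node === null || typeof node !== 'object') {
        return out;
    }
    if (typeof node.stage === 'string') {
        out.push(node.stage);
    }
    Object.keys(node).forEach(function(k) {
        collectStages(node[k], out);
    });
    return out;
}

var sumStages = collectStages(db.data.explain().aggregate([{$group: {_id: '$_id', i: {$sum: '$i'}}}]));
var firstStages = collectStages(db.data.explain().aggregate([{$group: {_id: '$_id', i: {$first: '$i'}}}]));

print('$sum plan:   ' + sumStages);    // expect SHARDING_FILTER and no DISTINCT_SCAN
print('$first plan: ' + firstStages);  // expect DISTINCT_SCAN but no SHARDING_FILTER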

Sprint: Query 2019-07-29, Query 2019-08-12, Query 2019-08-26, Query 2019-09-09, Query 2019-09-23, Query 2019-10-07, Query 2019-10-21, Query 2019-12-30, Query 2020-03-23, Query 2020-04-06, QE 2021-10-18, QE 2021-11-01, QE 2021-11-15, QO 2022-02-21, QO 2022-03-07
Participants:

 Description   

$group stages that leverage a DISTINCT_SCAN in their execution do not produce a SHARDING_FILTER stage in their query plans. This is a problem when orphan documents are still present when such a stage runs, because they can affect the result of the $group.

Interestingly, because a $group using a DISTINCT_SCAN needs to examine only one document for each value of the group key, the problem is generally hidden; such $groups naturally "deduplicate" the orphan documents by selecting only one document per group-key value. However, there are (at least) three cases where this still presents problems:
1) If documents are updated before a $group using a DISTINCT_SCAN runs, and before the orphan documents are otherwise purged, the $group may pick up and pass forward stale values taken from the orphan documents rather than from the "live", updated ones (a small in-shell illustration follows this list).
2) If https://jira.mongodb.org/browse/SERVER-5477 is implemented, $groups using DISTINCT_SCANs will run exclusively on the shards when the $group keys are a superset of the shard keys. This will exacerbate the issue because it means orphans won't be deduplicated "accidentally" by the $group, as in this case the $group runs under the assumption that all documents with the same group-key values are on the same shard.

3) If orphan documents persist after their "parent" documents have been deleted, the distinct scan can return stale values from these documents. 
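
A small hand-built illustration of case 1 (two plain, unsharded collections named shardA_view and shardB_view stand in for the two shards' local views; the names and values are hypothetical):

// Assume {_id: 1} was migrated from shardB to shardA and then updated, while the
// orphan copy on shardB still holds the old value of i.
db.shardA_view.drop();
db.shardB_view.drop();
db.shardA_view.insertOne({_id: 1, i: 2});   // owned, updated document
db.shardB_view.insertOne({_id: 1, i: 1});   // orphan with the stale value

// Each "shard" runs the $group locally; with no SHARDING_FILTER the orphan is not
// filtered out, so shardB contributes a partial result built from stale data.
var partials = db.shardA_view.aggregate([{$group: {_id: '$_id', i: {$first: '$i'}}}]).toArray()
    .concat(db.shardB_view.aggregate([{$group: {_id: '$_id', i: {$first: '$i'}}}]).toArray());

// A merging $first keeps whichever partial arrives first, so the final value of i
// for _id 1 can be the stale 1 instead of the updated 2.
printjson(partials);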



 Comments   
Comment by Charlie Swanson [ 07/Mar/22 ]

After discussion, we think that some of the solutions outlined above are worth pursuing, but they are expensive enough that we won't schedule this ticket on its own. With a fix on the horizon, we're not interested in risking a performance change to fix this bug at the moment.

After confirming that some of the correct use cases (e.g. using the shard key in the index) are still working, I'm moving this ticket back to the backlog and linking it to a future project to improve grouping performance on sharded clusters, where this work will fit in well.

Comment by Charlie Swanson [ 14/Feb/22 ]

An update to this ticket after picking it up and dusting it off:

  • The consensus seems to be that in order to fix this ticket, we need to do better than just preventing the DISTINCT_SCAN optimization when a SHARD_FILTER stage would also need to be present. To be clear, that is an option engineering-wise and would fix the correctness issue, but it would result in much slower plans for these types of queries. The two stages are currently incompatible from a correctness perspective except when the distinct key is exactly the shard key, and there is no existing logic to specifically look for this overlap. So, in approximate order of increasing complexity, we have these ideas:
    1. Do nothing (maintains performance characteristics but leaves an outstanding correctness issue). Probably we would schedule a project later to do one of the more ambitious things below, which should get easier with time in the new optimizer.
    2. Allow DISTINCT_SCAN and SHARD_FILTER on a single field which is exactly the shard key. (easy)
    3. An option to think about: expand #2 for dotted cases, e.g. shard key "a.b" with distinct on "a"? (easy-ish? more nuanced)
    4. Add logic to permit DISTINCT_SCAN + SHARD_FILTER in cases where the index has both the distinct key and all shard key fields, but the two can be different. We think this would require some decent amount of new work. Either:
      1. The DISTINCT_SCAN stage would have to be modified and integrated with the SHARD_FILTER stage (ian.boros and I were thinking of making a more generic "extra predicate" to be applied while doing the scan) (easy to medium, but not great architecturally and risky performance-wise)
      2. The DISTINCT_SCAN would need to distinct on the entire index key. This would make SHARD_FILTER correct, but would produce non-unique keys which would require further distinct-ing (a rough pipeline-level analogue is sketched after this list). Because it produces one key per unique shard key value, it may also be substantially less efficient. In cases where the distinct key is the prefix, this could work pretty well in theory, but it may be tricky to always improve plan quality with this kind of strategy. (medium, risky performance-wise)
        • Note that it appears that the distinct command is capable of post-processing to de-duplicate, but the aggregation pipeline would need some tweaking.
    5. Add a fully generic mechanism to allow the combination of DISTINCT_SCAN and SHARD_FILTER in any scenario (notably now including those requiring a FETCH, where the shard key is not in the DISTINCT_SCAN index). This seems possible but would require building a pretty complex query tree. The idea would be that DISTINCT_SCAN will produce all distinct keys which we know is a superset of the correct answer. When we would normally apply a SHARD_FILTER we could double-check via a sub-query (sub-pipeline? sub-plan? something) to see if there is a non-orphaned document which contains that key. If we would have filtered out the one value, but there is a version of that key which does belong on this shard, we can keep that value. I think this would work pretty well if there is an index available on both the distinct key and the shard key for quick lookup, or if the shard key is relatively unique (and we have a shard key index for sure), but may require a hash join to be a palatable solution in general, and even then I would question a heuristic to use DISTINCT_SCAN. (hard)
  • Note that for the cases not "fixed" by any bullet, we have the option of leaving the existing incorrect case, or disallowing DISTINCT_SCAN. I was imagining disallowing DISTINCT_SCAN so that the correctness problem disappears, but that does leave a risk of substantial performance regression for the cases which are not called out and fixed. Option 5 seems like the best hope of fixing everything and maintaining performance, but it unfortunately doesn't seem like we could guarantee a performance win over a non-DISTINCT_SCAN alternative.
  • #4 and #5 would potentially build some complex query plans. This is not necessarily all that scary, but it would create some serious complexity or tension around duplicating or combining operators from PlanStages and DocumentSources. We know we want PlanStages for the DISTINCT_SCAN logic, but some existing operators like DocumentSourceLookup or DocumentSourceInternalShardFilter may be handy in combination. Unfortunately, we're not well positioned to build a hybrid tree like this without leading to a crazy nesting doll type plan where we go PlanStages within DocumentSources within a PlanStage within DocumentSources... doesn't sound fun. There are some other ideas which I won't record here for brevity.
  • Because of that, ian.boros and I were considering some options for #4 and #5 which would be built exclusively in SBE-land. This would allow us to compose operators that are all in the same "space" (uniformly SBE), but would probably necessitate some fairly substantial new planning logic in the query planner immediately around SBE. It's my understanding that the SBE world doesn't consider DISTINCT_SCAN plans today, and the logic for $group pushdown is helpful but not extremely well positioned to solve this type of problem. So we'd need all the logic like this to rule out inapplicable indexes, and probably more complex logic if we want to build upon that for options #3 or #4.
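
A rough, pipeline-level analogue of option #4.2, only to show the shape of the idea; the real change would live inside the planner, and the field names dk, sk, and i are hypothetical stand-ins for the distinct key, the shard key, and an accumulated field:

// "Distinct" on the entire index key (dk, sk), which keeps SHARD_FILTER correct
// because the shard key is present, then de-duplicate down to dk alone.
db.data.aggregate([
    {$group: {_id: {dk: '$dk', sk: '$sk'}, i: {$first: '$i'}}},  // one result per (dk, sk)
    {$group: {_id: '$_id.dk', i: {$first: '$i'}}}                // the "further distinct-ing"
]);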

I'm going to explore option 2 and option #4.1 since they should be relatively simple to test.

cc justin.seyster since you provided the most recent patch for this which just banned DISTINCT_SCAN in cases where SHARD_FILTER would make it incorrect. cc asya since you were part of the decision crew for "we need to do better than just preventing DISTINCT_SCAN optimization when a SHARD_FILTER stage would also need to be present"

Comment by Justin Seyster [ 03/Sep/19 ]

Update on this: The fix is triggering some test failures, which I plan to diagnose before sending to code review.

Comment by Justin Seyster [ 29/Aug/19 ]

The most likely fix here is to ban DISTINCT_SCAN on shards when executing a sharded query, and I have some code written up that will do that. This change feels a little fraught, so I want to consider all the cases. Note that this will also affect the distinct command, which uses the same code path to generate its query plan.

1) Distinct command on a sharded collection: I have an integration test written that gets an incorrect result from a distinct command, because it returns a result from an orphaned document. (I didn't look into it, but this bug might have been around since before SERVER-9507.) With the fix in place, customers may see a loss in performance, because they will need to use an index scan or collection scan where a distinct scan was previously possible.

  • In general, it is not correct to attach a distinct scan to a shard filter, forcing us to ban the distinct scan.
  • However, we could safely attach a distinct scan to a shard filter when the shard key is a prefix of the distinct field (e.g., distinct("a.b") when the shard key is "a"). My current fix does not consider this case, and we will lose performance unnecessarily as a result. It's probably not worth the complexity, though, to address this case.

2) Pipelines with $sort followed by $group: The pipeline gets split at the $sort stage before the SERVER-9507 optimization (use a distinct scan for $group when possible) examines the pipeline. The $group stage ends up in the merging part of the pipeline, so there is never a chance to use a distinct scan (see the explain sketch below). This case will not be affected by my fix.
3) Pipelines with $group but without $sort: These pipelines get split in a way that sends a $group to the shard, and the shard may choose a distinct scan, even though no sort was requested (SERVER-9507). As with case 1, there is an obscure possibility of an incorrect result because of orphaned documents. Customers may see a performance loss as a result of the fix.
  • Also, as with case 1, it would be possible to keep the distinct scan optimization in place when the shard key is a prefix of the group-by field, but I don't plan to check for that edge case.
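
A hedged explain sketch for case 2, assuming an index on {a: 1} and a version that splits the pipeline as described (splitPipeline is only reported for sharded aggregations):

var exp = db.data.explain().aggregate([
    {$sort: {a: 1}},
    {$group: {_id: '$a', first: {$first: '$b'}}}
]);
// splitPipeline (when present) shows which stages run on the shards versus on the
// merging node; the $group landing in the merger part means no shard ever gets a
// chance to consider a DISTINCT_SCAN for it.
printjson(exp.splitPipeline);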

I'm going to spend a little more time on testing and then put my proposed fix up for code review.

Comment by Justin Seyster [ 11/Jul/19 ]

asya This bug is possible in almost all cases when there are orphan documents living in shards targeted by the $group. If a shard finds an index that it can use with the distinct scan optimization, it generates a plan with no shard filter, and there is a possibility that it will return an orphan document to the merging node.

To fix this bug, we will make the distinct scan optimization aware of the shard filter, but that will mean the need for shard filtering rules out the optimization in almost all cases (meaning the optimization will only benefit non-sharded collections). The distinct scan is only compatible with the shard filter if the shard key is a prefix of the field we are grouping by.

Comment by Asya Kamsky [ 11/Jul/19 ]

This is only the case when the distinct value is not on the shard key, right?

Comment by George Wangensteen [ 11/Jul/19 ]

After a good conversation about this with ian.boros, we've come to the unfortunate conclusion that (as currently implemented) distinct scans are never guaranteed to be correct on sharded collections. Consider the case where we attempted to have a distinct scan perform a shard filter. If the distinct scan found an orphan document and filtered it out, it would have no means to reach other documents with the same index key value as the orphan, so it would jump to the next index key value. This would produce incorrect results, and the only way to fix it would be to change the distinct scan interface so that it can access more than one document with the same group key, scanning documents with that key value until it finds one that isn't an orphan. That, though, could make the distinct scan as slow as an index scan in the worst case, although in practice there shouldn't be enough orphan documents to make it so. In any case, fixing this will require a not-insignificant change to the implementation of distinct scans.
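
A toy in-shell simulation of the interface problem described above (this is not server code; the entries and the filter are made up purely to show why filtering inside the current distinct scan loses keys):

// Index entries on one shard, sorted by the group key; two different documents
// share key 1, and the entry the scan examines first happens to be an orphan.
var indexEntries = [
    {key: 1, orphan: true},     // orphan document with key 1
    {key: 1, orphan: false},    // owned document with key 1, never examined
    {key: 2, orphan: false}
];

function distinctScanWithShardFilter(entries) {
    var out = [];
    var i = 0;
    while (i < entries.length) {
        var e = entries[i];
        if (!e.orphan) {
            out.push(e.key);    // the shard filter drops the orphan's key...
        }
        // ...but the scan skips ahead to the next distinct key either way, with no
        // way to go back and examine the owned document that shares the same key.
        while (i < entries.length && entries[i].key === e.key) {
            i++;
        }
    }
    return out;
}

printjson(distinctScanWithShardFilter(indexEntries));  // [ 2 ]: key 1 is incorrectly lost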

Comment by George Wangensteen [ 11/Jul/19 ]

CC justin.seyster I believe you're most familiar with $group using distinct scans from https://jira.mongodb.org/browse/SERVER-9507 so let me know if you think the description here can be improved. 
