[SERVER-42518] Wildcard index plans miss results when the query path has multiple subsequent array indexes Created: 30/Jul/19 Updated: 29/Oct/23 Resolved: 22/Aug/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 4.2.0-rc4 |
| Fix Version/s: | 4.2.1, 4.3.1 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | David Storch | Assignee: | David Storch |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | KS, query-44-grooming | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: |
v4.2
|
||||||||
| Sprint: | Query 2019-08-12, Query 2019-08-26 | ||||||||
| Participants: | |||||||||
| Description |
|
This issue was identified by joe.caswell, who observed the following correctness problem after creating a wildcard index:
When a wildcard index is present, the planner attempts to use it in order to satisfy the equality predicate over the path "a.0.0". In this example, both "0" path components resolve to array indexes. The access plan, however, does not correctly account for these array index path components. This leads to missing results. This issue should only impact queries which attempt to match a nested array by using multiple subsequent array indexes. Operations which do not query by array index, or which do not have multiple adjacent array index path components (e.g. "a.0" or "a.0.b.1") are not affected. |
| Comments |
| Comment by Githook User [ 22/Aug/19 ] |
|
Author: {'username': 'dstorch', 'email': 'david.storch@10gen.com', 'name': 'David Storch'}Message: Consider a query on a generic path "c1.c2...arr.0.1...cn" This change fixes the planner to avoid attempting to use a (cherry picked from commit 0f0463d0442b2c77cb650a8e84fea4d634fb63e6) |
| Comment by Githook User [ 22/Aug/19 ] |
|
Author: {'username': 'dstorch', 'email': 'david.storch@10gen.com', 'name': 'David Storch'}Message: Consider a query on a generic path "c1.c2...arr.0.1...cn" This change fixes the planner to avoid attempting to use a |
| Comment by David Storch [ 30/Jul/19 ] |
|
After review, the Server Query Team believes that the appropriate fix for this issue is to prevent queries which have multiple subsequent positional path components from using a wildcard index when the previous path component is known to be an array. For example, point queries over the a path like "a.0.0" will not be eligible to use a wildcard index if the index metadata indicates that "a" is multikey. In such cases, the index does not record metadata about whether "a.0" or "a.0.0" are arrays; we must defensively presume that they are array indexes, as opposed to field names, due to their spelling. If we pursue such a fix, wildcard indexes will no longer be useful for schemas which involve field names spelled like array indexes. For example, imagine a schema with documents like {a: [{"1": "foo"}}]}. The query find({"a.0.1": "foo"}) will no longer be able to use a wildcard index, since the planner cannot know that the "1" always refers to a field name in the user's schema. It looks like this new behavior can be implemented by extending the validateNumericPathComponents() function from planner_wildcard_helpers.cpp. The reason that queries like this cannot easily use the index is due to the internal index format. In a wildcard index, each key has a leading component which encodes the path as well as a trailing component which encodes the value. That is, each document is shredded into (path, value) pairs for indexing. For instance, the index keys for the document {a: 2, b: 3} with the wildcard index {"$**": 1} would be {"": "a", "": 2} and {"": "b", "": 3}. When the document has an indexed array, we follow the pre-existing behavior for multikey indexes by producing an index key for every array element. For example, the document {a: 2, b: [3, 4]} will result in three index keys: {"": "a", "": 2}, {"": "b", "": 3}, and {"": "b", "": 4}. Just like all other multikey indexes, nested arrays are not recursively expanded for indexing. Therefore, for your test document {a: [["test"]]} the index will contain only the key {"": "a", "": ["test"]}. Note that the array has been expanded just one level, as opposed to being recursively expanded. This design choice was important to make the semantics for using a multikey wildcard index to accelerate a query as similar as possible to how we build access plans for regular multikey indexes. Since nested array are not expanded, and since the array indexes are not explicitly encoded into the index keys, we cannot build useful index bounds for the query find({"a.0.0": "test"}). We could, of course, search for the key {"": "a", "": ["test"]}, but there is no good way for the planner to know a priori that this is a good idea. Should we also search for the key {"": "a", ["test", "foo"]} which could also be present for a matching document? It's possible that we could build clever index bounds that would incorporate all such index keys. This might only work well for array index "0", since the bounds necessary to capture all index keys containing arrays which have a particular value at element "n" would likely be inefficient. |