[SERVER-45185] $graphLookup's internal cache handles null/missing incorrectly, resulting in incorrect query results Created: 16/Dec/19 Updated: 06/Dec/22 Resolved: 08/Jul/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vlad Rachev (Inactive) | Assignee: | Backlog - Query Execution |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | afz, qexec-team | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Issue Links: |
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Assigned Teams: |
Query Execution
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Operating System: | ALL | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Steps To Reproduce: |
4.3 result set:
4.2 result set:
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Participants: | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
When the "connect from" document contains null but the "connect to" document is missing, the "connect to" document is not correctly inserted into DocumentSourceGraphLookup::_cache. This can result in incorrect query results when there is a subsequent "connect from" value of null. This issue was originally detected by the multiversion agg fuzzer, since it appears that the incorrect result sets can manifest differently in different versions. There is a simpler repro in this comment below and a detailed description of the issue in this comment. Original descriptionI've attached a somewhat minimal agg-fuzzer repro and a file with 4.3/4.2 explain outputs. The results for 4.3/4.2 contain the same values, but they are being assigned to different fields. |
| Comments |
| Comment by Nicholas Zolnierz [ 08/Jul/22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I've verified that this issue is fixed by | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Asya Kamsky [ 22/Jul/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
david.storch I think the problem starts before the cache and its optimization (and you made that observation in the second comment). The original match from startWith to connectToField is performed incorrectly. > The reason that we don't return the correct result set is indeed because of DocumentSourceGraphLookup::_cache. > The correct result is returned for the first bread-first search. I don't think so, unless you're saying that the first breadth-first search result is not visible in the result set once we are done? MQL semantics say missing and null compare equal in match expressions. (Using your two documents from first comment):
The above result is correct because whether local field doesn't exist or it's null, since foreignField doesn't exist, all documents match them. $graphLookup with depth to 0 should be exactly the same as $lookup (note: these results are stable regardless of sort order of incoming documents)
note: in your example you connected to nullField - I'm connecting to non-existent field. It appears when we start with "missing", we never match anything in the initial $graphLookup and when we start with null we match both missing and null values. The middle example starts with value of a field that's missing in _id:1 and null in _id:2. This all happens before the cache is involved I believe. I noticed this because the results in both sort orders in your example are wrong... So are there in fact two bugs? Original comparison being wrong and one involving the cache? It seems like original query is incorrect A test with a single document in the collection shows that when "$startWith" is null it connects to either missing or null value in connectToField, but when "$startWith" is missing it connects to nothing. So are there two bugs here? Or fundamentally are they one? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by David Storch [ 08/Jan/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The problem appears to be even more fundamental than what I described above – neither 4.2 nor 4.3 is returning the correct result set! The startWith value is null, and according to MQL semantics an $eq:null predicate matches both documents where the field is missing and where the field stores a literal null. In the document with _id:1 the connectToField is missing; in the document with _id:2 it stores a literal null. Therefore, both documents _id:1 and _id:2 should have a results array of length 2 containing _id:1 and _id:2. The reason that we don't return the correct result set is indeed because of DocumentSourceGraphLookup::_cache. This data structure stores a map from Value to vector<Document> in memory. If a key is present in the data structure, then we assume that the associated vector<Document> should store all join partners. The bug involves a failure to correctly populate this cache, which results in missing join partners. Let's consider what happens for the first query in my repro above, where the $graphLookup stage processes _id:1 first and then _id:2.
I am flagging this ticket for re-triage now that the cause of the issue is better understood. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by David Storch [ 08/Jan/20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I made an interesting observation of a seemingly related bug, although I'm not 100% sure yet if it has the same root cause. Consider the following script:
Run against a recent version of master, this produces the following output:
The two queries are identical aside from the $sort order they specify, but they produce fundamentally different result sets. In the first query, _id:1 joints with both itself and _id:2. In the second query, however, _id:1 joins only with _id:2. (There is a similar anomaly if we examine the join partners of _id:2 in each query.) This is a query correctness bug, since a $sort preceding a $graphLookup should not change the results of the graph search for each document. The bug appears to be related to how the caching optimization in DocumentSourceGraphLookup is implemented for "missing" values. I need to dig deeper to understand the problem more precisely. |