[SERVER-45185] $graphLookup's internal cache handles null/missing incorrectly, resulting in incorrect query results Created: 16/Dec/19  Updated: 06/Dec/22  Resolved: 08/Jul/22

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vlad Rachev (Inactive) Assignee: Backlog - Query Execution
Resolution: Duplicate Votes: 0
Labels: afz, qexec-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File explainOutput.log     File graphLookupErr.js    
Issue Links:
Initiative
includes Epic(s) SERVER-60755 We don't add documents to the graphLo... Closed
Related
related to SERVER-60755 We don't add documents to the graphLo... Closed
Assigned Teams:
Query Execution
Operating System: ALL
Steps To Reproduce:

const documentList =
 [
    {_id: 1037, "array": [[], ] }, // 187
    {_id: 1055, "obj": { "obj": {_id: 1058, "str": null, "array": [[]], }, }, }, // 190
];

const aggregation = [
        {$sort: {
                "nonexistentfield": -1,
                "obj.obj.array": 1
            }
        },
        {$graphLookup: {from: "fuzzer_coll_lookup", startWith: {$sqrt: "$obj.obj.num"
                }, connectFromField: "obj.array", connectToField: "obj.obj.str", as: "array", maxDepth: 5
            }
        },
    ];

4.3 result set:

[
	{
		"_id" : 1037,
		"array" : [
			{
				"_id" : 1055,
				"obj" : {
					"obj" : {
						"_id" : 1058,
						"str" : null,
						"array" : [
							[ ]
						]
					}
				}
			},
			{
				"_id" : 1037,
				"array" : [
					[ ]
				]
			}
		]
	},
	{
		"_id" : 1055,
		"obj" : {
			"obj" : {
				"_id" : 1058,
				"str" : null,
				"array" : [
					[ ]
				]
			}
		},
		"array" : [
			{
				"_id" : 1055,
				"obj" : {
					"obj" : {
						"_id" : 1058,
						"str" : null,
						"array" : [
							[ ]
						]
					}
				}
			}
		]
	}
]

4.2 result set:

[
	{
		"_id" : 1055,
		"obj" : {
			"obj" : {
				"_id" : 1058,
				"str" : null,
				"array" : [
					[ ]
				]
			}
		},
		"array" : [
			{
				"_id" : 1037,
				"array" : [
					[ ]
				]
			},
			{
				"_id" : 1055,
				"obj" : {
					"obj" : {
						"_id" : 1058,
						"str" : null,
						"array" : [
							[ ]
						]
					}
				}
			}
		]
	},
	{
		"_id" : 1037,
		"array" : [
			{
				"_id" : 1055,
				"obj" : {
					"obj" : {
						"_id" : 1058,
						"str" : null,
						"array" : [
							[ ]
						]
					}
				}
			}
		]
	}
]

Participants:

 Description   

When the "connect from" document contains null but the "connect to" document is missing, the "connect to" document is not correctly inserted into DocumentSourceGraphLookup::_cache. This can result in incorrect query results when there is a subsequent "connect from" value of null.

This issue was originally detected by the multiversion agg fuzzer, since it appears that the incorrect result sets can manifest differently in different versions. There is a simpler repro in this comment below and a detailed description of the issue in this comment.

Original description

I've attached a somewhat minimal agg-fuzzer repro and a file with 4.3/4.2 explain outputs. The results for 4.3/4.2 contain the same values, but they are being assigned to different fields.



 Comments   
Comment by Nicholas Zolnierz [ 08/Jul/22 ]

I've verified that this issue is fixed by SERVER-60755, which landed in 5.2 and back ported to 5.1.

Comment by Asya Kamsky [ 22/Jul/20 ]

david.storch I think the problem starts before the cache and its optimization (and you made that observation in the second comment). The original match from startWith to connectToField is performed incorrectly.

> The reason that we don't return the correct result set is indeed because of DocumentSourceGraphLookup::_cache.
Is the cache involved when we are making the first match?

> The correct result is returned for the first bread-first search.

I don't think so, unless you're saying that the first breadth-first search result is not visible in the result set once we are done?

MQL semantics say missing and null compare equal in match expressions. (Using your two documents from first comment):

db.c.aggregate({$lookup:{from:"c", localField:"foo", foreignField:"bar",as:"result"}})
{ "_id" : 1, "result" : [ { "_id" : 1 }, { "_id" : 2, "nullField" : null } ] }
{ "_id" : 2, "nullField" : null, "result" : [ { "_id" : 1 }, { "_id" : 2, "nullField" : null } ] }
db.c.aggregate({$lookup:{from:"c", localField:"nullField", foreignField:"bar",as:"result"}})
{ "_id" : 1, "result" : [ { "_id" : 1 }, { "_id" : 2, "nullField" : null } ] }
{ "_id" : 2, "nullField" : null, "result" : [ { "_id" : 1 }, { "_id" : 2, "nullField" : null } ] }

The above result is correct because whether local field doesn't exist or it's null, since foreignField doesn't exist, all documents match them.

$graphLookup with depth to 0 should be exactly the same as $lookup (note: these results are stable regardless of sort order of incoming documents)

db.c.aggregate({$graphLookup:{"from" : "c", "startWith" :"$bar", "connectFromField" :"foo", "connectToField" :"bar", "as" : "result", "depthField" : "d", "maxDepth" :0}})
{ "_id" : 1, "result" : [ ] }
{ "_id" : 2, "nullField" : null, "result" : [ ] }
db.c.aggregate({$graphLookup:{"from" : "c", "startWith" :"$nullField", "connectFromField" :"foo", "connectToField" :"bar", "as" : "result", "depthField" : "d", "maxDepth" :0}})
{ "_id" : 1, "result" : [ ] }
{ "_id" : 2, "nullField" : null, "result" : [ { "_id" : 1, "d" : NumberLong(0) }, { "_id" : 2, "nullField" : null, "d" : NumberLong(0) } ] }
db.c.aggregate({$graphLookup:{"from" : "c", "startWith" :null, "connectFromField" :"foo", "connectToField" :"bar", "as" : "result", "depthField" : "d", "maxDepth" :0}})
{ "_id" : 1, "result" : [ { "_id" : 2, "nullField" : null, "d" : NumberLong(0) }, { "_id" : 1, "d" : NumberLong(0) } ] }
{ "_id" : 2, "nullField" : null, "result" : [ { "_id" : 2, "nullField" : null, "d" : NumberLong(0) }, { "_id" : 1, "d" : NumberLong(0) } ] }

note: in your example you connected to nullField - I'm connecting to non-existent field.

It appears when we start with "missing", we never match anything in the initial $graphLookup and when we start with null we match both missing and null values. The middle example starts with value of a field that's missing in _id:1 and null in _id:2. This all happens before the cache is involved I believe. I noticed this because the results in both sort orders in your example are wrong...

So are there in fact two bugs? Original comparison being wrong and one involving the cache? It seems like original query is incorrect

A test with a single document in the collection shows that when "$startWith" is null it connects to either missing or null value in connectToField, but when "$startWith" is missing it connects to nothing. So are there two bugs here? Or fundamentally are they one?

Comment by David Storch [ 08/Jan/20 ]

The problem appears to be even more fundamental than what I described above – neither 4.2 nor 4.3 is returning the correct result set! The startWith value is null, and according to MQL semantics an $eq:null predicate matches both documents where the field is missing and where the field stores a literal null. In the document with _id:1 the connectToField is missing; in the document with _id:2 it stores a literal null. Therefore, both documents _id:1 and _id:2 should have a results array of length 2 containing _id:1 and _id:2.

The reason that we don't return the correct result set is indeed because of DocumentSourceGraphLookup::_cache. This data structure stores a map from Value to vector<Document> in memory. If a key is present in the data structure, then we assume that the associated vector<Document> should store all join partners. The bug involves a failure to correctly populate this cache, which results in missing join partners.

Let's consider what happens for the first query in my repro above, where the $graphLookup stage processes _id:1 first and then _id:2.

  1. During the bread-first search associated with _id:1 we construct a query to find all documents where nullField is equal to null. Due to the null semantics of MQL as described above, both _id:1 and _id:2 match and get recorded as visited.
  2. We call addToCache() for both join partners. We correctly record that the Value null joins with _id:2. However, we fail to record that null joins with _id:1. This leaves the cache with the incorrect mapping (null => [ {_id: 2, nullField: null} ]). The mapping in the cache should be (null => [ {_id: 1}, {_id: 2, nullField: null} ]).
  3. The correct result is returned for the first bread-first search. The next document, _id:2 is received from the child stage, and a second bread-first search begins.
  4. This second breadth-first search finds a mapping for null in the cache, and uses it to compute the query. But since the cache itself is incorrect, we end up with an incorrect results array for document _id:2.

I am flagging this ticket for re-triage now that the cause of the issue is better understood.

Comment by David Storch [ 08/Jan/20 ]

I made an interesting observation of a seemingly related bug, although I'm not 100% sure yet if it has the same root cause. Consider the following script:

(function() {
"use strict";
 
db.c.drop();
 
assert.commandWorked(db.c.insert([
    {_id: 1},
    {_id: 2, nullField: null},
]));
 
const pipeline = [
    {$sort: {_id: 1}},
    {$graphLookup: {
        from: "c",
        startWith: null,
        connectFromField: "nonExistentField",
        connectToField: "nullField",
        as: "result",
        maxDepth: 5
    }},
];
 
printjson(db.c.aggregate(pipeline).toArray());
pipeline[0].$sort._id = -1;
printjson(db.c.aggregate(pipeline).toArray());
}());

Run against a recent version of master, this produces the following output:

[
	{
		"_id" : 1,
		"result" : [
			{
				"_id" : 2,
				"nullField" : null
			},
			{
				"_id" : 1
			}
		]
	},
	{
		"_id" : 2,
		"nullField" : null,
		"result" : [
			{
				"_id" : 2,
				"nullField" : null
			}
		]
	}
]
[
	{
		"_id" : 2,
		"nullField" : null,
		"result" : [
			{
				"_id" : 2,
				"nullField" : null
			},
			{
				"_id" : 1
			}
		]
	},
	{
		"_id" : 1,
		"result" : [
			{
				"_id" : 2,
				"nullField" : null
			}
		]
	}
]

The two queries are identical aside from the $sort order they specify, but they produce fundamentally different result sets. In the first query, _id:1 joints with both itself and _id:2. In the second query, however, _id:1 joins only with _id:2. (There is a similar anomaly if we examine the join partners of _id:2 in each query.) This is a query correctness bug, since a $sort preceding a $graphLookup should not change the results of the graph search for each document.

The bug appears to be related to how the caching optimization in DocumentSourceGraphLookup is implemented for "missing" values. I need to dig deeper to understand the problem more precisely.

Generated at Thu Feb 08 05:08:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.