SQL-style NULL semantics for equijoins are difficult to emulate with $lookup

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • Query Optimization
    • None
    • 3
    • TBD
    • None
    • None
    • None
    • None
    • None
    • None

      Consider the following script which I executed against SQLite:

      DROP TABLE IF EXISTS Users;
      DROP TABLE IF EXISTS UserRoles; 
      
      CREATE TABLE Users (  
          UserID INTEGER,  
          Name TEXT,  
          Email TEXT
      );
      
      INSERT INTO Users (UserID, Name, Email) VALUES (1, 'Alice', 'alice@example.com');  
      INSERT INTO Users (UserID, Name, Email) VALUES (2, 'Bob', NULL);  
      INSERT INTO Users (UserID, Name, Email) VALUES (NULL, 'Charlie', NULL);  
      
      CREATE TABLE UserRoles (  
          UserID INTEGER,  
          Role TEXT  
      );  
      
      INSERT INTO UserRoles (UserID, Role) VALUES (1, 'Software engineer');  
      INSERT INTO UserRoles (UserID, Role) VALUES (1, 'Manager');
      INSERT INTO UserRoles (UserID, Role) VALUES (NULL, 'Marketing');
      INSERT INTO UserRoles (UserID, Role) VALUES (NULL, 'Sales');
      
      SELECT * FROM UserRoles, Users WHERE Users.UserID = UserRoles.UserId;
      

      This is a simple equijoin query where the two joined columns may contain NULLs. The results of the query are:

      UserID	Role	UserID	Name	Email
      1	Software engineer	1	Alice	alice@example.com
      1	Manager	1	Alice	alice@example.com
      

      In SQL, NULL is not equal to NULL and thus the NULL values from the left-hand side do not join with the NULL values from the right-hand side.

      Now let's try to achieve the same thing in MQL with an aggregation pipeline. An equijoin in MQL is expressed with $lookup-$unwind. Here is an equivalent script in MQL, with three variations of the join query:

      db.Users.drop();
      db.UserRoles.drop();
      
      assert.commandWorked(db.Users.insert([
          {_id: 1, UserId: 1, Name: "Alice", Email: "alice@example.com"},
          {_id: 2, UserId: 2, Name: "Bob"},
          {_id: 3, Name: "Charlie"},
      ]));
      
      assert.commandWorked(db.UserRoles.insert([
          {_id: 1, UserId: 1, Role: "Software engineer"},
          {_id: 2, UserId: 1, Role: "Manager"},
          {_id: 3, Role: "Marketing"},
          {_id: 4, Role: "Sales"},
      ]));
      
      assert.commandWorked(db.Users.createIndex({UserId: 1}));
      assert.commandWorked(db.UserRoles.createIndex({UserId: 1}));
      
      // Query 1: Use localField/foreignField
      let pipeline1 = [
          {$lookup: {from: "UserRoles", localField: "UserId", foreignField: "UserId", as: "UserRoles"}},
          {$unwind: "$UserRoles"}
      ];
      let results = db.Users.aggregate(pipeline1).toArray();
      printjson(results);
      
      // Query 2: Use $match-$expr
      let pipeline2 = [
          {$lookup: {
              from: "UserRoles",
              as: "UserRoles",
              let: {myId: "$UserId"},
              pipeline: [
                  {$match: {$expr: {$eq: ["$UserId", "$$myId"]}}}
              ],
          }},
          {$unwind: "$UserRoles"}
      ];
      let results2 = db.Users.aggregate(pipeline2).toArray();
      printjson(results2);
      
      // Query 3: Filter out null and missing from the outer side to achieve SQL NULL semantics.
      let pipeline3 = [
          {$match: {UserId: {$ne: null}}},
          {$lookup: {from: "UserRoles", localField: "UserId", foreignField: "UserId", as: "UserRoles"}},
          {$unwind: "$UserRoles"}
      ];
      let results3 = db.Users.aggregate(pipeline3).toArray();
      printjson(results3);
      

      The first two versions of the query return the following results:

      [
      	{
      		"_id" : 1,
      		"UserId" : 1,
      		"Name" : "Alice",
      		"Email" : "alice@example.com",
      		"UserRoles" : {
      			"_id" : 1,
      			"UserId" : 1,
      			"Role" : "Software engineer"
      		}
      	},
      	{
      		"_id" : 1,
      		"UserId" : 1,
      		"Name" : "Alice",
      		"Email" : "alice@example.com",
      		"UserRoles" : {
      			"_id" : 2,
      			"UserId" : 1,
      			"Role" : "Manager"
      		}
      	},
      	{
      		"_id" : 3,
      		"Name" : "Charlie",
      		"UserRoles" : {
      			"_id" : 3,
      			"Role" : "Marketing"
      		}
      	},
      	{
      		"_id" : 3,
      		"Name" : "Charlie",
      		"UserRoles" : {
      			"_id" : 4,
      			"Role" : "Sales"
      		}
      	}
      ]
      

      Note how this differs from SQL NULL semantics – in MQL, the missing values join with other missing values! (The same is true if literal null values are used rather than missing values.) That isn't necessarily incorrect from MongoDB's perspective, but it complicates any effort to properly migrate a relational application to MongoDB. The third query demonstrates how to work around the problem by explicitly filtering null and missing values prior to the join. This workaround should suffice in most cases, but it adds complexity.

      In addition to the fact that the first two MQL queries don't match the SQL-style NULL semantics, there is also a performance implication. The second pipeline (using $match-$expr) has to perform collection scan on the inner side when the value from the outer side is missing, despite the presence of an appropriate index on the UserId column. The performance impact may not be immediately apparent because most of the time the query will use an index on the inner side. However, if the query ends up having to process a document where the value is missing on the outer side at query runtime it will end up doing a collection scan on the inner side. You can see that two index accesses and one collection scan are reported in the explain for pipeline 2:

      		{
      			"$lookup" : {
      				"from" : "UserRoles",
      				"as" : "UserRoles",
      				"let" : {
      					"myId" : "$UserId"
      				},
      				"pipeline" : [
      					{
      						"$match" : {
      							"$expr" : {
      								"$eq" : [
      									"$UserId",
      									"$$myId"
      								]
      							}
      						}
      					}
      				],
      				"unwinding" : {
      					"preserveNullAndEmptyArrays" : false
      				}
      			},
      			"totalDocsExamined" : NumberLong(6),
      			"totalKeysExamined" : NumberLong(2),
      
                               // Note the collection scan reported here!
      			"collectionScans" : NumberLong(1),
      
      			"indexesUsed" : [
      				"UserId_1"
      			],
      			"nReturned" : NumberLong(4),
      			"executionTimeMillisEstimate" : NumberLong(0)
      		}
      

      The likely task for this ticket is to improve MQL to make it simpler to express SQL NULL semantics for joins. That will prevent applications migrating from relational systems from running into the relatively subtle correctness or performance pitfalls described above.

            Assignee:
            Unassigned
            Reporter:
            David Storch
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated: