[SERVER-66310] Make ExpressionSetUnion::isCommutative() collation aware Created: 09/May/22  Updated: 29/Oct/23  Resolved: 27/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.13, 5.0.7, 4.2.20, 6.0.0-rc5, 6.1.0-rc0
Fix Version/s: 4.2.23, 4.4.17, 6.0.1, 5.0.11, 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Timour Katchaounov Assignee: James Wahlin
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File fuzz-reduced.js    
Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0, v5.0, v4.4, v4.2
Steps To Reproduce:

To reproduce, run the following:

db.coll.drop();
var coll = db.coll;
 
coll.insertOne({_id: 0, time: new Date("2019-10-18T18:40:14.299Z") });
 
const pl = [ {$group: {_id: {$setUnion: [["\u0019"], [{$reduce: {input: ["xyz"], initialValue: "aaa", in: ""}}], [[{}]]]}}} ];
const collation = {locale: 'en_US', strength: 2, }
 
db.adminCommand({'configureFailPoint': 'disablePipelineOptimization', 'mode': 'off'});
 
db.runCommand({aggregate: "coll", pipeline: pl, cursor: {}, collation: collation});
db.runCommand({aggregate: "coll", pipeline: pl, cursor: {}, collation: collation, explain: true});
 
db.adminCommand({'configureFailPoint': 'disablePipelineOptimization', 'mode': 'alwaysOn'});
db.runCommand({aggregate: "coll", pipeline: pl, cursor: {}, collation: collation});
db.runCommand({aggregate: "coll", pipeline: pl, cursor: {}, collation: collation, explain: true});

Sprint: QO 2022-05-16, QO 2022-05-30
Participants:
Linked BF Score: 18

 Description   

The $setUnion aggregation expression is currently defined to be always commutative. This breaks down when a collation is in place that can compare two values with different binary representations as equal. We should consider making ExpressionSetUnion::isCommutative() return false when a non-simple collation is in place.



 Comments   
Comment by Githook User [ 11/Aug/22 ]

Author:

{'name': 'James Wahlin', 'email': 'james@mongodb.com', 'username': 'jameswahlin'}

Message: SERVER-66310 Make ExpressionSetUnion::isCommutative() collation aware
Branch: v4.2
https://github.com/mongodb/mongo/commit/88f6fb33c3608ed20b55a7e0566815886a9d45f5

Comment by Alya Berciu [ 11/Aug/22 ]

The backport to 5.0 was completed a while ago, but for some reason there was no comment added by the bot (commit).

Comment by Githook User [ 11/Aug/22 ]

Author:

{'name': 'James Wahlin', 'email': 'james@mongodb.com', 'username': 'jameswahlin'}

Message: SERVER-66310 Make ExpressionSetUnion::isCommutative() collation aware
Branch: v4.4
https://github.com/mongodb/mongo/commit/c4771eda44b12596546ce97d5a7ddc28b18e7cbf

Comment by Githook User [ 21/Jul/22 ]

Author:

{'name': 'James Wahlin', 'email': 'james@mongodb.com', 'username': 'jameswahlin'}

Message: SERVER-66310 Make ExpressionSetUnion::isCommutative() collation aware

(cherry picked from commit 2c53b7b684c8dd90044b8ef19932453088f54869)
Branch: v6.0
https://github.com/mongodb/mongo/commit/3c2e77f7098157bcb30aa50f2ce3e0b53bb49a01

Comment by Githook User [ 26/May/22 ]

Author:

{'name': 'James Wahlin', 'email': 'james@mongodb.com', 'username': 'jameswahlin'}

Message: SERVER-66310 Make ExpressionSetUnion::isCommutative() collation aware
Branch: master
https://github.com/mongodb/mongo/commit/2c53b7b684c8dd90044b8ef19932453088f54869

Comment by James Wahlin [ 16/May/22 ]

This turned out to be expected behavior; the difference is due to the order of insertion into a set under a non-simple collation. The following demonstrates the root of the problem:

var coll = db.coll;coll.drop();
coll.insertOne({});
 
var doc = db.runCommand({aggregate: "coll", pipeline: 
    [{$project: {a: {$setUnion: [["\u0001"], [""]]}, 
                 b: {$setUnion: [[""], ["\u0001"]]}}}], 
    cursor: {}, 
    collation: {locale: 'en_US'}}).cursor.firstBatch[0];
 
assert.eq(doc.a, doc.b);

This produces:

{
	"_id" : ObjectId("6282b578b04cc82b1630e89c"),
	"a" : [
		"\u0001"
	],
	"b" : [
		""
	]
}

The reason that "a" and "b" contain different values is that the collation (default strength 3) compares the empty string and the Unicode string "\u0001" (a control character) as equal. The first value inserted into the set is the one that wins.
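The "first value wins" behavior can be illustrated outside the server with a minimal plain-JavaScript sketch. It is only a model: collationKey is a hypothetical stand-in for a collation comparison key that drops control characters (it is not ICU and not the server's implementation), and setUnionFirstWins is an illustrative helper, not a MongoDB function.

```javascript
// Hypothetical stand-in for a collation comparison key (assumption, not ICU):
// control characters are ignored, so "" and "\u0001" map to the same key.
function collationKey(value) {
  return value.replace(/[\u0000-\u001f]/g, "");
}

// Illustrative union: for each comparison key, keep the first value seen.
function setUnionFirstWins(arrays, keyFn) {
  const seen = new Map();
  for (const arr of arrays) {
    for (const value of arr) {
      const key = keyFn(value);
      if (!seen.has(key)) {
        seen.set(key, value);
      }
    }
  }
  return Array.from(seen.values());
}

// Same operands, opposite order: each result holds one element,
// and which element survives depends on insertion order.
const a = setUnionFirstWins([["\u0001"], [""]], collationKey);
const b = setUnionFirstWins([[""], ["\u0001"]], collationKey);

// With an identity key (modeling the simple, binary collation) both
// values are kept regardless of operand order.
const simpleA = setUnionFirstWins([["\u0001"], [""]], (v) => v);
const simpleB = setUnionFirstWins([[""], ["\u0001"]], (v) => v);
```

In this model "a" holds only "\u0001" and "b" holds only "", mirroring the output above, while the identity-key results contain the same two members in either order.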

For the original reproducer, it looks like optimizing the pipeline changes the order of set insertion. The fix will likely be to change ExpressionSetUnion::isCommutative() to return false when a non-simple collation is in place.
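The proposed rule can be sketched as a small plain-JavaScript model. Everything here is hypothetical and for illustration only: reorderOperands models an optimizer step that may group constant operands together when an expression reports itself as commutative (placing them at the back is an assumption of this sketch), and setUnionIsCommutative models the suggested check. Neither is the server's actual code.

```javascript
// A collation is treated as simple when absent or when its locale is "simple".
function isSimpleCollation(collation) {
  return collation == null || collation.locale === "simple";
}

// Modeled rule from this ticket: $setUnion is commutative only when
// string comparison is binary (simple collation).
function setUnionIsCommutative(collation) {
  return isSimpleCollation(collation);
}

// Hypothetical optimizer step: if the expression is commutative, group the
// constant operands together (here, at the back); otherwise keep the order.
function reorderOperands(operands, commutative) {
  if (!commutative) {
    return operands.slice();
  }
  const nonConstants = operands.filter((op) => !op.isConstant);
  const constants = operands.filter((op) => op.isConstant);
  return nonConstants.concat(constants);
}

const operands = [
  {isConstant: true, value: [""]},
  {isConstant: false, value: ["\u0001"]},
];

// Non-simple collation: order is preserved, so set insertion order is stable.
const underEnUs = reorderOperands(operands, setUnionIsCommutative({locale: "en_US"}));

// Simple collation: reordering is allowed and cannot change set membership.
const underSimple = reorderOperands(operands, setUnionIsCommutative(null));
```

Under the en_US collation the operands keep their written order, so optimized and non-optimized evaluation insert values into the set in the same sequence.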

Comment by James Wahlin [ 16/May/22 ]

This reproduces under normal collections at least as far back as 4.2, which is when we introduced the disablePipelineOptimization fail-point. It is possible this exists further back.

Comment by James Wahlin [ 12/May/22 ]

It appears that the ordering of the elements in the $setUnion can impact the result on 6.0+. If you move the $reduce element to the front of the $setUnion array, then both the optimized and non-optimized pipelines produce the same result.
