Type: Bug
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: Aggregation Framework
While implementing a feature to handle CSV-like input of the form:
A,B,C // header
1,2,3
4,5,6
etc...
We naively implemented it with the following $match condition:
$or: [
    { A: 1, B: 2, C: 3 },
    { A: 4, B: 5, C: 6 },
    etc...
]
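As a minimal sketch of how such a condition could be generated from the input, assuming a header line followed by purely numeric rows and no inline comments (the helper name `buildOrMatch` is hypothetical, not part of any MongoDB API):

```javascript
// Hypothetical helper: parse a small CSV string (header + numeric rows)
// into one equality clause per row, combined under $or.
function buildOrMatch(csvText) {
  const lines = csvText
    .trim()
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  const header = lines[0].split(",").map((name) => name.trim());
  const clauses = lines.slice(1).map((line) => {
    const cells = line.split(",").map((cell) => Number(cell.trim()));
    const clause = {};
    header.forEach((field, i) => {
      clause[field] = cells[i];
    });
    return clause;
  });
  return { $match: { $or: clauses } };
}

const stage = buildOrMatch("A,B,C\n1,2,3\n4,5,6");
console.log(JSON.stringify(stage));
// {"$match":{"$or":[{"A":1,"B":2,"C":3},{"A":4,"B":5,"C":6}]}}
```

The generated stage grows by one clause per input row, which is the part that scaled badly for us.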
After seeing how poorly this approach performed and scaled, we tried two alternatives (both inside an aggregation pipeline):
- One with $in:
{ $project: {
    computed_obj: { "1": "$A", "2": "$B", "3": "$C" }
} },
{ $match: {
    computed_obj: {
        $in: [
            { "1": 1, "2": 2, "3": 3 },
            { "1": 4, "2": 5, "3": 6 },
            etc...
        ]
    }
} }
- One with $setIsSubset:
{ $project: {
    condition_value: {
        $setIsSubset: [
            {
                $map: {
                    input: [null],
                    as: "var__",
                    in: { "1": "$A", "2": "$B", "3": "$C" }
                }
            },
            [
                { "1": 1, "2": 2, "3": 3 },
                { "1": 4, "2": 5, "3": 6 },
                etc...
            ]
        ]
    }
} },
{ $match: { condition_value: true } }
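Both alternatives can be generated from the same column names and rows. The following is a sketch under the assumption that columns are keyed by their 1-based position, as in the snippets above; all function names are hypothetical and only build plain objects.

```javascript
// Hypothetical helpers; the names are illustrative, not part of any MongoDB API.

// Turn a row like [1, 2, 3] into { "1": 1, "2": 2, "3": 3 }.
function rowToKey(row) {
  const key = {};
  row.forEach((value, i) => {
    key[String(i + 1)] = value;
  });
  return key;
}

// Map column names like ["A", "B", "C"] to { "1": "$A", "2": "$B", "3": "$C" }.
function fieldsToKeyExpr(fields) {
  const expr = {};
  fields.forEach((field, i) => {
    expr[String(i + 1)] = "$" + field;
  });
  return expr;
}

// Variant 1: project a computed object, then match it with $in.
function buildInPipeline(fields, rows) {
  return [
    { $project: { computed_obj: fieldsToKeyExpr(fields) } },
    { $match: { computed_obj: { $in: rows.map(rowToKey) } } },
  ];
}

// Variant 2: wrap the computed object in a one-element array via $map,
// then test it against the list of rows with $setIsSubset.
function buildSetIsSubsetPipeline(fields, rows) {
  return [
    {
      $project: {
        condition_value: {
          $setIsSubset: [
            { $map: { input: [null], as: "var__", in: fieldsToKeyExpr(fields) } },
            rows.map(rowToKey),
          ],
        },
      },
    },
    { $match: { condition_value: true } },
  ];
}

const pipeline = buildInPipeline(["A", "B", "C"], [[1, 2, 3], [4, 5, 6]]);
console.log(JSON.stringify(pipeline, null, 2));
```

Either array could then be passed to `aggregate()` on the target collection. Only the size of the row list changes between runs, which makes the two variants easy to compare as the input grows.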
We found that once the sets became large enough, the $in approach was in fact slower than the $setIsSubset one, and did not even appear to have the same complexity.
We then noticed that $setIsSubset uses a std::unordered_set, whereas $in uses a plain std::set.
Is there a reason why $in uses a std::set rather than a std::unordered_set?
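To illustrate why the container choice could matter, here is a JavaScript analogy only, not the server's C++ code. A sorted array with binary search stands in for an ordered set such as std::set (logarithmic lookup), and a `Set` of strings stands in for a hash set such as std::unordered_set (constant-time lookup on average). The row serialization and all names are assumptions made for the example.

```javascript
// Illustrative analogy only; none of this is MongoDB server code.

// Serialize a row so it can be ordered and hashed as a single string.
function rowKey(row) {
  return row.join(",");
}

// Binary search over sorted keys, counting one probe per loop iteration.
function orderedLookup(sortedKeys, key) {
  let lo = 0;
  let hi = sortedKeys.length - 1;
  let comparisons = 0;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    comparisons += 1;
    if (sortedKeys[mid] === key) return { found: true, comparisons };
    if (sortedKeys[mid] < key) lo = mid + 1;
    else hi = mid - 1;
  }
  return { found: false, comparisons };
}

// Build n distinct rows of the form [i, i + 1, i + 2].
const n = 1 << 16;
const keys = [];
for (let i = 0; i < n; i++) keys.push(rowKey([i, i + 1, i + 2]));

const sortedKeys = [...keys].sort(); // ordered structure
const hashedKeys = new Set(keys); // hashed structure

const probe = rowKey([n - 1, n, n + 1]);
const ordered = orderedLookup(sortedKeys, probe);
// The number of probes grows with log2(n): at most 17 for 65536 keys.
console.log(ordered.found, ordered.comparisons);
// The hashed lookup needs one hash plus, on average, a constant number of checks.
console.log(hashedKeys.has(probe));
```

In the ordered case every probe is a full key comparison, so when the keys are whole documents the per-lookup cost is multiplied by roughly log2(n), which would be consistent with the difference in scaling we observed.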
Related to: SERVER-18733 Streamline set cache optimization for set operations