[SERVER-18734] The match $in operator is using a ValueSet(std::set) Created: 29/May/15  Updated: 04/Nov/15  Resolved: 04/Nov/15

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Antoine Hom Assignee: Charlie Swanson
Resolution: Done Votes: 1
Labels: optimization
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-18733 Streamline set cache optimization for... Backlog
Operating System: ALL
Participants:

 Description   

While implementing a feature to handle CSV like input of the form:

A,B,C // header
1,2,3
4,5,6
etc...

We naively implemented it with the following $match condition:

$or: [
    { A: 1, B: 2, C: 3},
    { A: 4, B: 5, C: 6},
    etc...
]

After seeing bad performances/scalability of this approach we tried two alternatives (these are in an aggregation pipeline):

  • One with $in:

$project: {
    computed_obj: { "1": "$A", "2": "$B", "3": "$C" }
},
$match: {
    computed_obj: { 
        $in: [
            { "1": 1, "2": 2, "3": 3 },
            { "1": 3, "2": 4, "3": 5 },
            etc...
        ]
    }
}

  • One with $setIsSubset:

$project: {
    condition_value: {
        $setIsSubset: [
            {
                $map: {
                    input: [null], 
                    as: "var__", 
                    in { "1": "$A", "2": "$B", "3": "$C" }
                }
            }, 
            [
               {"1": 1, "2": 2, "3": 3},
               {"1": 3, "2": 4, "3": 5},
               etc...
            ]
        ]
    }
}, 
$match: { condition_value: true }

We found that when starting to have big enough sets the $in approach was in fact slower and not even the same complexity than the $setIsSubset one.
We then noticed that $setIsSubset is using a std::unordered_set whereas $in is using a simple std::set.

Is there a reason why $in is using a std::set over an std::unordered_set?



 Comments   
Comment by Charlie Swanson [ 04/Nov/15 ]

It appears this ticket got lost for a while, sorry about that!

I think the difference between std::set and std::unordered_set is a bit of a red herring. After inspecting, it looks like this comes down to the internal representation of the documents. During the aggregation pipeline, we convert the BSON into an internal representation (A Document), more akin to a hash map. This representation is much faster to work with than BSON. During the $match stage, we use the same logic as a $match during a query, which processes BSON, so we have to convert the document to BSON in order to do the $match, hence, the slowdown.

I'm closing this ticket, as it would be a non-trivial amount of work to support matching on Documents, and there is a workaround (as you mentioned). Feel free to reopen if you disagree with this decision.

Comment by Antoine Hom [ 05/Jun/15 ]

Hello Ramon,

Sorry If I was not clear, this ticket explains the process that I went through to find two potential issues:

  • One is that the performance optimization done wrt constant set in $setIsSubset could be streamlined to all set operations (which is SERVER-18733).
  • The other one (which is this ticket) was to check if there were any plans to make the $in operator use a std::ordered_set rather than a std::set, because that would yield better performances (not the same complexity)
Comment by Ramon Fernandez Marina [ 01/Jun/15 ]

antoine.hom@amadeus.com, if I understand correctly SERVER-18733 is a superset of this ticket (no pun intended). If this is not the case, can you please elaborate what's different about this ticket and how a resolution of SERVER-18733 (which is tentatively scheduled for the current development cycle) may fail to address this ticket?

Thanks,
Ramón.

Generated at Thu Feb 08 03:48:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.