[SERVER-76060] sbe::value::serializeValue/deserializeValue omits ArraySet collator Created: 13/Apr/23  Updated: 30/Oct/23  Resolved: 10/Jul/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Ivan Fefer Assignee: Rui Liu
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Assigned Teams:
Query Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0, v6.0
Sprint: QE 2023-05-29, QE 2023-06-26, QE 2023-07-10, QE 2023-07-24
Participants:

 Description   

When we serialize ArraySet on src/mongo/db/exec/sbe/values/row.cpp#L276 we don't include any information about collation. 

So when we deserialize it on L118 we create new ArraySet without collation, which can potentially lead to problems.



 Comments   
Comment by Githook User [ 27/Oct/23 ]

Author:

{'name': 'Rui Liu', 'email': 'lriuui0x0@gmail.com', 'username': 'lriuui0x0'}

Message: SERVER-81390 Use collator to create the record key in hash_agg

SERVER-76060 Deserialize SBE ArraySet with collator

(cherry picked from commit a19074b842b752ee0a61810e0b8f6d79c5aa80c1)

SERVER-81390 Use collator to create the record key in hash_agg

(cherry picked from commit d0811e844e8566dc276fcd73fceabec71c0e2717)
Branch: v6.0
https://github.com/mongodb/mongo/commit/62d51d7f2698547635f4df12049f70ba283b74b7

Comment by Githook User [ 27/Oct/23 ]

Author:

{'name': 'Rui Liu', 'email': 'lriuui0x0@gmail.com', 'username': 'lriuui0x0'}

Message: SERVER-81390 Use collator to create the record key in hash_agg

SERVER-76060 Deserialize SBE ArraySet with collator

(cherry picked from commit a19074b842b752ee0a61810e0b8f6d79c5aa80c1)

SERVER-81390 Use collator to create the record key in hash_agg

(cherry picked from commit d0811e844e8566dc276fcd73fceabec71c0e2717)
Branch: v7.0
https://github.com/mongodb/mongo/commit/62c4770285c731eb38bc0148c2f2d2fbc9f43dcb

Comment by Githook User [ 10/Jul/23 ]

Author:

{'name': 'Rui Liu', 'email': 'lriuui0x0@gmail.com', 'username': 'lriuui0x0'}

Message: SERVER-76060 Deserialize SBE ArraySet with collator
Branch: master
https://github.com/mongodb/mongo/commit/a19074b842b752ee0a61810e0b8f6d79c5aa80c1

Comment by David Storch [ 01/Jun/23 ]

As discussed over Slack, rui.liu@mongodb.com is going to write a script to prove that there is a real bug here. Something like $group with an $addToSet that has to spill and where the query uses a case-insensitive collation. In this case, after merging spilled partial aggregates, the $addToSet value could end up with strings that only differ by their case, meaning that the case-insensitive collation was not respected.

Comment by Rui Liu [ 12/May/23 ]

rushan.chen@mongodb.com Sorry missed the comment. Nope, the accumulator we plan to implement will not use those data structures and the collator is not saved in the state.

Comment by Ivan Fefer [ 19/Apr/23 ]

So the answer is "it depends".

Previous similar bug SERVER-75752 was only a performance issue, as ArraySet is built from already unique elements (from query collator point of view) and VM would fall back to linear search if collator passed to collIsMember doesn't match array set collator.

Here we are dealing with general serialization/deserialization function, which doesn't return exactly the same object after serialization/deserialization.

So imagine SBE code that does the following:

  1. Creates ArraySet with "case insensitive" collator.
  2. Add string "A" to ArraySet
  3. Serialize
  4. Deserialize
  5. Add string "a".
  6. We could reasonably expect ArraySet to contain only one string, but it will actually contain 2.

However, this actually doesn't happen: to my knowledge there is no valid MQL query that will produce something like that.
Current when we serialize ArraySet's, we only full scanning them after deserialization to do something like setUnion.
And for this case, actually omitting collation gives us performance boost, because we don't have to deal with collation keys, when re-building ArraySet.

So to conclude, current version works and it performs better than if we support collation, but it is not "future-proof".

Maybe we can "future-proof" it by explicitly converting ArraySet to Array when serializing, so this behavior would be expected and documented in code and unit tests?

Comment by Kyle Suarez [ 18/Apr/23 ]

After discussion in the triage meeting, we weren't sure this was a bug. david.storch@mongodb.com thought that if we omit a collator then we would no longer use the hash table and instead you would need to walk the data structure linearly?

ivan.fefer@mongodb.com, is this a correctness issue or are you thinking of this as a performance improvement?

Generated at Thu Feb 08 06:31:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.