[SERVER-26621] Make in-memory sorts use CollatorInterface::compare() rather than CollatorInterface::getComparisonKey()
Created: 13/Oct/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Improvement
Priority: Major - P3
Reporter: David Storch
Assignee: Backlog - Query Execution
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-26129 Investigate perf overhead of collation Closed
Assigned Teams:
Query Execution
Participants:

 Description   

The CollatorInterface provides two mechanisms for comparing strings in a collation-aware fashion: compare() and getComparisonKey(). The former takes two strings and returns the result of the comparison. The latter returns an array of bytes such that memcmp against another comparison key yields the same results as compare().
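The relationship between the two mechanisms can be sketched as follows. This is a minimal illustration, not the server's code: ToyCollator is a hypothetical stand-in that implements a case-insensitive ASCII ordering rather than an ICU collation, and binaryCompare() and checkAgreement() are helper names invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for a collator: case-insensitive ASCII ordering.
// A real collator would delegate to ICU instead.
struct ToyCollator {
    // Direct comparison: walks both strings once and allocates nothing.
    int compare(const std::string& a, const std::string& b) const {
        const std::size_t n = std::min(a.size(), b.size());
        for (std::size_t i = 0; i < n; ++i) {
            const int ca = std::tolower(static_cast<unsigned char>(a[i]));
            const int cb = std::tolower(static_cast<unsigned char>(b[i]));
            if (ca != cb) return ca < cb ? -1 : 1;
        }
        if (a.size() == b.size()) return 0;
        return a.size() < b.size() ? -1 : 1;
    }

    // Comparison key: a byte string whose binary order matches compare().
    std::string getComparisonKey(const std::string& s) const {
        std::string key(s);
        for (char& ch : key)
            ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        return key;
    }
};

// Binary comparison of two keys: memcmp over the common prefix, then length.
int binaryCompare(const std::string& x, const std::string& y) {
    const std::size_t n = std::min(x.size(), y.size());
    const int r = std::memcmp(x.data(), y.data(), n);
    if (r != 0) return r < 0 ? -1 : 1;
    if (x.size() == y.size()) return 0;
    return x.size() < y.size() ? -1 : 1;
}

// Asserts that both mechanisms order the pair (a, b) identically.
void checkAgreement(const ToyCollator& c, const std::string& a, const std::string& b) {
    assert(c.compare(a, b) ==
           binaryCompare(c.getComparisonKey(a), c.getComparisonKey(b)));
}
```

With this toy collation, "apple" sorts before "Banana" under both compare() and a binary comparison of the two keys, even though a binary comparison of the raw strings puts "Banana" first.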

The ICU documentation (see the section "Sortkeys vs Comparison") notes that generating an ICU comparison key is many times more expensive than doing a direct comparison. Profiles captured from mongod's integration with ICU confirm this to be the case. However, memcmp is also cheaper than compare(). This means that comparison keys should be used when you expect to compare a string repeatedly (say, hundreds of times), whereas direct comparison should be used in other cases. For example, we generate and store comparison keys in indexes, since we want to be able to repeatedly make cheap comparisons against these keys.

When performing an in-memory SORT, we currently generate comparison keys for all of the strings to be sorted, and then sort them with memcmp. My experiments show that, especially for small in-memory sorts, it is faster to sort via direct comparison. It would probably take a very large in-memory sort to cross the threshold at which the cost of the repeated calls to CollatorInterface::compare() for the average element exceeds the cost of generating the comparison key for that element.
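The two sorting strategies can be contrasted with a small sketch. Everything here is an assumption made for illustration: CountingCollator is a hypothetical case-insensitive ASCII collator with call counters, and sortWithKeys() and sortWithCompare() are invented names, not functions in the server. The counters only show how often each mechanism is invoked; they say nothing about the real per-call costs, which depend on ICU.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical collator (case-insensitive ASCII) that counts how it is used.
struct CountingCollator {
    mutable std::size_t compareCalls = 0;
    mutable std::size_t keyCalls = 0;

    int compare(const std::string& a, const std::string& b) const {
        ++compareCalls;
        const std::size_t n = std::min(a.size(), b.size());
        for (std::size_t i = 0; i < n; ++i) {
            const int ca = std::tolower(static_cast<unsigned char>(a[i]));
            const int cb = std::tolower(static_cast<unsigned char>(b[i]));
            if (ca != cb) return ca < cb ? -1 : 1;
        }
        if (a.size() == b.size()) return 0;
        return a.size() < b.size() ? -1 : 1;
    }

    std::string getComparisonKey(const std::string& s) const {
        ++keyCalls;
        std::string key(s);
        for (char& ch : key)
            ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        return key;
    }
};

// Strategy A: generate one key per element, then sort by byte-wise key order.
std::vector<std::string> sortWithKeys(const CountingCollator& c,
                                      const std::vector<std::string>& in) {
    typedef std::pair<std::string, std::string> Keyed;  // (key, original)
    std::vector<Keyed> keyed;
    keyed.reserve(in.size());
    for (const std::string& s : in) keyed.emplace_back(c.getComparisonKey(s), s);
    std::stable_sort(keyed.begin(), keyed.end(),
                     [](const Keyed& x, const Keyed& y) { return x.first < y.first; });
    std::vector<std::string> out;
    out.reserve(keyed.size());
    for (const Keyed& k : keyed) out.push_back(k.second);
    assert(out.size() == in.size());
    return out;
}

// Strategy B: no keys; every comparison made by the sort calls the collator.
std::vector<std::string> sortWithCompare(const CountingCollator& c,
                                         const std::vector<std::string>& in) {
    std::vector<std::string> out(in);
    std::stable_sort(out.begin(), out.end(),
                     [&c](const std::string& a, const std::string& b) {
                         return c.compare(a, b) < 0;
                     });
    return out;
}
```

Both functions return the same ordering. Strategy A makes exactly one getComparisonKey() call per element, while strategy B makes on the order of n log n compare() calls in total, so the number of direct comparisons per element grows only logarithmically with the size of the sort.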



 Comments   
Comment by David Storch [ 13/Oct/16 ]

This work would require changes to the mongos query path. Right now, mongos can merge sorted streams from the shards without knowledge of the collation because mongod gives back collator-generated comparison keys inside its sortKey $meta projection field. Mongos can therefore obtain the collation ordering by making simple binary comparisons of the $meta-projected sort keys.
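The collation-unaware merge described above can be sketched as a k-way merge over precomputed sort keys. This is an illustrative sketch only: ShardResult and mergeByBinarySortKey() are hypothetical names, and the real mongos merging code is structured differently. The point is that the merger treats each sort key as opaque bytes and never consults a collator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Hypothetical shape of one shard result: the document plus the comparison
// key that the shard computed for it under the query's collation.
struct ShardResult {
    std::string sortKey;  // collator-generated bytes, opaque to the merger
    std::string doc;
};

// Merges already-sorted shard streams using only byte-wise key comparisons.
std::vector<std::string> mergeByBinarySortKey(
        const std::vector<std::vector<ShardResult>>& shards) {
    struct Cursor {
        std::size_t shard;
        std::size_t pos;
    };

    // Precondition: each shard stream is already ordered by its sort keys.
    for (const std::vector<ShardResult>& stream : shards) {
        assert(std::is_sorted(stream.begin(), stream.end(),
                              [](const ShardResult& x, const ShardResult& y) {
                                  return x.sortKey < y.sortKey;
                              }));
    }

    // Min-heap on the sort key at each cursor's current position.
    auto greater = [&shards](const Cursor& a, const Cursor& b) {
        return shards[a.shard][a.pos].sortKey > shards[b.shard][b.pos].sortKey;
    };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(greater)> heap(greater);

    for (std::size_t i = 0; i < shards.size(); ++i) {
        if (!shards[i].empty()) heap.push(Cursor{i, 0});
    }

    std::vector<std::string> out;
    while (!heap.empty()) {
        Cursor cur = heap.top();
        heap.pop();
        out.push_back(shards[cur.shard][cur.pos].doc);
        if (++cur.pos < shards[cur.shard].size()) heap.push(cur);
    }
    return out;
}
```

If the shards sorted via direct comparison and stopped returning comparison keys, a merger like this would have no bytes to compare, which is why the change would also affect the mongos query path.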
