[SERVER-65260] An index with non-simple collation should not be considered for predicates with a regex value Created: 05/Apr/22 Updated: 07/Feb/24 Resolved: 29/Jan/24 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Query Planning |
| Affects Version/s: | 6.0.0-rc0, 5.3.1 |
| Fix Version/s: | 8.0.0-rc0 |
| Type: | Task | Priority: | Minor - P4 |
| Reporter: | Kyle Suarez | Assignee: | Carlos Alonso Pérez |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Assigned Teams: |
Query Optimization
|
||||
| Backwards Compatibility: | Fully Compatible | ||||
| Sprint: | QO 2024-02-05 | ||||
| Participants: | |||||
| Description |
|
Neither the existing pcre library nor the new pcre2 library support ICU collation. Therefore predicates that contain a regex value/s should not be eligible to use indexes with non-simple collations for index scans, as the index will not be selective and all strings in the index will be used: For example, notice this explain output and the final index bounds used:
|
| Comments |
| Comment by Githook User [ 29/Jan/24 ] |
|
Author: {'name': 'Carlos Alonso', 'email': 'calonso@users.noreply.github.com', 'username': 'calonso'}Message: GitOrigin-RevId: b28efbd0d6ff5f7c8641ab66551ae0f5f9836a6f |
| Comment by Chris Harris [ 05/Apr/22 ] |
|
I agree with both of those points - that it could be faster if the distribution of values is right, but that it's probably uncommon for that to be the situation since the user is asking for regex matching on the field. Future optimizers could factor this in, but my opinion is that we should disallow this for now. My suspicion is that the user could probably include a $type predicate (or hint, etc) if they wanted to use an index in this type of situation. |
| Comment by Kyle Suarez [ 05/Apr/22 ] |
|
I tested this on latest master but I assume we've had this behavior since the beginning. After thinking about this a bit more, scanning all the string values in the index would still be faster than a COLLSCAN if the queried field only has a few string values compared to other types, but I doubt that that is particularly common in practice. |