[SERVER-65260] An index with non-simple collation should not be considered for predicates with a regex value Created: 05/Apr/22  Updated: 07/Feb/24  Resolved: 29/Jan/24

Status: Closed
Project: Core Server
Component/s: Query Planning
Affects Version/s: 6.0.0-rc0, 5.3.1
Fix Version/s: 8.0.0-rc0

Type: Task Priority: Minor - P4
Reporter: Kyle Suarez Assignee: Carlos Alonso Pérez
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File bench.js     PNG File image-2024-01-24-12-44-15-398.png     PNG File image-2024-01-24-12-44-23-817.png     PNG File image-2024-01-24-12-44-34-026.png     PNG File image-2024-01-24-12-44-58-896.png    
Issue Links:
Related
Assigned Teams:
Query Optimization
Backwards Compatibility: Fully Compatible
Sprint: QO 2024-02-05
Participants:

 Description   

Neither the existing pcre library nor the new pcre2 library support ICU collation. Therefore predicates that contain a regex value/s should not be eligible to use indexes with non-simple collations for index scans, as the index will not be selective and all strings in the index will be used:

For example, notice this explain output and the final index bounds used:

replSet:PRIMARY> db.coll.createIndex({x: 1}, {collation: {locale: "en", strength: 1}})
replSet:PRIMARY> db.coll.explain().find({x: {$regex: /^foo$/, $options: "i"}})
{
  "explainVersion" : "2",
  "queryPlanner" : {
    "namespace" : "newer.coll",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "x" : {
        "$regex" : "^foo$",
        "$options" : "i"
      }
    },
    ...
    "winningPlan" : {
      "queryPlan" : {
        "stage" : "FETCH",
        "planNodeId" : 2,
        "filter" : {
          "x" : {
            "$regex" : "^foo$",
            "$options" : "i"
          }
        },
        "inputStage" : {
          "stage" : "IXSCAN",
          "planNodeId" : 1,
          "keyPattern" : {
            "x" : 1
          },
          "indexName" : "x_1",
          "collation" : {
            "locale" : "en",
            "caseLevel" : false,
            "caseFirst" : "off",
            "strength" : 1,
            "numericOrdering" : false,
            "alternate" : "non-ignorable",
            "maxVariable" : "punct",
            "normalization" : false,
            "backwards" : false,
            "version" : "57.1"
          },
          ...
          "indexBounds" : {
            "x" : [
              "[CollationKey(0x), {})",
              "[/^foo$/i, /^foo$/i]"
            ]
          }
        ...



 Comments   
Comment by Githook User [ 29/Jan/24 ]

Author:

{'name': 'Carlos Alonso', 'email': 'calonso@users.noreply.github.com', 'username': 'calonso'}

Message: SERVER-65260 Restrict index usage to prefix only regex predicates (#17968)

GitOrigin-RevId: b28efbd0d6ff5f7c8641ab66551ae0f5f9836a6f
Branch: master
https://github.com/mongodb/mongo/commit/4175b8b0f2fb63a0901a3a7f61652205b5b8296f

Comment by Chris Harris [ 05/Apr/22 ]

I agree with both of those points - that it could be faster if the distribution of values is right, but that it's probably uncommon for that to be the situation since the user is asking for regex matching on the field.  Future optimizers could factor this in, but my opinion is that we should disallow this for now.  

My suspicion is that the user could probably include a $type predicate (or hint, etc) if they wanted to use an index in this type of situation.

Comment by Kyle Suarez [ 05/Apr/22 ]

I tested this on latest master but I assume we've had this behavior since the beginning.

After thinking about this a bit more, scanning all the string values in the index would still be faster than a COLLSCAN if the queried field only has a few string values compared to other types, but I doubt that that is particularly common in practice.

Generated at Thu Feb 08 06:02:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.