[SERVER-9938] Inefficient index boundary selected when using character class regex Created: 14/Jun/13 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 2.4.4 |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Minor - P4 |
| Reporter: | Jeff lee | Assignee: | Backlog - Query Optimization |
| Resolution: | Unresolved | Votes: | 3 |
| Labels: | storch | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Query Optimization
|
||||||||
| Participants: | |||||||||
| Case: | (copied to CRM) | ||||||||
| Description |
|
Hi, It seems that when you use a character class regex in a find operation, it results in a full index scan even when the character class is anchored. When the field is an array, it can result in a huge performance hit as each document is accessed multiple times for each indexed array element. Note how we have 8 scanned objects in the following example:
The workaround seems to be to specify each element of the character class individually:
|
| Comments |
| Comment by Guy Arad [ 11/Mar/19 ] | ||||
|
I'm experiencing a similar issue, but in my case, the wrong index is being chosen. I can understand why bounds are being expanded until the first meta character, but can't understand why the wrong index is being selected. I have these two indices:
Running this query
gives the expected performance (as seen with explain). Running this query
results in examining all the docs with the given app_key. The selected index was #2 above. Running the previous query, providing a hint for index #1 gives much better results, but still the keys examined is about 10 times the docs returned because of the bounds for the udid field: [ ["2", "3"), [/^2[bB]/, /^2[bB]/] ]. Using this query:
did not help. Actually, when using a hint of index #1 with this query, the bounds for `udid` were `[MinKey, MaxKey]` and resulted in the worst performance. Best performance achieved by not using the regex character class at all, and using `$in` instead:
Keys examined, docs examined and docs returned were all the same value. These were the bounds: `["2B", "2C"), ["2b", "2c"), [/^2B/, /^2B/], [/^2b/, /^2b/]`
(note: for the life of me, I couldn't get the code parts to work nicely in the commenting framework - sorry) |