[SERVER-10991] $in for large array of regexps is slow Created: 01/Oct/13 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 2.4.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | query-44-grooming | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Assigned Teams: |
Query Execution
|
||||
| Operating System: | ALL | ||||
| Steps To Reproduce: | In the attached script run create at least once, then run set qn and run query_regexp for various values of qn. |
||||
| Participants: | |||||
| Description |
|
Observed very long running queries involving $in matching a large array of regexps:
Attached script reproduces the issue. Time to run the query is quadratic or worse in the number of regexps; in following table n is number of regexps. Note that this run time is independent of the number of documents in the collection; note for example that in the reproducing script there is only a single document in the collection, and it does not match any of the regexps.
Simple gdb-based sampling profiler shows a single thread spending most of its time for the duration of the query in code that appears to be related to turning the regexps (all of which start with ^) into a set of ranges. Could this algorithm be quadratic in the number of regexps? Here's the portion of the call tree showing where the thread is spending its time; first column is average # of threads in that node of the call tree.
Note that writes are blocked while this is happening. Also please note that this has nothing to do with indexes: all the time is being spent preparing the query for execution, as per the above profile, and removing the index (which was in fact used) in the attached reproducing script does not change the behavior (as expected, since in the reproducing script no documents are in fact being matched.) Similar query using a large number of strings in a $in match did not have the same issue (see query_string in script). |
| Comments |
| Comment by Kevin Pulo [ 02/Oct/13 ] | |||||||||
|
As the non-linearity is in the number of regexps in the $in operator, but the sample query has very short regexps, a possible workaround is to combine the individual regexps into a fewer number of very long regexps. This is always possible by converting a collection of short regexps such as:
into the form:
I've attached a modified version of the reproducer which does this. The maximum length of a regexp is 32764, so for the types of regexps given, a collection of 20k regexps reduces to just 8 regexps and runs basically instantly. A collection of 100k regexps reduces to just 37, and even 1M individual regexps reduces to 367 regexps, taking only 1.5s to run (instead of forever if the full 1M regexps were passed directly to $in). |