[SERVER-29967] The regex filter operation should be applied in the IXSCAN stage if possible Created: 03/Jul/17  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Chris Harris Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-43989 Regex query not matching correctly on... Closed
Assigned Teams:
Query Optimization
Participants:
Case:

 Description   

Currently when an index is multikey the filter step for the regex match gets bumped out of the IXSCAN stage and into the FETCH stage. The logical reason for this is not immediately clear and it seems that this may be unnecessary. The result of this is that we fetch and examine documents that won't match the regex even though we could have concluded that based on information available during the IXSCAN.

diff:PRIMARY> db.version()
3.4.3
diff:PRIMARY> c.createIndex({x:1})
{
    "createdCollectionAutomatically" : false,
    "numIndexesBefore" : 1,
    "numIndexesAfter" : 2,
    "ok" : 1
}
diff:PRIMARY> c.find({x:/1/}).explain().queryPlanner.winningPlan
{
    "stage" : "FETCH",
    "inputStage" : {
        "stage" : "IXSCAN",
        "filter" : {
            "x" : {
                "$regex" : "1"
            }
        },
        "keyPattern" : {
            "x" : 1
        },
        "indexName" : "x_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
            "x" : [ ]
        },
        "isUnique" : false,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
            "x" : [
                "[\"\", {})",
                "[/1/, /1/]"
            ]
        }
    }
}
diff:PRIMARY> c.insert({x:[1,2]})
WriteResult({ "nInserted" : 1 })
diff:PRIMARY> c.find({x:/1/}).explain().queryPlanner.winningPlan
{
    "stage" : "FETCH",
    "filter" : {
        "x" : {
            "$regex" : "1"
        }
    },
    "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
            "x" : 1
        },
        "indexName" : "x_1",
        "isMultiKey" : true,
        "multiKeyPaths" : {
            "x" : [
                "x"
            ]
        },
        "isUnique" : false,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
            "x" : [
                "[\"\", {})",
                "[/1/, /1/]"
            ]
        }
    }
}



 Comments   
Comment by Asya Kamsky [ 18/Sep/17 ]

Similar explain with 3.5.13:

db.c.explain("executionStats").find({x:/j/})
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "db1.c",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"x" : {
				"$regex" : "j"
			}
		},
		"winningPlan" : {
			"stage" : "FETCH",
			"filter" : {
				"x" : {
					"$regex" : "j"
				}
			},
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"x" : 1
				},
				"indexName" : "x_1",
				"isMultiKey" : true,
				"multiKeyPaths" : {
					"x" : [
						"x"
					]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"x" : [
						"[\"\", {})",
						"[/j/, /j/]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 0,
		"executionTimeMillis" : 0,
		"totalKeysExamined" : 10,
		"totalDocsExamined" : 4,
		"executionStages" : {
			"stage" : "FETCH",
			"filter" : {
				"x" : {
					"$regex" : "j"
				}
			},
			"nReturned" : 0,
			"executionTimeMillisEstimate" : 0,
			"works" : 11,
			"advanced" : 0,
			"needTime" : 10,
			"needYield" : 0,
			"saveState" : 0,
			"restoreState" : 0,
			"isEOF" : 1,
			"invalidates" : 0,
			"docsExamined" : 4,
			"alreadyHasObj" : 0,
			"inputStage" : {
				"stage" : "IXSCAN",
				"nReturned" : 4,
				"executionTimeMillisEstimate" : 0,
				"works" : 11,
				"advanced" : 4,
				"needTime" : 6,
				"needYield" : 0,
				"saveState" : 0,
				"restoreState" : 0,
				"isEOF" : 1,
				"invalidates" : 0,
				"keyPattern" : {
					"x" : 1
				},
				"indexName" : "x_1",
				"isMultiKey" : true,
				"multiKeyPaths" : {
					"x" : [
						"x"
					]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"x" : [
						"[\"\", {})",
						"[/j/, /j/]"
					]
				},
				"keysExamined" : 10,
				"seeks" : 1,
				"dupsTested" : 10,
				"dupsDropped" : 6,
				"seenInvalidated" : 0
			}
		}
	},
	"serverInfo" : {
		"host" : "Asyas-MacBook-Pro.local",
		"port" : 27017,
		"version" : "3.5.13",
		"gitVersion" : "52bbaa007cd84631d6da811d9a05b59f2dfad4f3"
	},
	"ok" : 1
}

Note (at least in current version) that left anchored regex that's case sensitive seems to correctly handle the query and only FETCH the documents that are candidates (so when there's 0 returned, it doesn't FETCH any documents).

127.0.0.1:27017(3.5.13) > db.c.explain("executionStats").find({x:/^j/})
{
	"queryPlanner" : {
		"plannerVersion" : 1,
		"namespace" : "db1.c",
		"indexFilterSet" : false,
		"parsedQuery" : {
			"x" : {
				"$regex" : "^j"
			}
		},
		"winningPlan" : {
			"stage" : "FETCH",
			"inputStage" : {
				"stage" : "IXSCAN",
				"keyPattern" : {
					"x" : 1
				},
				"indexName" : "x_1",
				"isMultiKey" : true,
				"multiKeyPaths" : {
					"x" : [
						"x"
					]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"x" : [
						"[\"j\", \"k\")",
						"[/^j/, /^j/]"
					]
				}
			}
		},
		"rejectedPlans" : [ ]
	},
	"executionStats" : {
		"executionSuccess" : true,
		"nReturned" : 0,
		"executionTimeMillis" : 0,
		"totalKeysExamined" : 1,
		"totalDocsExamined" : 0,
		"executionStages" : {
			"stage" : "FETCH",
			"nReturned" : 0,
			"executionTimeMillisEstimate" : 0,
			"works" : 2,
			"advanced" : 0,
			"needTime" : 1,
			"needYield" : 0,
			"saveState" : 0,
			"restoreState" : 0,
			"isEOF" : 1,
			"invalidates" : 0,
			"docsExamined" : 0,
			"alreadyHasObj" : 0,
			"inputStage" : {
				"stage" : "IXSCAN",
				"nReturned" : 0,
				"executionTimeMillisEstimate" : 0,
				"works" : 2,
				"advanced" : 0,
				"needTime" : 1,
				"needYield" : 0,
				"saveState" : 0,
				"restoreState" : 0,
				"isEOF" : 1,
				"invalidates" : 0,
				"keyPattern" : {
					"x" : 1
				},
				"indexName" : "x_1",
				"isMultiKey" : true,
				"multiKeyPaths" : {
					"x" : [
						"x"
					]
				},
				"isUnique" : false,
				"isSparse" : false,
				"isPartial" : false,
				"indexVersion" : 2,
				"direction" : "forward",
				"indexBounds" : {
					"x" : [
						"[\"j\", \"k\")",
						"[/^j/, /^j/]"
					]
				},
				"keysExamined" : 1,
				"seeks" : 2,
				"dupsTested" : 0,
				"dupsDropped" : 0,
				"seenInvalidated" : 0
			}
		}
	},
	"serverInfo" : {
		"host" : "Asyas-MacBook-Pro.local",
		"port" : 27017,
		"version" : "3.5.13",
		"gitVersion" : "52bbaa007cd84631d6da811d9a05b59f2dfad4f3"
	},
	"ok" : 1
}

Comment by David Storch [ 05/Jul/17 ]

After some investigation, I have determined that this behavior is intentional. I found a couple of places in the code where it is enforced during access path planning:

The reason is that the IXSCAN stage currently de-duplicates index keys pointing to the same record before it applies the covered filter to the key. These two steps happen in sequence here in IndexScan::doWork(). In order to demonstrate what would go wrong if we moved the filter from the FETCH to the IXSCAN stage, suppose that we have a document {arr: ["a", "b"]} with index {arr: 1}. The index would have two keys, "a" and "b", which point to the same record. Let's consider what happens when the client issues the query {arr: /b/}, looking for documents whose arr array contains a string with the character "b" in it. The IXSCAN would execute as follows:

  • It would first encounter key "a" and record its record id r in the set of seen documents.
  • After applying the regex to this key, it would find that the key does not match and would filter it out.
  • Next the scan encounters key "b". Since this key points to an already-seen document, it would throw out this key before applying the regex.
  • Consequently, the result set would be missing the matching result document.

It seems to me that we could fix this by only recording a RecordId as seen when the IXSCAN actually returns the key to its parent stage. If we filter out the key, then the document might still match by virtue of other keys, so we should not de-duplicate future keys with the same RecordId.

Generated at Thu Feb 08 04:22:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.