[SERVER-67715] Change stream reader requires double escaping regexes Created: 30/Jun/22  Updated: 29/Oct/23  Resolved: 19/Sep/22

Status: Closed
Project: Core Server
Component/s: Change streams
Affects Version/s: None
Fix Version/s: 6.0.3, 6.1.0-rc3, 6.2.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Vishnu Kaushik Assignee: Kyle Suarez
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
Backwards Compatibility: Fully Compatible
Backport Requested:
v6.1, v6.0
Sprint: QE 2022-09-19, QE 2022-10-03
Participants:

 Description   

We can open a change stream to watch events from non-system collections in a database by using the regular expression "^system\." (starts with "system", followed by an escaped "."). All code below was run in the mongo shell connected to a single-node replica set.

> t = db.watch([{$match: {"ns.coll": {$nin: [/^system\./]}}}])

However, this is NOT returning events from collections that start with "system" but are not actually system collections.

Strangely, the same regex works correctly when used in listCollections (or in an aggregate with $listCatalog) to show all non-system collections ("system_js" is printed):

rs:PRIMARY> db.runCommand({listCollections: 1, filter: {name: {$nin: [/^system\./]}}})
{
	"cursor" : {
		"id" : NumberLong(0),
		"ns" : "test.$cmd.listCollections",
		"firstBatch" : [
			{
				"name" : "system_js",
				"type" : "collection",
				"options" : {
					
				},
				"info" : {
					"readOnly" : false,
					"uuid" : UUID("28c72e5c-204a-4943-b08c-b101413a2ebc")
				},
				"idIndex" : {
					"v" : 2,
					"key" : {
						"_id" : 1
					},
					"name" : "_id_"
				}
			}
		]
	},
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1656618172, 2),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	},
	"operationTime" : Timestamp(1656618172, 2)
}
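
For reference, the behavior expected from $nin with this regex can be sketched in plain JavaScript, outside the server. This is only an illustration: the collection names and the nonSystem helper are assumptions, not part of the original report.

```javascript
// Hypothetical collection names; only "system.js" starts with the "system." prefix.
const names = ["system.js", "system_js", "systemx", "users"];

// Mimics {$nin: [/^system\./]}: keep names that do NOT start with "system.".
function nonSystem(collNames) {
  const systemPrefix = /^system\./;
  return collNames.filter((name) => !systemPrefix.test(name));
}

console.log(nonSystem(names));
```

Under this model "system_js" is kept, which matches the listCollections output above and is what the change stream filter was expected to do as well.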

In the change stream case, it seems like we have to escape the backslash as well and use "^system\\." instead.

> t = db.watch([{$match: {"ns.coll": {$nin: [/^system\\./]}}}])
> t
{ "_id" : { "_data" : "8262BDF908000000012B042C0100296E5A1004CDB18D68D04F43BF868D66FC9DB28169463C6F7065726174696F6E54797065003C696E736572740046646F63756D656E744B65790046645F6964006462BDF908A431B84AA323A03E000004" }, "operationType" : "insert", "clusterTime" : Timestamp(1656617224, 1), "wallTime" : ISODate("2022-06-30T19:27:04.556Z"), "fullDocument" : { "_id" : ObjectId("62bdf908a431b84aa323a03e"), "b" : 1 }, "ns" : { "db" : "test", "coll" : "system_js" }, "documentKey" : { "_id" : ObjectId("62bdf908a431b84aa323a03e") } }

See comments for more info.



 Comments   
Comment by Githook User [ 30/Sep/22 ]

Author:

{'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com', 'username': 'ksuarz'}

Message: SERVER-67715 escape $changeStream regex

(cherry picked from commit c9fe899fff347770c0e30fa0272f6157be6676a8)
Branch: v6.0
https://github.com/mongodb/mongo/commit/fe98d0249a6be8e8122994bf76d789ed98e3a8a0

Comment by Githook User [ 19/Sep/22 ]

Author:

{'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com', 'username': 'ksuarz'}

Message: SERVER-67715 escape $changeStream regex

(cherry picked from commit c9fe899fff347770c0e30fa0272f6157be6676a8)
Branch: v6.1
https://github.com/mongodb/mongo/commit/e8ed64ed0440f576fb0c62d0be61f2f5f0749765

Comment by Githook User [ 19/Sep/22 ]

Author:

{'name': 'Kyle Suarez', 'email': 'kyle.suarez@mongodb.com', 'username': 'ksuarz'}

Message: SERVER-67715 escape $changeStream regex
Branch: master
https://github.com/mongodb/mongo/commit/c9fe899fff347770c0e30fa0272f6157be6676a8

Comment by Kyle Suarez [ 13/Sep/22 ]

britt.snyman@mongodb.com, this is a query correctness bug and I think we should keep this as a release blocker. CC bernard.gorman@mongodb.com

Comment by Wenbin Zhu [ 06/Jul/22 ]

Yeah I think 6.0.1 is fine.

Comment by Wenbin Zhu [ 01/Jul/22 ]

bernard.gorman@mongodb.com Yes we initially wanted to support replicating system.js, but recently due to some limitations (one of them being the privilege needed to create/drop system collections) we decided to not support replicating any system collections for GA, but I think after GA, we are going to add support for that.

Comment by Bernard Gorman [ 01/Jul/22 ]

wenbin.zhu@mongodb.com:

C2C needs to add this filter because we use showSystemEvents in order to get create/createIndexes events due to chunk migration, which also generates events from system collections that we need to exclude.

But {showSystemEvents:true} should only be reporting events on system.js, which C2C specifically requested in the original scope?

Comment by Wenbin Zhu [ 01/Jul/22 ]

(As an aside, it's worth noting that this filter isn't actually necessary, since change streams by default does not return any events on system collections).

C2C needs to add this filter because we use showSystemEvents in order to get create/createIndexes events due to chunk migration, which also generates events from system collections that we need to exclude.

Comment by Vishnu Kaushik [ 01/Jul/22 ]

Yes, sorry bernard.gorman@mongodb.com, that is a typo (I've fixed it now) - when using $nin, "system_js" is NOT showing up with the regex /^system\./ though it should. It will show up if we use the regex /^system\\./.

Comment by Bernard Gorman [ 01/Jul/22 ]

However, this is returning events from collections that start with "system" but are not actually system collections (below an event is returned for "system_js").

vishnu.kaushik@mongodb.com, did you mean that this is NOT returning events from collections like system_js?

I believe this is due to how we rewrite the $match into a filter on the oplog. If I look at the explain output for the $changeStream pipeline with this filter, I see the following:

"in" : {
	"$regexMatch" : {
		"input" : "$$oplogField",
		"regex" : {
			"$const" : "^system."
		},
		"options" : {
			"$const" : ""
		}
	}
}

Looks like the escaped period is being resolved to a literal period before being applied in the regex, causing it to match anything that starts with system and has at least one additional character after it.
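
The effect described above can be reproduced with a small plain-JavaScript model. This is a sketch under an assumption: stripOneEscapeLevel is a hypothetical stand-in for whatever the rewrite does to the pattern text, not the server's actual code.

```javascript
// Hypothetical model of the suspected rewrite bug: drop one level of
// backslash escaping from the pattern text. Not the server's actual code.
function stripOneEscapeLevel(pattern) {
  return pattern.replace(/\\(.)/g, "$1");
}

const single = /^system\./.source;  // pattern text: ^system\.
const double = /^system\\./.source; // pattern text: ^system\\.

// What the oplog filter would effectively apply under this model.
const rewrittenSingle = new RegExp(stripOneEscapeLevel(single)); // ^system.
const rewrittenDouble = new RegExp(stripOneEscapeLevel(double)); // ^system\.

console.log(rewrittenSingle.test("system_js")); // true
console.log(rewrittenDouble.test("system_js")); // false
console.log(rewrittenDouble.test("system.js")); // true
```

Under this model the singly escaped pattern ends up matching "system_js", so $nin drops its events, while the doubly escaped pattern ends up matching only names that really begin with "system.". That is consistent with the workaround in the description.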

(As an aside, it's worth noting that this filter isn't actually necessary, since change streams by default does not return any events on system collections).

Comment by Vishnu Kaushik [ 30/Jun/22 ]

Ok, I verified that it happens on the 6.0 binary as well, commit hash 952ed79880ec280dce20c95ce3b178036d366771.

Comment by Jennifer Peshansky (Inactive) [ 30/Jun/22 ]

From a glance, the upgrade isn't involved in and of itself, since the regex works correctly in some situations in the shell but not in others. It seems to have to do with how the change stream code parses slashes?

Comment by Kyle Suarez [ 30/Jun/22 ]

jennifer.peshansky@mongodb.com do you think that the PCRE2 Upgrade is potentially involved here?

Comment by Vishnu Kaushik [ 30/Jun/22 ]

I was running this locally with FCV 6.0, but the binary version is master.

Comment by Kyle Suarez [ 30/Jun/22 ]

vishnu.kaushik@mongodb.com what version was this run on? Master or 6.0?
