[SERVER-66248] Projection with ExpressionObject does not round trip (serialize + parse) Created: 05/May/22  Updated: 29/Oct/23  Resolved: 14/Sep/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: David Percy Assignee: Matt Olma
Resolution: Fixed Votes: 0
Labels: query-director-triage, query-product-urgency-2, query-product-value-1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Optimization
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: QO 2023-08-07, QO 2023-08-21, QO 2023-09-04, QO 2023-09-18
Participants:

 Description   

When we serialize a projection, as in {$set: {a: _}}, we need to be careful when the expression happens to be an object-expression, such as {b: "$_id"}. Currently, serialize() incorrectly produces {$set: {a: {b: "$_id"}}} which has a different meaning.


For example, set up a collection like this:

db.c.drop();
db.c.insert({_id: 123, a: [{z:99}, {z:100}]});

Run this query:

db.c.aggregate([ {$set: {a: {$ifNull: [null, {b: "$_id"}]}}} ])

On a non-sharded collection, this correctly returns docs where the entire field 'a' is replaced with a new document {b: 123}:

{ "_id" : 123, "a" : { "b" : 123 } }

However, if you set up the same example on a sharded collection:

$ mongo --nodb
> st = new ShardingTest({nodes:2, s:1})

$ mongo --port $(pgrep mongos)
mongos> db.c.drop();
mongos> db.c.insert({_id: 123, a: [{z:99}, {z:100}]});
mongos> db.c.createIndex({_id: 'hashed'})
mongos> sh.shardCollection('test.c', {_id: 'hashed'})

And you run the same query, you get a wrong answer:

mongos> db.c.aggregate([ {$set: {a: {$ifNull: [null, {b: "$_id"}]}}} ])
{ "_id" : 123, "a" : [ { "z" : 99, "b" : 123 }, { "z" : 100, "b" : 123 } ] }

Instead of replacing all of 'a', we have added a field 'b' to every array element, and preserved other subfields.


This happens because internally, after optimization, we end up with an AST that we cannot correctly serialize. We have something like (ProjectionNode "a" (ExpressionObject "b" ...)) which we serialize as {$set: {a: {b: _}}}, as is then misinterpreted as (ProjectionNode "a" (ProjectionNode "b" ...)).



 Comments   
Comment by Githook User [ 13/Sep/23 ]

Author:

{'name': 'Matt Olma', 'email': 'matt.olma@mongodb.com', 'username': 'mattsimply'}

Message: SERVER-66248: Projection with ExpressionObject does not round trip (serialize + parse)
Branch: master
https://github.com/mongodb/mongo/commit/be4e41fba7726242678ca1be00177f06c6c543a1

Comment by David Percy [ 19/Jul/23 ]

At a high level there are two separate changes we would make:

  1. Introduce a new syntax.
  2. Use the new syntax to avoid ambiguity.

In more detail:

  1. Introduce a new syntax.
    • Note: don't create any new subclass of Expression.
    • Register a new expression parser under the keyword "$expr".
      • This can live in 'src/mongo/db/pipeline/expression.cpp'.
      • In the new parser: check that the syntax is {$expr: _}, with no extra fields, then parse the one argument, and return it.
      • Register it as a stable expression, so that apiStrict:true clients can use it.
      • Don't put any minimum version or FCV check on it. It's not necessary because this isn't a new operator: it disappears during parsing.
    • Write some tests:
      • Include an example such as {$match: {$expr: {$expr: {$expr: {$eq: ["$x", 123]}}}} to show that this doesn't cause any problem with $match.  This example will serialize as {$match: {$expr: {$eq: ["$x", 123]}} because the outermost $expr parses as ExprMatchExpression and the other two disappear during parsing.
      • Include an example such as {$set: {a: {$expr: {b: 1}}}}.  This will preserve toplevel fields besides 'a', but not any subfields of 'a', because {b: 1} is parsed as ExpressionObject, and replaces the entire contents of 'a'.
        • This test will fail on a sharding passthrough, because the stage will serialize incorrectly as {$set: {a: {b: 1}

          }}.  Step 2 will fix this.

      • Include some syntax-error cases such as {$expr: 123, extra_field: 456}.
      • Test that {$expr: [123]} evaluates to an array, [123].  Many expressions treat an array as an argument-list, and would interpret [123] as a single argument, 123.  That behavior doesn't make sense for $expr, because it always takes a single argument, and because the existing syntax {$match: {$expr: [false]}} interprets [false] as an array.
  2. Use the new syntax to avoid ambiguity.
    • In 'ProjectionNode::serialize', when we serialize an expression, check whether it's ExpressionObject, and check whether FCV is latest. If so, wrap the serialized expression in {$expr: _}.
      • It's important to check FCV because if any nodes in the cluster are an older binary version, their parsers won't accept this syntax.
      • ExpressionObject is the only case where wrapping is necessary to avoid parsing ambiguity.  It would be valid to wrap with {$expr: _} for all expressions, but we should avoid adding extra syntactic noise to explain output.
    • Write some tests:
      • In C++, construct a projection that contains ExpressionObject, serialize it, then parse it. Make sure the parsed AST is the same as the original.
      • In JS, try an example like {$project: {a: {$ifNull: [null, {b: 123}

        ]}}}.  Explain will show it as {$project: {a: {$expr: {b: 123}}}}, because the $ifNull is optimized away, and $expr was necessary to disambiguate.

      • Enable the tests from step 1 on sharding passthrough suites.
    • Existing multiversion tests will make sure we don't send new syntax to old binaries in a mixed-version cluster.

In total I think the amount of work is comparable to adding a new expression.  The code change is smaller than that, but the testing is larger.  It might make sense to split into two tickets, and then each one might take a few days.

Comment by Steve Tarzia [ 19/Jul/23 ]

david.percy@mongodb.com can you please do some more investigation here and come up with a plan and estimate?

Comment by David Percy [ 06/May/22 ]

One way we can solve this is to allow the query to use a keyword ($expr ?) to mark where the projection ends and the expression begins.

{$set: {a: {$expr: {b: "$_id"}}}}

This wouldn't need to be a special case in $set / $project / $addFields: we could register a $expr as an expression parser. It wouldn't need a new subclass of Expression either: the parser function could just parse its argument and return that.

Generated at Thu Feb 08 06:04:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.