[SERVER-31831] Improve aggregation set operations for array of objects Created: 03/Nov/17  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Joel Goldfinger Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-31991 Allow n-ary aggregation expressions t... Backlog
Assigned Teams:
Query Optimization
Participants:

 Description   

Currently, it is hard to compute stats / transform data on a nested array of objects where a subset of the fields make up the key (hash) of the object. It is possible to do via $unwind and $group, but that is an issue when operating on multiple fields in the documents at the same time. The other way is to $map using $concat on the key fields -> $setUnion -> $map -> $reduce using $cond with $in, but that is way too slow.

It would be helpful if the set operations allowed specifying the comparison function. When providing a custom comparison function, the $setUnion, $setIntersection, and $setDifference could have a mandatory reduce function to merge duplicates.

{ $setEqual: {
    input: [
      [{a: 1, b: 1, c: 1}, {a: 2, b: 1, c: 3},{a: 2, b: 1, c: 4}]
      [{a: 1, b: 1, c: 2}, {a: 2, b: 1, c: 5}]
    ]
  ],
  as: ["item1", "item2"], // name the two objects being compared at a time
  cond: {
    $and: [
      {$eq: ["$$item1.a", "$$item2.a"]},
      {$eq: ["$$item1.b", "$$item2.b"]}
    ]
  }
}

Result: true

{ $setUnion: {
    input: [
      [{a: 1, b: 1, c: 1}, {a: 2, b: 1, c: 3}, {a: 2, b: 1, c: 4}]
      [{a: 1, b: 1, c: 2}, {a: 2, b: 1, c: 5}]
    ]
  ],
  as: ["item1", "item2"], // name the two objects being compared at a time
  cond: {
    $and: [
      {$eq: ["$$item1.a", "$$item2.a"]},
      {$eq: ["$$item1.b", "$$item2.b"]}
    ]
  },
  reduce: {
    initialValue: { c: 0 },
    in: {
      a: "$$this.a",
      b: "$$this.b",
      c: { $add: ["$$value.c", "$$this.c"]
    }
  }
}

Results: [{a: 1, b: 1, c: 3}, {a: 2: b: 1, c: 12}]



 Comments   
Comment by Asya Kamsky [ 16/Dec/17 ]

Agreed that internal implementation can be more efficient.

Comment by Joel Goldfinger [ 04/Dec/17 ]

It works, but it is pretty slow on a large data set. It would probably be faster if the items could be stored in a hash instead of an array. As implemented, it has to scan the entire initialValue array for each item in the reduce input.

Thanks.

Comment by Asya Kamsky [ 10/Nov/17 ]

devnopt

I agree that $unwind and $group is definitely not the right way to do this as it would be very slow. I'm also not sure if your sample code does what you describe exactly - I think you want to get a set of all unique "a, b" values and then do something else with the leftover field, like add it, right?

This can already be done with $reduce, in other words, replace $setXXXX with $reduce in your proposed syntax and use $let to assign the appropriate element to variables and do calculations. Something like the following:

{"$reduce" : {
	"input" : {"$concatArrays" : ["$l","$m"]},
	"initialValue" : {"$setUnion" : [
		{"$map" : {
			"input" : {"$setUnion" : {"$concatArrays" : ["$l","$m"]}},
			"in" : {"a" : "$$this.a","b" : "$$this.b"}
		}}
	]},
	"in" : {"$let" : {
		"vars" : {"cur" : "$$this"},
		"in" : {"$map" : {
			"input" : "$$value",
			"in" : {"$cond" : [
				{"$eq" : [
					["$$this.a","$$this.b"],
					["$$cur.a","$$cur.b"]
				]},
				{"$let" : {
					"vars" : {"obj" : "$$this"},
					"in" : {
						"a" : "$$obj.a",
						"b" : "$$obj.b",
						"c" : {"$add" : [
							{"$ifNull" : ["$$obj.c",0]},
							{"$ifNull" : ["$$cur.c",0]}
						]}
					}
				}},
				"$$this"
			]}
		}}
	}}
}}

Does this do what the syntax you are proposing would do? If I misunderstood what you are trying to do, please let me know.

Asya

Comment by Mark Agarunov [ 06/Nov/17 ]

Hello devnopt,

Thank you for the detailed example. I've set the fixVersion to "Needs Triage" for this new feature to be scheduled against our currently planned work. Updates will be posted on this ticket as they happen.

Thanks,
Mark

Generated at Thu Feb 08 04:28:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.