[SERVER-58809] Optimize $unwind + $group _id, to avoid blocking/spilling Created: 23/Jul/21  Updated: 12/Jan/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: David Percy Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Execution
Participants:

 Description   

Currently this is an antipattern:

{$unwind: "a"}
{$match ...}
{$group: {_id: "$_id", ...}}

because $group is a blocking stage, and can spill if the data is big enough. We recommend something like this instead:

{$set: {a: {$filter ...}}}

This performs better because it operates on one document at a time.

But the first version is nicer in some ways:

  • You can easily view intermediate results:
    • by commenting out stages,
    • or in Compass.
  • You might not need to learn two versions of every operator ($match/$filter, $addFields/$map, $group/$reduce).

We could make it perform better by doing a streaming group (in this narrow case).

  • Streaming $group is valid when documents are clustered by the group key.
  • Documents in a collection are clustered by _id (because we have a unique, non-multikey index on _id).
  • $unwind preserves this (if it unwinds one document at a time).
  • $match preserves this.
  • $project/$set can preserve this, depending on which paths they write.

Generated at Thu Feb 08 05:45:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.