[SERVER-53503] Performance issue for $reduce vs. $map Created: 23/Dec/20  Updated: 27/Oct/23  Resolved: 07/May/21

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 4.4.2
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Wernfried Domscheit Assignee: Edwin Zhou
Resolution: Works as Designed Votes: 0
Labels: aggregation-framework, performance
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Sprint: Query Optimization 2021-05-03, Query Optimization 2021-05-17
Participants:

 Description   

I have an array field where I need to replace consecutive duplicate values with null, e.g.

{ x: [22, 22, 80, 80, 80, 80, 80, 443, 443, 443, 5223, 5224] }

should be translated to:

{ x: [22, null, 80, null, null, null, null, 443, null, null, 5223, 5224] }

I developed two versions of the aggregation stage:

{
   $set: {
      x: {
         $cond: {
            if: { $isArray: "$x" },
            then: {
               $map: {
                  input: { $range: [0, { $size: "$x" }] },
                  as: "idx",
                  in: {
                     $let: {
                        vars: {
                           this: { $arrayElemAt: ["$x", "$$idx"] },
                           prev: { $arrayElemAt: ["$x", { $subtract: ["$$idx", 1] }] }
                        },
                        in: {
                           $cond: {
                              if: { $and: [{ $eq: ["$$this", "$$prev"] }, { $gt: ["$$idx", 0] }] },
                              then: null,
                              else: "$$this"
                           }
                        }
                     }
                  }
               }
            },
            else: "$x"
         }
      }
   }
}

and

{
   $set: {
      x: {
         $cond: {
            if: { $isArray: "$x" },
            then: {
               $let: {
                  vars: {
                     values: {
                        $reduce: {
                           input: "$x",
                           initialValue: [],
                           in: {
                              $concatArrays: [
                                 "$$value",
                                 [{
                                    val: "$$this",
                                    new: {
                                       $cond: {
                                          if: { $ne: ["$$this", { $last: "$$value.val" }] },
                                          then: "$$this",
                                          else: null
                                       }
                                    }
                                 }]
                              ]
                           }
                        }
                     }
                  },
                  in: "$$values.new"
               }
            },
            else: "$x"
         }
      }
   }
}

Both of them work fine; however, the first version with `$map: { input: { $range: ... }` is around 10 times faster than the second version with `$reduce`.

Is there a reason why `$reduce` is 10 times slower than `$map`? I would have expected the second version to be a little faster, because the array "$x" is read only once. Is there any possibility to improve it?

My collection contains several million documents; the array sizes vary from 2 to 150,000 elements, with an average of 50.
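
For reference, the timing difference can be reproduced in mongosh along these lines. The collection name, document count, and sample array are illustrative, and mapStage / reduceStage stand for the two $set stages above:

// Populate a scratch collection, then time one $set stage as a bulk
// update; reset the data and repeat with the other stage.
function reset() {
   db.test_col.drop();
   const docs = [];
   for (let i = 0; i < 200000; i++) {
      docs.push({ x: [22, 22, 80, 80, 80, 80, 80, 443, 443, 443, 5223, 5224] });
   }
   db.test_col.insertMany(docs);
}

reset();
let t0 = Date.now();
db.test_col.updateMany({}, [mapStage]);      // first version ($map over $range)
print("map version:   ", Date.now() - t0, "ms");

reset();
t0 = Date.now();
db.test_col.updateMany({}, [reduceStage]);   // second version ($reduce)
print("reduce version:", Date.now() - t0, "ms");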

Kind Regards
Wernfried

 Comments   
Comment by Edwin Zhou [ 07/May/21 ]

Hi wernfried.domscheit@sunrise.net,

Our investigation has led us to believe the current implementation of $reduce is working as designed. The actual culprit is the $concatArrays expression used within the $reduce pipeline. When using $reduce, we build a new array by sequentially appending either the original value (if it differs from the previous value) or null. Each time $concatArrays appends to the accumulator, it creates a full copy of the existing array.

For example, if we need to sequentially append 1000 values:

  1. We begin by appending the first two values,
  2. then we copy the resulting array of size 2 into a new array and append the 3rd element,
  3. then we copy the array of size 3 into a new array and append the 4th element, and so on.

This involves a quadratic (N^2) number of element copies: the accumulator array is copied once for each element of the array that $reduce is working on. It does not appear trivial to eliminate the copying without a significant rework of the functionality of $concatArrays.

When using $map, even though each array element is accessed twice, there is no O(N^2) copying involved. Instead, every array value is sequentially replaced by the corresponding element from the original array (or null), preserving the array's structure.
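
As a toy model of the two evaluation strategies (plain JavaScript counting element copies, not the server's actual implementation):

// appendByCopy mirrors $reduce + $concatArrays: each step copies the
// whole accumulator. mapInPlace mirrors $map: each output slot is
// written exactly once.
function appendByCopy(input) {
   let acc = [];
   let copies = 0;
   for (const v of input) {
      acc = [...acc, v];      // like $concatArrays: copies acc, then appends
      copies += acc.length;   // each step copies O(current length) elements
   }
   return copies;
}

function mapInPlace(input) {
   let copies = 0;
   for (let i = 0; i < input.length; i++) {
      copies += 1;            // one write per output slot
   }
   return copies;
}

const input = Array.from({ length: 1000 }, (_, i) => i);
print(appendByCopy(input));   // 500500 -- quadratic, ~N^2/2
print(mapInPlace(input));     // 1000   -- linear, N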

We hope this answers your inquiry.

Best,
Edwin

Comment by Edwin Zhou [ 24/Mar/21 ]

Hi wernfried.domscheit@sunrise.net,

Great work discovering a faster pipeline for your use case! It's indeed unusual and unhelpful that two pipelines performing the same task can differ so significantly in performance, and we agree that it's worth investigating. Since we're hoping to improve the performance of $reduce, I'll change this ticket to an improvement and assign it to the appropriate team to continue the investigation.

Thanks,
Edwin

Comment by Wernfried Domscheit [ 24/Mar/21 ]

Actually, this version is even faster:

   {
      $set: {
         x: {
            $cond: {
               if: { $isArray: "$x" },
               then: {
                  $let: {
                     vars: { val: "$x", prev: { $concatArrays: [[null], "$x"] } },
                     in: {
                        $map: {
                           input: { $range: [0, { $size: "$$val" }] },
                           as: "idx",
                           in: {
                              $cond: {
                                 if: { $eq: [{ $arrayElemAt: ["$$val", "$$idx"] }, { $arrayElemAt: ["$$prev", "$$idx"] }] },
                                 then: null,
                                 else: { $arrayElemAt: ["$$val", "$$idx"] }
                              }
                           }
                        }
                     }
                  }
               },
               else: "$x"
            }
         }
      }
   }

On my test it takes about 3000ms instead of 4600ms.
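
The key idea is the shifted prev array; modeled in plain JavaScript (values are illustrative):

// Prepending null shifts the array by one, so prev[i] is always the
// element preceding val[i]; equal neighbors become null.
const val = [22, 22, 80, 80, 443];
const prev = [null, ...val];   // like $concatArrays: [[null], "$x"]
const out = val.map((v, i) => (v === prev[i] ? null : v));
print(out);   // [ 22, null, 80, null, 443 ]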

Best Regards
Wernfried

Comment by Wernfried Domscheit [ 24/Mar/21 ]

Hi Edwin Zhou

Your optimized pipeline requires the values to be sorted, because $in matches a value anywhere in the accumulated array, not only at the immediately preceding position. In my application the values are often sorted, but it is not guaranteed. The array could be like this:

{ x: [22, 22, 80, 80, 80, 80, 80, 443, 443, 443, 80, 80, 80, 80, 5223, 5224] }
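
On such an array, the $in-based global dedup diverges from consecutive dedup. A plain-JavaScript model, with a shortened input for readability:

// Consecutive-duplicate nulling vs. global dedup as done by $in.
const x = [22, 80, 443, 80, 80];
const consecutive = x.map((v, i) => (i > 0 && v === x[i - 1] ? null : v));
const deduped = [];
for (const v of x) {
   deduped.push(deduped.includes(v) ? null : v);   // mirrors $in over $$value
}
print(consecutive);   // [ 22, 80, 443, 80, null ]   -- second 80 run survives
print(deduped);       // [ 22, 80, 443, null, null ] -- second 80 run is lost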

For the moment I have to stick to the first version.

Best Regards
Wernfried

Comment by Edwin Zhou [ 23/Mar/21 ]

Hi wernfried.domscheit@sunrise.net,

I apologize for the delay in my investigation. I indeed observed a similar performance gap between the two pipelines: the pipeline with $reduce took 8305ms to execute, while the pipeline with $map took 4587ms, which is a significant difference!

I took the time to optimize the $reduce pipeline and removed the $let expression:

{
   $set: {
      x: {
         $cond: {
            if: { $isArray: "$x" },
            then: {
               $reduce: {
                  input: "$x",
                  initialValue: [],
                  in: {
                     $concatArrays: [
                        "$$value",
                        [{
                           $cond: {
                              if: { $not: { $in: ["$$this", "$$value"] } },
                              then: "$$this",
                              else: null
                           }
                        }]
                     ]
                  }
               }
            },
            else: "$x"
         }
      }
   }
}

This pipeline took 5596ms to execute:

{"t":{"$date":"2021-01-12T17:20:25.809-05:00"},"s":"I",  "c":"WRITE",    "id":51803,   "ctx":"conn11","msg":"Slow query","attr":{"type":"update","ns":"test.test_col","appName":"MongoDB Shell","command":{"q":{},"u":[{"$set":{"x":{"$cond":{"if":{"$isArray":"$x"},"then":{"$reduce":{"input":"$x","initialValue":[],"in":{"$concatArrays":["$$value",[{"$cond":{"if":{"$not":{"$in":["$$this","$$value"]}},"then":"$$this","else":null}}]]}}},"else":"$x"}}}}],"multi":true,"upsert":false},"planSummary":"COLLSCAN","keysExamined":0,"docsExamined":200000,"nMatched":200000,"nModified":200000,"keysInserted":0,"keysDeleted":0,"numYields":365,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":366}},"ReplicationStateTransition":{"acquireCount":{"w":366}},"Global":{"acquireCount":{"w":366}},"Database":{"acquireCount":{"w":366}},"Collection":{"acquireCount":{"w":366}},"Mutex":{"acquireCount":{"r":1}}},"flowControl":{"acquireCount":366,"timeAcquiringMicros":269},"storage":{},"durationMillis":5596}}

While this doesn't match the performance of $map, it is a significant improvement over the previous pipeline using $reduce.

Have you been able to settle on a pipeline that works best for you?

Best,
Edwin
