[SERVER-47199] add 'remove' action to $merge whenMatched Created: 31/Mar/20  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Remi Jolin Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 1
Labels: qopt-team
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Query Optimization
Participants:

 Description   

It would be helpful to have a "remove" action to the $merge whenMatched option.

Use case :

remove records from this collection if they exist (somehow) in that other collection.



 Comments   
Comment by Remi Jolin [ 19/Oct/20 ]

asya Yes, it could be a join in delete (do you mean $remove ?) (and in update, ...).

I thought that you were putting your efforts on adding functionnalities to $aggregate and I proposed this as a extention : a delete/remove could be seen as "updating a record to null"...

I don't mind how you implement the function. The idea is to mimic the

delete from T2 where id in (select id from T1 where ....)

SQL statement

The current implementation requires to update each record you want to delete twice. First to mark the record as a record to delete, then a second "update" to delete it. This as a real impact on replication logs when you delete 1000s or 10.000s of records in a single batch.

Comment by Asya Kamsky [ 19/Oct/20 ]

Ok, so this request isn't specifically for $merge behavior but to have join functionality in delete command, right?

Comment by Remi Jolin [ 17/Oct/20 ]

asya Can you give me an example where you delete record from one collection based on a condition met by records on another collection ?

And having job done in a single query seems simpler than having to run 2 queries. Old RDB engineers don't understand why you should use thoses tricks to do something that is much simpler to do with sql.

Replacing all existing fields is a bit more complicated when you deal with shadered collections. It whould be much simpler if you could directly remove/delete the record. Usually a language should simplify the work for the programer, not make it more complicated. Having to always find workarounds for queries you would write with a simple sql query is time consuming.

Some of my collections have tens of 10^9 records and a single delete can handle 500k to 1M records. Don't know if it is what you call "too many". The goal is not to have to get those _ids going on the network.

 

Comment by Asya Kamsky [ 16/Oct/20 ]

Another thought for a workaround - is there a reason not to just do query of one collection for {{_id}}s and then pass that to remove command? Or is it expected that there would be too many?

Comment by Asya Kamsky [ 16/Oct/20 ]

rj-10gen@arsynet.com I'm not sure I understand the issue with your workaround - it seems fine. I'm not really sure why you need to use $merge on the original collection - you just copied records to archive using some criteria, can't you use the same criteria to delete this data?

In addition, I'm not sure what you mean by the workaround leading to "chunk split" - you can just replace all existing fields in the current records you're merging with with to_delete:true so the document is guaranteed to be smaller before you delete it.

Comment by Carl Champain (Inactive) [ 06/Apr/20 ]

Thank you rj-10gen@arsynet.com!

We are passing this ticket along to the appropriate team for additional investigation.

Comment by Remi Jolin [ 06/Apr/20 ]

in sql, you would write it as 

 

delete from T2 where id in (select id from T1)

My use case is :

I have a collection (schedule) with new data arriving each month. Let say it is train schedules published for the next 12 months. So obviously some new records overlap some old ones and I mark the old one with a flag "active = false". 

Then I copy those not active records to an archive collection (schedule_archive). And, at the end, I want to delete the copied records from the main collection. But only if they have been copied.

With 'remove' it would look like

 

db.schedule_archive.aggregate([
    {match: {some filter}, 
    {$merge: {into: 'schedule', 
              whenMAtched: 'remove'
              whenNotMatched: 'discard'}}
])

 

To do that now I have to use 2 steps : the first one is to add a new field {_to_delete: true} to the records I want to delete and the actually remove them like this

db.schedule_archive.aggregate([
    {match: {some filter}, 
    {$merge: {into: 'schedule', 
              whenMAtched: [{$addFields: {_to_delete: true}}], 
              whenNotMatched: 'discard'}}
])
 
db.schedule.remove({_to_delete: true})
 
( a partialFiltered index on _to_delete helps ;-) )

 It works but, as it add data to the collection, it may lead to chunk split which would not occur if we were deleting the records in the first step and that are not necessary because those records will be eventually removed.

 

 

Comment by Carl Champain (Inactive) [ 31/Mar/20 ]

Hi rj-10gen@arsynet.com,

Thank you for the report.
Can you please provide a concrete example? This will help us move this ticket along.

 

Generated at Thu Feb 08 05:13:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.