[SERVER-22123] Add an option to the $sample stage to specify weights to use in the sampling. Created: 11/Jan/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 3.2.0
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Lukas Wagner Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 1
Labels: grab-bag, stage
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-30405 add expression to generate a random n... Closed
Assigned Teams:
Query Optimization
Backwards Compatibility: Fully Compatible
Participants:

Description

Specifying this option would prevent the storage engine from using any optimized random-cursor implementation; the stage would instead always perform a top-k random sort, with each document's random sort value multiplied by the specified weight.

For example:

db.example.aggregate([{
    $sample: {
        size: 100,
        weight: "$myWeightField"
    }
}]);
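For reference, once a random-value expression exists (see the linked SERVER-30405), the proposed semantics could be approximated with an explicit top-k random sort. A minimal sketch, assuming the myWeightField values are non-negative; the _rankKey field name is illustrative, not proposed syntax:

db.example.aggregate([
    // attach a weighted random sort key to each document
    { $set: { _rankKey: { $multiply: [ { $rand: {} }, "$myWeightField" ] } } },
    // keep the top-k by that key: the "top-k random sort" described above
    { $sort: { _rankKey: -1 } },
    { $limit: 100 },
    // drop the temporary key before returning results
    { $unset: "_rankKey" }
]);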



Comments
Comment by Lukas Wagner [ 27/Jan/16 ]

Hi Charlie,

thanks for proposing the issue internally.
The duplicates part is more of a nice-to-have; the feature works without it. It would mainly avoid confusing users, and at least for me it is not critical to the functionality, though others might disagree on that point.

Edit: As a PS, there is weighted randomness built into Boost: http://www.boost.org/doc/libs/1_59_0/doc/html/boost_random/tutorial.html#boost_random.tutorial.generating_integers_with_different_probabilities
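For illustration only, the technique behind that link, picking an index with probability proportional to its weight, looks roughly like this in plain JavaScript (the weightedPick name is made up for this sketch):

function weightedPick(items, weights) {
    // total mass; the probability of items[i] is weights[i] / total
    const total = weights.reduce((a, b) => a + b, 0);
    let r = Math.random() * total;
    for (let i = 0; i < items.length; i++) {
        r -= weights[i];
        if (r < 0) return items[i];
    }
    return items[items.length - 1]; // guard against floating-point rounding
}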

Comment by Charlie Swanson [ 26/Jan/16 ]

Hi Lukas,

I've proposed this internally, and if/when we all agree on the syntax and semantics, we'll work on a fix. I've updated this ticket to reflect the revised plan. I've also removed the backport request, since this is a new feature, and we generally do not backport new features to released versions.

As for the duplicates: the $sample stage is logically a sample without replacement, but we cannot guarantee the absence of duplicates because of our isolation semantics (see here for more details). This is not a trivial issue to fix, and I don't think we would want to add de-duplicating logic to the $sample stage alone, since this is a general problem that should be solved everywhere.
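For anyone who needs to work around this today, a minimal client-side sketch, assuming de-duplication by _id is acceptable for the caller (this is not server behavior):

// fetch a sample, then drop any duplicates by _id on the client
const seen = new Set();
const sample = db.example.aggregate([{ $sample: { size: 100 } }])
    .toArray()
    .filter(doc => {
        const key = doc._id.toString();
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
    });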

Comment by Lukas Wagner [ 16/Jan/16 ]

Hi Charlie,

yes, that would be great.
An option to allow or disallow duplicates would be useful, too. As far as I've seen, you currently have no choice but to accept that duplicates can occur.

Comment by Charlie Swanson [ 15/Jan/16 ]

Hi Lukas,

I think your use case might be addressed by something like the following:

db.foo.aggregate([{$sample: {size: 10, weight: "$myWeightField"}}])

This wouldn't be so hard to do. Let me know if that would work for you, and I'll confirm that this makes sense from our end.

Comment by Lukas Wagner [ 13/Jan/16 ]

Hi Charlie,

yep, that would be an option. However, there is no way to weight the randomness by some kind of rating system. Maybe a better approach would be to add that option to the $sample stage.

Comment by Charlie Swanson [ 13/Jan/16 ]

Would the $sample stage do what you wanted?

Comment by Lukas Wagner [ 13/Jan/16 ]

Hi Charlie,

you'd need it for any kind of randomized access to a collection's data; right now there is no way to do that at all. Let's use a common real-world example.

Imagine a database that has to serve ads. There are criteria for choosing which ads to show to a user, perhaps backed by some kind of rating. Now imagine you had several hundred ads you could serve to a user. Today you'd have to query them all, which uses up network resources, and randomize at the CDN stage; that is rather inefficient. If you did not randomize at all, the user would see the same ads every time, which is not what you want: if the user did not click on an ad the first several times it was served, it is highly unlikely he will in the near future. The most efficient approach is to randomize, and filter on that randomization, as early in the aggregation pipeline as possible. That saves RAM and, in this case especially, network resources (several hundred ads transferred from the database to the CDN server versus only the one needed, per request!).

Another option would be to add a field with a random number to each document. That approach has two major disadvantages: the random numbers would be the same for every user and every request for a period of time, which causes clumping, and regenerating them would cause a whole lot of writes every time there is "feeding time".
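For illustration, a rough sketch of that workaround and its cost; the ads collection and rnd field are invented for this example, and the rewrite loop is exactly the write amplification being objected to:

// "feeding time": rewrite every document's random field (expensive!)
db.ads.find().forEach(function(doc) {
    db.ads.updateOne({ _id: doc._id }, { $set: { rnd: Math.random() } });
});

// serving: pick near a random point; until the next rewrite, every
// request sees the same ordering, which causes the clumping described above
const r = Math.random();
db.ads.find({ rnd: { $gte: r } }).sort({ rnd: 1 }).limit(1);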

Comment by Charlie Swanson [ 13/Jan/16 ]

Hi Lukas,

Before we go forward with implementing this (sorry if you've already started), can you describe why you need this expression? What are you using it for?

We have some concerns that this may add some subtle complexity to aggregation's optimizer. This would be the first expression to return different results depending on the order in which it was evaluated, or on how many times it was evaluated, which would make it harder to reason about which optimizations are safe to apply.
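A contrived sketch of that concern, using the $rand expression that was later added under the linked SERVER-30405 as a stand-in for any random expression:

// two evaluations of a random expression in the same document almost
// always differ, so a rewrite that merged or reordered them would
// change the query's results
db.foo.aggregate([
    { $project: { a: { $rand: {} }, b: { $rand: {} } } }
]);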

Comment by Lukas Wagner [ 13/Jan/16 ]

Hi Charlie,

thanks for the heads-up on the contributing guidelines and the agreement. I was aware of the guidelines but had yet to sign the agreement. It's all done now.

Regards,
Lukas Wagner

Comment by Charlie Swanson [ 12/Jan/16 ]

m3t4lukas, I'm excited to hear that you are working on a patch!

If you're planning to submit a pull request to have this merged into the server project, here is a useful guide to getting started. In particular, you'll have to sign the Contributor's Agreement. Apologies if you already knew this or have already signed it.

I'll assign this ticket to myself in the meantime, since I'll likely review your patch, and we can't assign tickets to people outside of MongoDB.

Let me know if there's anything I can do to help!

Comment by Lukas Wagner [ 12/Jan/16 ]

Hi Charlie,

what you assumed is correct. If you like, you can assign it to me, as I am already working on it.
Thanks for filling me in on priorities. I only used that priority because I can't continue work on my current project without this feature.

Comment by Charlie Swanson [ 11/Jan/16 ]

Hi m3t4lukas,

I've filled in the description with what I believe you are asking for, let me know if this is not correct.

I've downgraded the priority of this ticket to the default priority. We don't use the priority field when prioritizing new features, so I've changed it to the default to avoid possible confusion in other search results.
