[SERVER-22123] Add an option to the $sample stage to specify weights to use in the sampling. Created: 11/Jan/16 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | 3.2.0 |
| Fix Version/s: | None |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Lukas Wagner | Assignee: | Backlog - Query Optimization |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | grab-bag, stage | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Query Optimization |
| Backwards Compatibility: | Fully Compatible | ||||||||
| Participants: | |||||||||
| Description |
|
Specifying this option would prevent the use of any optimized random cursor implementation from the storage engine. Instead, the stage would always use a top-k random sort, with the random value used for sorting multiplied by the specified weight. For example:
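The following is a minimal Python sketch of the top-k random sort described above, not the server's implementation. The function name `weighted_sample`, the `weight_field` parameter, and the default weight of 1.0 for documents without a weight are all assumptions made for illustration. The keying rule (a uniform random value multiplied by the weight) is taken from the description; it favours higher-weighted documents but makes no claim that selection probability is exactly proportional to weight.

```python
import heapq
import random


def weighted_sample(docs, size, weight_field="weight", rng=None):
    """Return up to `size` documents chosen by a weighted top-k random sort.

    Each document gets a sort key of (uniform random value * weight), and the
    `size` documents with the largest keys are kept. Names and defaults here
    are hypothetical; they are not part of any MongoDB API.
    """
    rng = rng or random.Random()
    # Pair every document with its random sort key.
    keyed = [(rng.random() * doc.get(weight_field, 1.0), doc) for doc in docs]
    # Keep only the top-k entries by key, without fully sorting the input.
    top = heapq.nlargest(size, keyed, key=lambda pair: pair[0])
    return [doc for _, doc in top]


if __name__ == "__main__":
    docs = [
        {"_id": 1, "weight": 0},
        {"_id": 2, "weight": 5},
        {"_id": 3, "weight": 1},
    ]
    print(weighted_sample(docs, 2))
```

Because every document needs a sort key, a scan of the whole input is unavoidable in this approach, which is consistent with the note that an optimized random cursor could not be used.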
|
| Comments |
| Comment by Lukas Wagner [ 27/Jan/16 ] | |
|
Hi Charlie, thanks for proposing the issue internally. | |
| Comment by Charlie Swanson [ 26/Jan/16 ] | |
|
Hi Lukas, I've proposed this internally, and if/when we all agree on the syntax and semantics, we'll work on a fix. I've updated this ticket to reflect the revised plan. I've also removed the backport request, since this is a new feature, and we generally do not backport new features to released versions. As for the duplicates, the $sample stage is logically a sample without replacement, but we cannot guarantee there are not duplicates because of our isolation semantics (see here for more details). This is not a trivial issue to fix, and I don't think we would want to add de-duplicating logic to only the $sample stage, since this is a general problem that should be solved everywhere. | |
| Comment by Lukas Wagner [ 16/Jan/16 ] | |
|
Hi Charlie, yes that would be great. | |
| Comment by Charlie Swanson [ 15/Jan/16 ] | |
|
Hi Lukas, I think your use case might be addressed by something like the following?
This wouldn't be so hard to do. Let me know if that would work for you, and I'll confirm that this makes sense from our end. | |
| Comment by Lukas Wagner [ 13/Jan/16 ] | |
|
Hi Charlie, yep, that would be an option. However, there is no way to weight the randomness by some kind of rating system. Maybe a better approach would be to add that option to the $sample stage. | |
| Comment by Charlie Swanson [ 13/Jan/16 ] | |
|
Would the $sample stage do what you wanted? | |
| Comment by Lukas Wagner [ 13/Jan/16 ] | |
|
Hi Charlie, you'd need it for any kind of randomized access to a collection's data. Right now there is no way to do that at all. Let's take a common real-world example. | |
| Comment by Charlie Swanson [ 13/Jan/16 ] | |
|
Hi Lukas, before we go forward with implementing this (sorry if you've already started), can you describe why you need this expression? What are you using it for? We have some concerns that this may add some subtle complexity to aggregation's optimizer. This would be the first expression that returns different results depending on the order in which it is called, or on whether it is called multiple times, which would make it harder to reason about which optimizations are safe to apply. | |
| Comment by Lukas Wagner [ 13/Jan/16 ] | |
|
Hi Charlie, thanks for the heads up on contributing guidelines and the agreement. I was aware of the guideline but I had yet to sign the agreement. It's all done now. Regards, | |
| Comment by Charlie Swanson [ 12/Jan/16 ] | |
|
m3t4lukas, I'm excited to hear that you are working on a patch! If you're planning to submit a pull request to have this merged into the server project, here is a useful guide to getting started. In particular, you'll have to sign the Contributor's Agreement. Apologies if you already knew this, or already signed that. I'll assign this ticket to myself in the meantime, since I'll likely review your patch, and we can't assign tickets to people outside of MongoDB. Let me know if there's anything I can do to help! | |
| Comment by Lukas Wagner [ 12/Jan/16 ] | |
|
Hi Charlie, what you assumed is correct. If you like you can assign it to me, as I am already working on it. | |
| Comment by Charlie Swanson [ 11/Jan/16 ] | |
|
Hi m3t4lukas, I've filled in the description with what I believe you are asking for, let me know if this is not correct. I've downgraded the priority of this ticket to the default priority. We don't use the priority field when prioritizing new features, so I've changed it to the default to avoid possible confusion in other search results. |