[SERVER-533] Aggregation stage to randomly sample documents Created: 11/Jan/10 Updated: 06/Apr/23 Resolved: 20/Jul/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework, Index Maintenance, Usability |
| Affects Version/s: | None |
| Fix Version/s: | 3.1.6 |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | Janitha Karunaratne | Assignee: | Charlie Swanson |
| Resolution: | Done | Votes: | 183 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Global, all environments |
||
| Description |
|
We've decided to go about this by adding a new aggregation stage: $sample. Given a positive integer, the stage will pseudo-randomly choose that number of documents from the incoming stream of documents, which is implicitly the entire collection when $sample is the first stage in the pipeline. Note that this ticket will only track the aggregation stage functionality, and this implementation will be very slow until follow-up optimization work is complete.

Original description: An easier approach would be requesting a random item directly from mongo given a query:

photos.find({"author": "johndoe"}).random()
photos.random_one({"author": "johndoe"}) |
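For illustration, combining the stage with a $match gives the query from the original description; the collection and field names here are just examples:

```javascript
// Four pseudo-randomly chosen photos by this author, in one pipeline.
db.photos.aggregate([
    { $match: { author: "johndoe" } },
    { $sample: { size: 4 } }
])
```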
| Comments |
| Comment by Githook User [ 20/Jul/15 ] |
|
Author: Charlie Swanson (cswanson310) <charlie.swanson@mongodb.com>
Message: |
| Comment by Charlie Swanson [ 09/Jul/15 ] |
|
davidn, the short answer is yes: the algorithm samples without replacement. However, it is not guaranteed that all returned documents will be unique, as that is not a guarantee of the query system (see here). Jeroenooms, while sampling with replacement would be nice, it is out of scope for this part of the project. If this is important functionality for you, please open a new issue so we can track that work. |
| Comment by Jeroen Ooms [X] [ 29/Jun/15 ] |
|
Actually if there would be an additional option for sampling with replacement, that would open up some very powerful statistical applications. |
| Comment by David Nadle [ 29/Jun/15 ] |
|
Based on the description it sounds like this is sampling without replacement. Right? |
| Comment by Charlie Swanson [ 29/Jun/15 ] |
|
rgpublic, I have updated the description to match what we plan to do. |
| Comment by rgpublic [ 29/Jun/15 ] |
|
Just wondering: If this feature is implemented... What if I need multiple random items? Say 4 random photos out of a collection of 100000 photos. Will I have to call this option multiple times and hope that it isn't the same item again until I have enough items? Wouldn't that be inefficient? Should this be implemented in the database as well? |
| Comment by Issel Guberna [ 21/Jun/15 ] |
|
+1!! |
| Comment by ahmed talha khan [ 15/Jun/15 ] |
|
+1 for this feature. |
| Comment by Andrei Odeski [ 13/Apr/15 ] |
|
+1. We have been waiting for this feature for quite some time; let's push it into the next sub-release, like 3.1. We need a way to efficiently generate random records without hacks like this:

db.yourCollection.find().limit(-1).skip(yourRandomNumber).next() |
| Comment by Jeroen Ooms [X] [ 06/Apr/15 ] |
|
+1 for built-in random sorting. Taking a random sample from a dataset is required by many applications in statistics and data-science in order to generalize properties of the sample to the larger dataset. |
| Comment by Dissatisfied Former User [ 27/Jan/15 ] |
|
+1 for a $random option to aggregation sort, but this would also be useful in normal queries: query.order_by('$rand') or some such. Basically, switch the sort mechanism from "quick sort" (or whatever is actually in use) to a random shuffle. |
| Comment by Callum Gare [ 28/Sep/14 ] |
|
If this does happen then a way to seed the random so a particular result can be reproduced if needed would be much appreciated. By me at least |
| Comment by Thomas Lorimor [ 05/Jun/14 ] |
|
It seems like pulling random records could be accomplished easily by simply adding a new $random expression to the aggregation framework. Then you could, say, query for 5 random people of age 25:

db.foo.aggregate([
    { $match: { age: 25 } },
    { $sort: { "$random": 1 } },
    { $limit: 5 }
]) |
| Comment by Ryan Hamilton-Schumacher [ 14/Jan/14 ] |
|
Post a new question on Stackoverflow or use the Google+ community page, this is not the place for these questions. |
| Comment by Antonio C Nalesso [ 14/Jan/14 ] |
|
Hi there, I have tried some of the solutions from that Stack Overflow link, but they don't seem to work properly. This one seems to be working fine; however, I am aware that it may not scale, right? Thank you, with regards. |
| Comment by Ryan Hamilton-Schumacher [ 14/Jan/14 ] |
|
Someone has to implement it, but I would guess it will take some time in both design and execution to do it right. I would not hold your breath but use one of the various alternatives http://stackoverflow.com/questions/2824157/random-record-from-mongodb/5517206 |
| Comment by Antonio C Nalesso [ 14/Jan/14 ] |
|
Dear Ryan, |
| Comment by Ryan Hamilton-Schumacher [ 14/Jan/14 ] |
|
@Antonio there is nothing to merge. See the Fix Version === "planned but not scheduled" |
| Comment by Antonio C Nalesso [ 13/Jan/14 ] |
|
Hi there, Kind regards, |
| Comment by Jaromir Muller [X] [ 05/Jan/14 ] |
|
+1 |
| Comment by Derk-Jan Karrenbeld [ 28/Oct/13 ] |
|
@Victor Hooi, you can, by skipping a random number of records and then selecting one. Yay for skip. |
| Comment by Victor Hooi TEST [ 28/Oct/13 ] |
|
I'm working on a product recommendation system, and for one mode, we needed a firehose of randomised products. Doing a count and selecting based on ID isn't a viable option, since we can't assume that the product IDs are contiguous. It would be nice if MongoDB offered some kind of syntax to handle this easily. |
| Comment by Alexandre Tolstenko [ 25/Sep/13 ] |
|
This feature will help me a lot! |
| Comment by David Nadle [ 25/Aug/13 ] |
|
I have a need for random sampling of a collection too. I hope this comment helps point the way to a solution.

There are two kinds of random sampling: without replacement and with replacement. People above seem to be divided about which type they need. I believe that sampling without replacement is more commonly needed.

I think the best way to implement both types is with a random iterator. In the case of sampling with replacement (drawing balls from the urn and putting them back), the iterator would return any document in the cursor with equal probability, and never reach the "end." In the case of sampling without replacement, the iterator would have to internally remove iterated docs from the cursor before selecting the "next" one from the remaining sample with equal probability. This cursor would end when all docs have been iterated.

An easy but much less efficient way to get sampling without replacement would be to just apply a random shuffle to the cursor before iteration begins, then use the normal iterator. I bet this would be an acceptable solution to many if not most use cases.

Because MongoDB is designed around iterating results and not list-based access, the most efficient place to implement these features is internal to MongoDB. I would really benefit from sampling without replacement and hope you'll schedule this feature. |
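The without-replacement iterator described above can be sketched with reservoir sampling, which picks k documents uniformly in a single pass over a stream; the documents and seeded PRNG below are purely illustrative:

```javascript
// Tiny deterministic PRNG so the example is reproducible (not for production use).
function lcg(seed) {
  let s = seed >>> 0;
  return function () {
    s = (1664525 * s + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

// Reservoir sampling (Algorithm R): returns k documents chosen uniformly,
// without replacement, from a stream of unknown length.
function reservoirSample(stream, k, rand) {
  const reservoir = [];
  let seen = 0;
  for (const doc of stream) {
    seen += 1;
    if (reservoir.length < k) {
      reservoir.push(doc);
    } else {
      // Keep each slot with probability k / seen.
      const j = Math.floor(rand() * seen);
      if (j < k) reservoir[j] = doc;
    }
  }
  return reservoir;
}

// Illustrative "collection" of 100 documents.
const docs = Array.from({ length: 100 }, (_, i) => ({ _id: i }));
const sample = reservoirSample(docs, 4, lcg(42));
console.log(sample.map((d) => d._id));
```

Each document enters the reservoir at most once, so the result is guaranteed duplicate-free, which matches the without-replacement semantics discussed in the comment.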
| Comment by Daniel Albert [ 30/Jul/13 ] |
|
I need this feature too. The cookbook's solution is not a good option: I have detected some documents are more frequent than others. |
| Comment by Dmitri Cherepovski [ 30/Jul/13 ] |
I have a large collection of users, stored in MongoDB, and I need to implement some split tests on small randomly chosen groups of users. And I want those groups to be random each time, so I don't add random 'split ID' to user records stored. Random queries are useful, please implement them ^____^ |
| Comment by felix terkhorn [ 03/Jul/13 ] |
|
Agreed, this would be a great feature to have! |
| Comment by Xianhang Zhang [ 29/May/13 ] |
|
Would love to see this implemented! |
| Comment by Isuri Jinasena [ 09/Apr/13 ] |
|
I wish a random option would be implemented. Any news about .random being featured in an upcoming release? |
| Comment by Martin Elvar [ 05/Apr/13 ] |
|
This really needs to be implemented. It's really frustrating that you need to add extra keys to your document in order to achieve a random result; it also feels really hacky. |
| Comment by Alexey Kamensky [ 20/Mar/13 ] |
|
@dimonb, Dmitry, |
| Comment by Dmitry Balabanov [ 21/Dec/12 ] |
|
Alexey, |
| Comment by Evheny Nemov [ 21/Dec/12 ] |
|
Pleeeeeeeeeeeeeeeeease, implement it. |
| Comment by Alexey Kamensky [ 20/Dec/12 ] |
|
Currently working on a large collection. The solution with a random parameter is a real pain. We often need to get a random document; I think .sort({"$random": 1}) would do it. |
| Comment by Evheny Nemov [ 07/Dec/12 ] |
|
I want .sort({"$random": 1}) |
| Comment by Tristan Launay [ 08/Nov/12 ] |
|
I really wish there was a built-in, simple, no-nonsense way to get random data from a database. A .random() option would totally fit my needs. Here's a simplified overview of my application and needs. I am using 3 tables: A (user), B (product) and C (score). The industrial process behind these tables relies on the user (a from A) picking a product (b from B) and giving it a score (c in C, which corresponds to a unique pair (a, b)). However, we would like to be able to estimate the average score over the whole db. We do not have every score possible (and never will), and the user usually scores products he/she is "interested" in. This is why we would like to randomly draw a user and a product, and ask the user to score this product, so as to estimate the average score à la Monte Carlo. |
| Comment by Fredy [ 15/Aug/12 ] |
|
I need this function to draw random questions in my quiz application, completely at random. I hope this gets resolved. |
| Comment by Gerard Braad [ 07/Aug/12 ] |
|
Currently modeling a language test system using Mongo as the back-end storage. For generation of preparation test sets I would like data to be retrieved as a random set from a collection. At the moment I realize this will involve a lot more work on the back-end code than necessary... ... I wish a random option would be available. |
| Comment by David James [ 15/Jun/12 ] |
|
Bu-bump! The comments above are spot on. I'm putting a good amount of data into Mongo. I want to query some of it randomly. Having the database know how to do this just makes sense. I've paid a lot more attention to random numbers in the past year or so. Enough to say this: most people (including myself) have a tendency to mess up randomness in their code without even realizing it. This should be solved well by the MongoDB engineering team so application developers don't bork it up over and over again. Surely the 10gen team can think it through and come up with a good solution for most people? I'm not saying this is easy or trivial to fix; this ticket raises some interesting points! (1) The skip and limit modifiers don't scale well. (2) The random number field index "solution" is creative but still a hack. (3) ... (And there are lots of details I haven't examined.) This is something I take for granted with relational DBs. So, I'm currently on a mission to raise awareness about this. I think I'll write a blog post. The title? "MongoDB loses, randomly, to relational databases!" |
| Comment by Kevin Lacker [ 10/Jan/12 ] |
|
The cookbook solution is not actually random if you use it more than once, in the sense that multiple lookups will not be independent. If a particular entry has a random field that is particularly far away from adjacent entries, it will be more likely on each lookup. So it doesn't really work if you are doing the lookup more than once, which is pretty common. |
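The non-uniformity Kevin describes can be shown with a small simulation, using made-up fixed "random" attribute values and a toy seeded PRNG; each lookup draws r and returns the first document whose stored value is >= r, wrapping to the smallest value when none qualifies:

```javascript
// Tiny deterministic PRNG so the simulation is reproducible.
function lcg(seed) {
  let s = seed >>> 0;
  return function () {
    s = (1664525 * s + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

// Pre-assigned "random" fields on five hypothetical documents, sorted.
const keys = [0.10, 0.15, 0.20, 0.90, 0.95];
const counts = new Map(keys.map((k) => [k, 0]));
const rand = lcg(7);

for (let t = 0; t < 10000; t++) {
  const r = rand();
  // Cookbook-style lookup: first key >= r, wrapping around to the smallest.
  const hit = keys.find((k) => k >= r) ?? keys[0];
  counts.set(hit, counts.get(hit) + 1);
}

// The document at 0.90 sits after a 0.70-wide gap, so it is returned in
// roughly 70% of lookups: the draws are far from uniform across documents.
console.log([...counts.entries()]);
```

This is exactly the repeated-lookup bias in the comment above: a document's selection probability is proportional to the gap before its stored value, not 1/N.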
| Comment by Andy Baird [ 14/Dec/11 ] |
|
Just started using Mongo for a Facebook game I'm making. Random order on a result set is quite useful for many different game aspects as a way to involve chance as a game element. Maybe that's not a major use case, but I was really surprised that Mongo didn't already support this. |
| Comment by Michael Korbakov [ 28/Nov/11 ] |
|
This feature would also be really useful when you need to compare different indexing schemes for a huge collection. Complete indexing can easily take as long as a full day per index. When all you need is to compare index entry size per document, it's better to experiment on a random subset of the full collection. |
| Comment by Kendrick Taylor [ 13/Nov/11 ] |
|
This would be a very valuable feature, not to mention the solutions to roll your own aren't very robust, and the use case is pretty common: for us it's displaying randomized data in a right rail, other articles, users, etc. |
| Comment by Leo Dirac [ 28/Oct/11 ] |
|
@Eliot, I like the random walk into the B-tree idea. O(X * log(N)) would be a vast improvement over the O(X + N*log(N)) solution that MySQL appears to use. Would this work on result sets without indexes? Like capped collections? |
| Comment by Rachel Willmer [ 05/Oct/11 ] |
|
This was closed as "Won't Fix" 18 months ago. Why? You're asking everyone who needs this feature to reimplement their own solution, whereas in most dbs that people might be familiar with, this would be considered part of the standard functionality. Any change on this stance in the last 18 months? |
| Comment by Jeff Judge [ 24/Mar/11 ] |
|
Alberto, Two good use cases with email marketing: 1) Companies like to A/B test email copy, and it's helpful to grab a couple of sets of random subscribers (say 1000 each) to compare engagement. 2) When dealing with large sets of data (say 6MM subscribers), it's helpful to grab 1000 random records to determine what percentage of the records match specific types of attributes. For example, if we know 673 people out of the 1000 are interested in receiving daily deals, we can apply statistical analysis to the larger set to see approximately how many people in total would be interested. It's better to do this analysis against a smaller data set for performance reasons, then go create the segment in the background after you're done with your analysis. My company is currently dealing with different ways to do #1 above with CouchDB, and we've been considering moving our user attribute data to MongoDB because of its flexibility. Creating a random collection in Mongo would be super helpful for us. Thanks, |
| Comment by Alberto Lerner [ 27/Jun/10 ] |
|
phpMoAdmin, Neither of the two solutions in the Cookbook addresses a find(). They apply to one random element of the result with findOne() and to a map reduction over a sample of data, respectively. Sorting animals per race in a compound key with gender and age... yes, that seems rather specific. If that is the killer feature for more apps, all I'm asking is a bit more information, so as to figure out how to make the important cases faster, if we were to make this a feature in the server. |
| Comment by phpMoAdmin [ 27/Jun/10 ] |
|
With the cookbook solution, wouldn't the results of find() (not findOne() ) always be returned in the same order? Yes, the starting point is random, but subsequent results based on $gt/$lt would work off of the fixed "random" key in the data. Alberto, are you suggesting that a randomly-ordered resultset is an esoteric usage-case? |
| Comment by Alberto Lerner [ 24/Jun/10 ] |
|
Loic, I think what you mean is that you have a case that doesn't match the ones described by the recipe. The first case requests a single random document, not many, as you assumed. So, yes, the techniques in there do not cover all the cases. The behavior you suggest is reasonable. But we'd like to see usage cases that feel less esoteric that would justify building that machinery inside the server. |
| Comment by Loïc Faure-Lacroix [ 23/Jun/10 ] |
|
I don't believe that the cookbook recipe is an actual solution. Even if you query a random number and each document has a random number in it, it wouldn't work if you want to return more documents than the actual query matches. To me, I'd see a sort this way: db.coll.find().sort({'field': RANDOM}), so that it is possible to walk a cursor. Internally, the behaviour could be: return a random item up to the limit of queried elements, with no element returned more than once; skip would have no effect, and sorts after the first random sort would have no effect. For example, in db.animals.find().sort({'sex': 1, 'race': random, 'age': 1}), the sort on sex would work, then the sort on race would randomize races, but the age sort should be discarded; or the sort could be done, but if we sort many values of race, it would probably be inefficient to sort on randomized values. |
| Comment by Alberto Lerner [ 27/May/10 ] |
|
PhpMoAdmin, I'm a new member of the 10gen team. Granted, I'm learning about its internals every day. Take the John Doe query, for example. The cookbook recipe portrays it as a query over the db.docs collection. 'author' would be equivalent to 'key' in that example, and the collection would have the extra random attribute. Now, suppose you want { key : 2 } of that collection, but just a random element of the result. Instead of opening a cursor over that search criteria, we take a draw and issue what amounts to a lookup query instead. For instance, you could have used Math.random() and asked for { key : 2, random : { $gte : <your random draw> } }. Because the docs in that collection have a random attribute of their own, this new query can now be solved via a lookup. A caveat: there needs to be an index covering 'key' and 'random'. You can do an explain() on the examples yourself. |
| Comment by phpMoAdmin [ 27/May/10 ] |
|
From my understanding, you are not partially executing the query; you are getting a count from a cursor, then moving the cursor position, and then retrieving a single object. Unless establishing a cursor or using the skip function actually queries data, this should not be a time-impacting action beyond the time it takes to retrieve the single object. Perhaps someone with knowledge of Mongo internals can chime in on whether my perception is correct? |
| Comment by Alberto Lerner [ 27/May/10 ] |
|
PhpMoAdmin, taking a random draw out of the result count is always a good possibility, too. If one can trade extra storage space for time, the random attribute allows for executing the query partially and still getting the desired effect. |
| Comment by phpMoAdmin [ 27/May/10 ] |
|
Alberto, feel free to add the JS version of my previous post to your cookbook. It serves the same purpose as your cookbook's db.docs.findOne example but does not require adding any "random" data into the collection. |
| Comment by Alberto Lerner [ 27/May/10 ] |
|
There's a new and improved version of the cookbook recipe. For now, this functionality can be obtained with efficient client-side logic. http://cookbook.mongodb.org/patterns/random-attribute/ Thanks Aaron, Dwight, and everybody who contributed corrections and comments. |
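As far as the comments in this thread describe it, the recipe's general shape is roughly the following; this is a sketch rather than the cookbook's exact code, and the collection and field names are illustrative:

```javascript
// Store a pre-computed random value on each document and index it
// together with the query key.
db.docs.insert({ key: 2, random: Math.random() });
db.docs.ensureIndex({ key: 1, random: 1 });

// One random matching document: draw r, do an index lookup, and wrap
// around with $lte if the first probe finds nothing.
var r = Math.random();
var doc = db.docs.findOne({ key: 2, random: { $gte: r } }) ||
          db.docs.findOne({ key: 2, random: { $lte: r } });
```

Note that, as discussed elsewhere in this thread, repeated lookups against the same stored values are not independent: documents after large gaps in the stored random values are returned more often.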
| Comment by phpMoAdmin [ 25/May/10 ] |
|
Just figured out a way to get a single random object (basically findOne) with find() criteria via PHP; the same concept should be applicable in JS and other languages:

$cursor = $this->selectCollection('myCollection')->find(array('color' => 'red'));
// Jump to a random position in the result set, then take that object.
$cursor->skip(rand(0, $cursor->count() - 1));
$randomDoc = $cursor->getNext();

Can this logic be packaged up into the Mongo core as $col->sort(array('$random' => true))->findOne()? |
| Comment by phpMoAdmin [ 25/May/10 ] |
|
+1 on the option for .sort({"$random": 1}). Not sure how this would perform. Just a thought: PHP has a shuffle() function that randomizes the order of an array; perhaps the same internal logic could be applied to the objects within a collection or index before retrieval? |
| Comment by Aaron Staple [ 20/May/10 ] |
|
Here's a "random" idea. I believe our btree buckets are stored back to back on disk, so we can just pick one randomly (not using btree structure, just treating disk representation as an array of buckets) and then pick a random bit inside the chosen bucket. If the bit is inside a key, we return the document for that key. Otherwise we try again. This works if the keys are uniform size (eg auto generated id which is common). If they're not, we can discount the probability of returning a key by its size by doing a second probability test to determine whether or not to return the document. If we want to return lots of random docs, we can probably work things to scan each btree bucket at most once - maybe by generating the number of elements to return per bucket using a poisson distribution if that works. Of course we won't get the same data locality as with the prepopulated random values described in the cookbook. Anyway, just a brainstorm. |
| Comment by Alberto Lerner [ 19/May/10 ] |
|
I have created a cookbook recipe based on Janitha's and Jared's suggestions. Please, have a look at http://cookbook.mongodb.org/patterns/random-attribute/ and let me know if these techniques solve the issue. |
| Comment by Jared Rosoff [ 19/Mar/10 ] |
|
What if random sample was an option to a map-reduce instead of a modifier for find? This avoids the btree entirely, since you're already doing a table scan, and simply reduces the number of records actually fed to the map phase. For example, we have a map-reduce job that calculates statistics over a large set of docs. There's no need for us to calculate this over the whole collection; hitting just a random set of around log N of the N documents in the collection would be sufficient. However, it's much easier to write this algorithm as a map-reduce job than to pull the data out of the db and do the calculation externally. But as a map-reduce job, we only have the choice of running it on the whole collection, or on some non-random subset using a query. Would be nice if you could do something like this:

db.runCommand({
    mapreduce: <collection>,
    map: <map_function>,
    reduce: <reduce_function>
    [, sample: <total_number_of_random_records_to_send_to_map>]
})

And have mongo choose <sample> random records from the collection to send to map. Seems like this would let statistical algorithms written in map-reduce run in sublinear time. |
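For what it's worth, the $sample stage this ticket eventually tracked composes with the rest of the aggregation pipeline in much the way described above; a sketch, with hypothetical collection and field names:

```javascript
// Estimate a statistic from a random subset instead of the full collection.
db.docs.aggregate([
    { $sample: { size: 1000 } },
    { $group: { _id: null, avgScore: { $avg: "$score" } } }
])
```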
| Comment by Colin Mollenhour [ 23/Feb/10 ] |
|
Could this be implemented as a sort-order modifier akin to SQL's ORDER BY RAND()? Combined with limit, you could choose 1 or X, and iterating would be no different than on a sorted set. Of course, it would need to only randomize records up to the limit for performance, but I don't know if this is a problem with the way cursors are implemented.

photos.find({"author": "johndoe"}).random().limit(5)
photos.find({"author": "johndoe"}).sort({"$random": 1}).limit(5) |
| Comment by Eliot Horowitz (Inactive) [ 13/Jan/10 ] |
|
Could do a random walk into the btree |