[SERVER-2094] distinct cheat with indexes Created: 14/Nov/10 Updated: 07/Apr/23 Resolved: 23/Jan/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance |
| Affects Version/s: | None |
| Fix Version/s: | 2.5.5 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Eliot Horowitz (Inactive) | Assignee: | hari.khalsa@10gen.com |
| Resolution: | Done | Votes: | 31 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||
| Description |
|
Right now distinct will pull data out of indexes, but will scan the entire thing. Correct way to implement is probably adding a method to a btree bucket like getDisinctKeys( int keyOffset ) |
| Comments |
| Comment by Githook User [ 19/Mar/14 ] | ||||||||||||||
|
Author: {u'username': u'hkhalsa', u'name': u'Hari Khalsa', u'email': u'hkhalsa@10gen.com'}Message: | ||||||||||||||
| Comment by Githook User [ 19/Mar/14 ] | ||||||||||||||
|
Author: {u'username': u'hkhalsa', u'name': u'Hari Khalsa', u'email': u'hkhalsa@10gen.com'}Message: | ||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 24/Jan/14 ] | ||||||||||||||
|
Sebastian, no, a backport isn't possible. This relies on the query execution system going in 2.6. | ||||||||||||||
| Comment by Sebastian Paul [ 24/Jan/14 ] | ||||||||||||||
|
is it possible that this can be backported to 2.4 stable? | ||||||||||||||
| Comment by Githook User [ 23/Jan/14 ] | ||||||||||||||
|
Author: {u'username': u'hkhalsa', u'name': u'Hari Khalsa', u'email': u'hkhalsa@10gen.com'}Message: | ||||||||||||||
| Comment by Githook User [ 17/Jan/14 ] | ||||||||||||||
|
Author: {u'username': u'hkhalsa', u'name': u'Hari Khalsa', u'email': u'hkhalsa@10gen.com'}Message: Revert " This reverts commit 3b5b31f42d390e9da23f66e78a300d997ae653c4. | ||||||||||||||
| Comment by Githook User [ 17/Jan/14 ] | ||||||||||||||
|
Author: {u'username': u'hkhalsa', u'name': u'Hari Khalsa', u'email': u'hkhalsa@10gen.com'}Message: | ||||||||||||||
| Comment by Richard Uren [ 10/Dec/13 ] | ||||||||||||||
|
@Benjamin - Your optimization is awesome ! Thanks x 1000 (the performance improvement I got). | ||||||||||||||
| Comment by Sebastian Paul [ 05/Dec/13 ] | ||||||||||||||
|
same here. Using distinct to get log levels. Index is set 13 million objects in a capped collection | ||||||||||||||
| Comment by Adam Szpakowski [ 03/Oct/13 ] | ||||||||||||||
|
Same problem here, distinct takes ~4s (seconds) on indexed collection. Collection fits in RAM, index is used "nscannedObjects" : 0. > db.runCommand( {distinct: "sab.sessions", key: "customer"}) , Now withouth index, it takes ~6.5 seconds: ) , I've tried both normal and hashed inices. | ||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 30/May/13 ] | ||||||||||||||
|
Ali - nscannedObjects is 0, which implied that is in fact using the index. | ||||||||||||||
| Comment by Ali Watters [ 27/May/13 ] | ||||||||||||||
|
On 2.4.3 – collection with an index on a field with ~100 distinct values, and ~4.5 million docs. ~8s on ok server (working set in memory) to run > db.action_log.distinct('action'); Distinct does not use the index action_1, appears to be running a table scan. > db.runCommand( { distinct: 'action_log', key:'action'}); , Something feels broken as | ||||||||||||||
| Comment by Hugo Lopes Tavares [ 05/Dec/12 ] | ||||||||||||||
|
@Benjamin, thank you very much for this hack! I hope mongodb team write a better distinct asap. | ||||||||||||||
| Comment by Benjamin Jackman [ 30/Nov/12 ] | ||||||||||||||
|
For those stuck with this problem I used a work around My use case did not have a compound key, but perhaps this can be adapted for The idea is to first find the smallest key via sort, Here a js version that could be entered at the command substituting in
| ||||||||||||||
| Comment by Tuner [ 23/Aug/12 ] | ||||||||||||||
|
Same problem like @Sergey Alaev: +1 for the future anyway | ||||||||||||||
| Comment by Jason R. Coombs [ 23/Aug/12 ] | ||||||||||||||
|
Our use case is simple - we have a chat log server where all log entries are stored in a single collection, one field which is the 'channel', indexed. We would like to determine quickly which channels have been logged. It's a small number of items (<50) compared to the number of log entries (>1M). We would like to retrieve the channels without scanning every log entry (or index entry). | ||||||||||||||
| Comment by Kevin McCarthy [ 21/Nov/11 ] | ||||||||||||||
|
One more use case to add to the pile: We have a few million real estate properties with things like city, state, neighborhood, condo building name, etc. We do lots of distincts. For example:
We cache these but the data changes constantly so that only goes so far. We can't really denormalize / precompute these because there are so many combinations of the filter criteria that it isn't practical. | ||||||||||||||
| Comment by wouter alleweireldt [ 21/Nov/11 ] | ||||||||||||||
|
@Eliot: we're keeping track on a set of our data with of a sort of timestamp field (year field/month field/day field), and we need a distinct on each of those fields to return only the dates that have valid data. Right now we need to run 3 queries : distinct on year, dinstinct on month for that year, distinct on day for that month/year, and totals up to (sometimes over) a second on a rather small set of data (1.1 million docs) which is way to slow for our needs (compared to SQL running the very same query, taking only a few millisecs) | ||||||||||||||
| Comment by Sergey Alaev [ 18/Nov/11 ] | ||||||||||||||
|
I am waiting for this ticked too - it is very important for fast queries on denormalized trees The only way i found is denormalized tree, but in this case query 2 returns HUGE amount of duplicate cities. And currently MongoDB's distinct is not much faster than implemented in java on client side. edit: denormalization is the only way to avoid client-side joins, but if you do not want to build different collections for different join combinations, effective grouping or distinct is required. Something like that | ||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 18/Nov/11 ] | ||||||||||||||
|
@wouter - what's the example you have? just to see if its really relevant | ||||||||||||||
| Comment by wouter alleweireldt [ 18/Nov/11 ] | ||||||||||||||
|
Any idea when this could be implemented? |