[SERVER-2094] distinct cheat with indexes Created: 14/Nov/10  Updated: 07/Apr/23  Resolved: 23/Jan/14

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: None
Fix Version/s: 2.5.5

Type: Improvement Priority: Major - P3
Reporter: Eliot Horowitz (Inactive) Assignee: hari.khalsa@10gen.com
Resolution: Done Votes: 31
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-1673 Allow distinct to use indexes Closed
depends on SERVER-11939 Implement advanceToNextUniqueKey acce... Closed
Duplicate
is duplicated by SERVER-29244 CLONE - distinct cheat with indexes Closed
Related
related to SERVER-12297 Distinct - nscanned for $in and $or g... Closed
related to SERVER-9507 Optimize $sort+$group+$first pipeline... Closed
is related to SERVER-13271 remove surplus projection from distinct Closed
Participants:

 Description   

Right now distinct will pull data out of indexes, but it scans the entire index.
It could skip whole regions where the key is the same.

The correct way to implement this is probably to add a method to a btree bucket, something like:

getDistinctKeys( int keyOffset )
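
For illustration, a minimal sketch of the intended effect, assuming hypothetical data (an index whose keys come in long runs of equal values):

db.logging.ensureIndex({ level: 1 })   // hypothetical: 10M docs, only 5 distinct levels
db.logging.distinct("level")
// Today: the cursor visits every one of the ~10M index keys (nscanned ~= 10M).
// With the proposed skip: after reading one "debug" key, seek directly to the
// first key > "debug"; the scan costs ~5 seeks instead of ~10M key visits.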



 Comments   
Comment by Githook User [ 19/Mar/14 ]

Author: Hari Khalsa (hkhalsa) <hkhalsa@10gen.com>

Message: SERVER-10026 SERVER-13271 SERVER-12878 SERVER-2094 avoid surplus projections in distinct
Branch: v2.6
https://github.com/mongodb/mongo/commit/094f1565d6f82859bc38300b45564dd1ea9f070e

Comment by Githook User [ 19/Mar/14 ]

Author: Hari Khalsa (hkhalsa) <hkhalsa@10gen.com>

Message: SERVER-10026 SERVER-13271 SERVER-12878 SERVER-2094 avoid surplus projections in distinct
Branch: master
https://github.com/mongodb/mongo/commit/267f56a7e0ce36eba21b4b2ef09e32a43370acbf

Comment by Eliot Horowitz (Inactive) [ 24/Jan/14 ]

Sebastian, no, a backport isn't possible. This relies on the new query execution system going into 2.6.

Comment by Sebastian Paul [ 24/Jan/14 ]

Is it possible that this can be backported to 2.4 stable?

Comment by Githook User [ 23/Jan/14 ]

Author: Hari Khalsa (hkhalsa) <hkhalsa@10gen.com>

Message: SERVER-2094 fast distinct when field is indexed and query is fully covered
Branch: master
https://github.com/mongodb/mongo/commit/05db1ea493f0bb00215e0dd94988f9aa927925bc

Comment by Githook User [ 17/Jan/14 ]

Author: Hari Khalsa (hkhalsa) <hkhalsa@10gen.com>

Message: Revert "SERVER-2094 distinct only wants the field it's distinct-ing over"

This reverts commit 3b5b31f42d390e9da23f66e78a300d997ae653c4.
Branch: master
https://github.com/mongodb/mongo/commit/5f169de0c5a1651efe39eaa6165c4a3414470dd7

Comment by Githook User [ 17/Jan/14 ]

Author: Hari Khalsa (hkhalsa) <hkhalsa@10gen.com>

Message: SERVER-2094 distinct only wants the field it's distinct-ing over
Branch: master
https://github.com/mongodb/mongo/commit/3b5b31f42d390e9da23f66e78a300d997ae653c4

Comment by Richard Uren [ 10/Dec/13 ]

@Benjamin - Your optimization is awesome! Thanks x 1000 (roughly the performance improvement I got).

Comment by Sebastian Paul [ 05/Dec/13 ]

Same here. Using distinct to get log levels, nothing more than:

db.logging.distinct('level');

An index is set on the field. With 13 million objects in a capped collection it takes about 8 s.

Comment by Adam Szpakowski [ 03/Oct/13 ]

Same problem here, distinct takes ~4 seconds on an indexed collection. The collection fits in RAM and the index is used ("nscannedObjects" : 0).

> db.runCommand({ distinct: "sab.sessions", key: "customer" })
{
    "values" : [ ... ],          <-- 5 distinct values
    "stats" : {
        "n" : 7274033,
        "nscanned" : 7274033,
        "nscannedObjects" : 0,
        "timems" : 4043,
        "cursor" : "BtreeCursor customer_1"
    },
    "ok" : 1
}

Now without an index, it takes ~6.5 seconds:

> db.runCommand({ distinct: "sab.sessions", key: "customer" })
{
    "values" : [ ... ],
    "stats" : {
        "n" : 7274033,
        "nscanned" : 7274033,
        "nscannedObjects" : 7274033,
        "timems" : 6553,
        "cursor" : "BasicCursor"
    },
    "ok" : 1
}

I've tried both normal and hashed indices.

Comment by Eliot Horowitz (Inactive) [ 30/May/13 ]

Ali - nscannedObjects is 0, which implies that it is in fact using the index.
Basically it's taking ~8s to walk the entire index.
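
Spelling that out against the stats in Ali's output below (values copied from that output):

// "cursor" : "BtreeCursor action_1"   // the index IS being used
// "nscannedObjects" : 0               // zero documents fetched: the index covers the query
// "nscanned" : 4497162                // but every index key is still visited
// "timems" : 7900                     // ~8s just to walk the whole index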

Comment by Ali Watters [ 27/May/13 ]

On 2.4.3 – collection with an index on a field with ~100 distinct values, and ~4.5 million docs.

~8s on an otherwise OK server (working set in memory) to run:

> db.action_log.distinct('action');

Distinct does not use the index action_1; it appears to be running a table scan.

> db.runCommand({ distinct: 'action_log', key: 'action' })
{
    "values" : [ ... ],
    "stats" : {
        "n" : 4497162,
        "nscanned" : 4497162,
        "nscannedObjects" : 0,
        "timems" : 7900,
        "cursor" : "BtreeCursor action_1"
    },
    "ok" : 1
}

Something feels broken as SERVER-1673 is marked as fixed.

Comment by Hugo Lopes Tavares [ 05/Dec/12 ]

@Benjamin, thank you very much for this hack! I hope the MongoDB team writes a better distinct ASAP.

Comment by Benjamin Jackman [ 30/Nov/12 ]

For those stuck with this problem, I used a workaround that gave me results several orders of magnitude faster.

My use case did not have a compound key, but perhaps this can be adapted for those use cases.

The idea is to first find the smallest key via sort(), then repeatedly issue a find() combining $gt on the last key with sort() to fetch the next distinct key, until all keys are found.

Here is a JS version that can be entered at the shell. Pass in your collection name and the key you want distinct values of (ensure the key is indexed):

function fastDistinct(col, key) {
  // Return the smallest value of `key` among docs matching query q,
  // or null if nothing matches. With an index on `key` this is one seek.
  var next = function (q) {
    var proj = {}; proj[key] = 1; proj._id = 0;   // covered by the index on `key`
    var ord = {}; ord[key] = 1;
    var cur = db[col].find(q, proj).sort(ord).limit(1);
    return cur.hasNext() ? cur.next()[key] : null;
  };
  var val = next({});                      // smallest key overall
  var ret = [];
  while (val !== null) {
    ret.push(val);
    var q = {}; q[key] = { "$gt": val };   // seek past the run of equal keys
    val = next(q);
  }
  return ret;
}
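
For example, with the action_log collection and action field from Ali's comment above (output values are made up):

fastDistinct('action_log', 'action')
// => [ "click", "login", "logout", ... ]   -- one index seek per distinct value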

Comment by Tuner [ 23/Aug/12 ]

Same problem as @Sergey Alaev:
1. find region by city
2. find cities by region and country
3. find streets by country, region and city.

+1 for the feature anyway

Comment by Jason R. Coombs [ 23/Aug/12 ]

Our use case is simple - we have a chat log server where all log entries are stored in a single collection, with one field, 'channel', which is indexed.

We would like to determine quickly which channels have been logged. It's a small number of items (<50) compared to the number of log entries (>1M). We would like to retrieve the channels without scanning every log entry (or index entry).

Comment by Kevin McCarthy [ 21/Nov/11 ]

One more use case to add to the pile:

We have a few million real estate properties with things like city, state, neighborhood, condo building name, etc.

We do lots of distincts. For example:

  • List of all distinct neighborhoods in San Diego
  • List of all distinct neighborhoods in a certain region of San Diego
  • List of all distinct building names in a list of neighborhoods in San Diego

We cache these but the data changes constantly so that only goes so far. We can't really denormalize / precompute these because there are so many combinations of the filter criteria that it isn't practical.
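
A sketch of those calls, using hypothetical collection and field names:

db.properties.distinct('neighborhood', { city: 'San Diego' })
db.properties.distinct('neighborhood', { city: 'San Diego', region: 'North County' })
db.properties.distinct('building',
                       { city: 'San Diego', neighborhood: { $in: ['La Jolla', 'Hillcrest'] } })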

Comment by wouter alleweireldt [ 21/Nov/11 ]

@Eliot:

we're keeping track of a set of our data with a sort of timestamp split across fields (a year field, a month field, and a day field), and we need a distinct on each of those fields to return only the dates that have valid data. Right now we need to run three queries: distinct on year, distinct on month for that year, and distinct on day for that month/year. This totals up to (sometimes over) a second on a rather small set of data (1.1 million docs), which is way too slow for our needs (SQL runs the very same query in a few milliseconds).
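
For concreteness, those three round trips might look like this (collection and field names assumed):

var years  = db.data.distinct('year')
var months = db.data.distinct('month', { year: 2011 })
var days   = db.data.distinct('day',   { year: 2011, month: 11 })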

Comment by Sergey Alaev [ 18/Nov/11 ]

I am waiting for this ticket too - it is very important for fast queries on denormalized trees.
Our case is an address database, which is basically a tree: country-region-city-street-house.
Queries we need:
1. find region by city
2. find cities by region and country
3. find streets by country, region and city.

The only way I found is a denormalized tree, but in this case query 2 returns a HUGE number of duplicate cities.

And currently MongoDB's distinct is not much faster than one implemented in Java on the client side.

edit: denormalization is the only way to avoid client-side joins, but if you do not want to build different collections for different join combinations, effective grouping or distinct is required. Something like that
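
A sketch of those three queries against such a denormalized collection (collection, field, and value names assumed):

db.addresses.distinct('region', { city: 'Springfield' })
db.addresses.distinct('city',   { country: 'US', region: 'IL' })
db.addresses.distinct('street', { country: 'US', region: 'IL', city: 'Springfield' })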

Comment by Eliot Horowitz (Inactive) [ 18/Nov/11 ]

@wouter - what's the example you have? Just to see if it's really relevant.

Comment by wouter alleweireldt [ 18/Nov/11 ]

Any idea when this could be implemented?
Some of our queries require a fast distinct, but the way distinct works now, we're not getting the performance we need...
