[SERVER-4095] Indexes degrade distinct performance rather than improve them Created: 18/Oct/11 Updated: 11/Jul/16 Resolved: 20/Oct/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance, Performance |
| Affects Version/s: | 2.0.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | Remon van Vliet | Assignee: | Brandon Diamond |
| Resolution: | Done | Votes: | 0 |
| Labels: | indexing, performance | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
All |
||
| Operating System: | ALL |
| Participants: |
| Description |
|
On a test set of 5 million records with one of the fields having a value ranging from 0-1000 we're getting some odd performance : > d1 = new Date(); db.test.distinct("a"); print(new Date()-d1) ) Is there any reasonable explanation for this? It would seem that even if for some reason an index causes degraded performance the code shouldn't try and use it (the distinct code on the server implies it should be an optimization) |
| Comments |
| Comment by Remon van Vliet [ 25/Oct/11 ] |
|
We do. We're using a distinct pass on our score collections as the first step in a national ranking mechanism. So we have, say, 300k people constantly interacting with our game servers for the same game and we need a relatively real-time calculation of their ranking within that game and for the game season as a whole. Currently we do (simplified) : 1) ranking = 1 ) Step 2 right now is taking a couple of seconds which is rather bad for us. (FYI, step 3) we only do synchronously for the top 10 and async for the rest of the table). If you have other suggestions to do this I'm all ears |
| Comment by Brandon Diamond [ 25/Oct/11 ] |
|
I'll take a look into potential optimizations. Do you have a use case where the slowdown from distinct is prohibitive? Or at least, very bad? |
| Comment by Remon van Vliet [ 24/Oct/11 ] |
|
Alright. Do you think there's a way to choose the most optimal code path for this? Meaning, decide on whether to use the index or a straight collection walk based on some criteria known upon the invocation of the distinct command? e.g. if(allPagesInMemory).. |
| Comment by Brandon Diamond [ 20/Oct/11 ] |
|
The code is quite complex but my understanding is that reading the non-indexed data is a straight shot from start to finish whereas the B-tree will involve quite a bit of indirection and extra logic. This CPU cost pays off when you're looking to minimize paging – but since we're hitting everything in both cases and everything is in primary memory, I'd imagine that the delay we're seeing is the cost of that extra logic and indirection. |
| Comment by Remon van Vliet [ 20/Oct/11 ] |
|
I'm trying to think of why a b-tree walk would be slower in any scenario and I can't really come up with any. Oh well, I suppose it works as intended. Thanks! |
| Comment by Brandon Diamond [ 20/Oct/11 ] |
|
I'm going to close this ticket for now. Please reach out again if you have any additional questions! |
| Comment by Brandon Diamond [ 19/Oct/11 ] |
|
After running a bunch of tests and poring over the Distinct implementation, I think what we're seeing is the overhead of iterating through a Btree versus the datafile itself. Since the data isn't too large, both versions should pull directly from primary memory. Further, since we're using a covered index, we don't have to hit the datafile in the indexed case (which would make indexed distinct invocations even worse). I've run the test with larger objects and with imbalanced data and the difference between the indexed attempt and the non-indexed attempt begins to decrease and reverse. To get a bit more insight, you can run the database command directly: db.runCommand( { distinct: "collName", key: "a" }). Both versions hit all data (though the indexed version doesn't need to touch any objects). |
| Comment by Remon van Vliet [ 19/Oct/11 ] |
|
Great. Let me know if I can help. |
| Comment by Brandon Diamond [ 18/Oct/11 ] |
|
Hi Remon, I attempted to reproduce the issue on my own workstation and came up with a similar result: > d1 = new Date(); db.large.distinct("a"); print(new Date()-d1); ) While the latency was less pronounced, it's still there. I'm going to take a look under the hood to see what I can find. I'll report back soon. |