<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 03:04:56 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-4095] Indexes degrade distinct performance rather than improve them</title>
                <link>https://jira.mongodb.org/browse/SERVER-4095</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;On a test set of 5 million records, with one field holding values ranging from 0 to 1000, we&apos;re seeing some odd performance:&lt;/p&gt;

&lt;p&gt;&amp;gt; d1 = new Date(); db.test.distinct(&quot;a&quot;); print(new Date()-d1)&lt;br/&gt;
4086&lt;br/&gt;
&amp;gt; d1 = new Date(); db.test.distinct(&quot;a&quot;); print(new Date()-d1)&lt;br/&gt;
4078&lt;br/&gt;
&amp;gt; db.test.ensureIndex({a:1})&lt;br/&gt;
&amp;gt; d1 = new Date(); db.test.distinct(&quot;a&quot;); print(new Date()-d1)&lt;br/&gt;
9181&lt;br/&gt;
&amp;gt; d1 = new Date(); db.test.distinct(&quot;a&quot;); print(new Date()-d1)&lt;br/&gt;
9183&lt;/p&gt;

&lt;p&gt;Is there a reasonable explanation for this? It would seem that if an index degrades performance for some reason, the code shouldn&apos;t try to use it (the distinct code on the server implies that using the index is meant to be an optimization).&lt;/p&gt;</description>
                <environment>All</environment>
        <key id="23760">SERVER-4095</key>
            <summary>Indexes degrade distinct performance rather than improve them</summary>
                <type id="1" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14703&amp;avatarType=issuetype">Bug</type>
                                            <priority id="4" iconUrl="https://jira.mongodb.org/images/icons/priorities/minor.svg">Minor - P4</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="9">Done</resolution>
                                        <assignee username="brandon">Brandon Diamond</assignee>
                                    <reporter username="remonvv">Remon van Vliet</reporter>
                        <labels>
                            <label>indexing</label>
                            <label>performance</label>
                    </labels>
                <created>Tue, 18 Oct 2011 15:59:54 +0000</created>
                <updated>Mon, 11 Jul 2016 18:34:45 +0000</updated>
                            <resolved>Thu, 20 Oct 2011 15:52:09 +0000</resolved>
                                    <version>2.0.0</version>
                                                    <component>Index Maintenance</component>
                    <component>Performance</component>
                                        <votes>0</votes>
                                    <watches>2</watches>
                                                                                                                <comments>
                            <comment id="62485" author="remonvv" created="Tue, 25 Oct 2011 14:05:47 +0000"  >&lt;p&gt;We do. We&apos;re using a distinct pass on our score collections as the first step in a national ranking mechanism. So we have, say, 300k people constantly interacting with our game servers for the same game, and we need a relatively real-time calculation of their ranking within that game and for the game season as a whole. Currently we do (simplified):&lt;/p&gt;

&lt;p&gt;1) ranking = 1&lt;br/&gt;
2) distinctScores = distinct(&quot;score&quot;).sort({score:-1})&lt;br/&gt;
3) for each(distinctScore) { scoreToRankMap[distinctScore] = ranking; ranking += countPeopleWith(distinctScore); }&lt;/p&gt;

&lt;p&gt;Step 2 currently takes a couple of seconds, which is rather bad for us. (FYI, we only do step 3 synchronously for the top 10 and asynchronously for the rest of the table.) If you have other suggestions for doing this, I&apos;m all ears &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.mongodb.org/images/icons/emoticons/wink.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;&lt;/p&gt;</comment>
                            <comment id="62479" author="brandon" created="Tue, 25 Oct 2011 13:45:52 +0000"  >&lt;p&gt;I&apos;ll take a look into potential optimizations. Do you have a use case where the slowdown from distinct is prohibitive? Or at least, very bad?&lt;/p&gt;</comment>
                            <comment id="62133" author="remonvv" created="Mon, 24 Oct 2011 11:03:57 +0000"  >&lt;p&gt;Alright. Do you think there&apos;s a way to choose the optimal code path for this? Meaning, decide whether to use the index or a straight collection walk based on some criteria known when the distinct command is invoked, e.g. if (allPagesInMemory) ...&lt;/p&gt;</comment>
                            <comment id="61700" author="brandon" created="Thu, 20 Oct 2011 21:35:52 +0000"  >&lt;p&gt;The code is quite complex but my understanding is that reading the non-indexed data is a straight shot from start to finish whereas the B-tree will involve quite a bit of indirection and extra logic. This CPU cost pays off when you&apos;re looking to minimize paging &amp;#8211; but since we&apos;re hitting everything in both cases and everything is in primary memory, I&apos;d imagine that the delay we&apos;re seeing is the cost of that extra logic and indirection.&lt;/p&gt;</comment>
                            <comment id="61698" author="remonvv" created="Thu, 20 Oct 2011 21:22:26 +0000"  >&lt;p&gt;I&apos;m trying to think of why a b-tree walk would be slower in any scenario and I can&apos;t really come up with any. Oh well, I suppose it works as intended. Thanks!&lt;/p&gt;</comment>
                            <comment id="61628" author="brandon" created="Thu, 20 Oct 2011 15:52:02 +0000"  >&lt;p&gt;I&apos;m going to close this ticket for now. Please reach out again if you have any additional questions!&lt;/p&gt;</comment>
                            <comment id="61515" author="brandon" created="Wed, 19 Oct 2011 21:42:39 +0000"  >&lt;p&gt;After running a bunch of tests and poring over the Distinct implementation, I think what we&apos;re seeing is the overhead of iterating through a Btree versus the datafile itself. Since the data isn&apos;t too large, both versions should pull directly from primary memory. Further, since we&apos;re using a covered index, we don&apos;t have to hit the datafile in the indexed case (which would make indexed distinct invocations even worse).&lt;/p&gt;

&lt;p&gt;I&apos;ve run the test with larger objects and with imbalanced data and the difference between the indexed attempt and the non-indexed attempt begins to decrease and reverse.&lt;/p&gt;

&lt;p&gt;To get a bit more insight, you can run the database command directly: db.runCommand({ distinct: &quot;collName&quot;, key: &quot;a&quot; }). Both versions hit all data (though the indexed version doesn&apos;t need to touch any objects).&lt;/p&gt;</comment>
                            <comment id="61350" author="remonvv" created="Wed, 19 Oct 2011 08:51:59 +0000"  >&lt;p&gt;Great. Let me know if I can help.&lt;/p&gt;</comment>
                            <comment id="61257" author="brandon" created="Tue, 18 Oct 2011 20:52:13 +0000"  >&lt;p&gt;Hi Remon,&lt;/p&gt;

&lt;p&gt;I attempted to reproduce the issue on my own workstation and came up with a similar result:&lt;/p&gt;

&lt;p&gt;&amp;gt; d1 = new Date(); db.large.distinct(&quot;a&quot;); print(new Date()-d1);&lt;br/&gt;
2111&lt;br/&gt;
&amp;gt; d1 = new Date(); db.large.distinct(&quot;a&quot;); print(new Date()-d1);&lt;br/&gt;
2157&lt;br/&gt;
&amp;gt; db.large.ensureIndex({a:1})&lt;br/&gt;
&amp;gt; d1 = new Date(); db.large.distinct(&quot;a&quot;); print(new Date()-d1);&lt;br/&gt;
3331&lt;br/&gt;
&amp;gt; d1 = new Date(); db.large.distinct(&quot;a&quot;); print(new Date()-d1);&lt;br/&gt;
3316&lt;br/&gt;
&amp;gt; &lt;/p&gt;

&lt;p&gt;While the slowdown was less pronounced, it&apos;s still there.&lt;/p&gt;

&lt;p&gt;I&apos;m going to take a look under the hood to see what I can find. I&apos;ll report back soon.&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Tue, 18 Oct 2011 17:16:02 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        12 years, 17 weeks, 1 day ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>ramon.fernandez@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            12 years, 17 weeks, 1 day ago
                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_10000" key="com.atlassian.jira.plugin.system.customfieldtypes:radiobuttons">
                        <customfieldname>Old_Backport</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10000"><![CDATA[No]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10032" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Operating System</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10026"><![CDATA[ALL]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>brandon</customfieldvalue>
            <customfieldvalue>remonvv</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hrontr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hriqon:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>22976</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_23361" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Requested By</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|ht0bvb:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>