<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 04:44:19 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-36871] $sample can loop infinitely on orphaned data</title>
                <link>https://jira.mongodb.org/browse/SERVER-36871</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;The following scenario (at least) can cause an infinite loop in $sample:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;A moveChunk command starts moving a chunk from shard0 to shard1&lt;/li&gt;
	&lt;li&gt;$sample begins and a getMore targets shard1&lt;/li&gt;
	&lt;li&gt;That getMore uses a cursor that samples randomly with replacement from WiredTiger&lt;/li&gt;
	&lt;li&gt;That cursor fetches only one document because the sample size is one, and that document does not belong to that shard because it is in the chunk that is being moved from shard0 to shard1, which shard1 does not yet own. The ShardFilterStage then filters out that document and returns NEEDS_TIME&lt;/li&gt;
	&lt;li&gt;The yielding policy is NO_YIELD for some reason, so no yielding happens; the stage samples the cursor again but gets the same document back, and this repeats indefinitely. The yielding behavior may not affect the infinite loop in this case, but it is still unexpected, so I&apos;m including it here.&lt;/li&gt;
&lt;/ol&gt;
</description>
                <environment></environment>
        <key id="594247">SERVER-36871</key>
            <summary>$sample can loop infinitely on orphaned data</summary>
                <type id="1" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14703&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="13201">Fixed</resolution>
                                        <assignee username="bernard.gorman@mongodb.com">Bernard Gorman</assignee>
                                    <reporter username="matthew.saltz@mongodb.com">Matthew Saltz</reporter>
                        <labels>
                    </labels>
                <created>Fri, 24 Aug 2018 21:54:14 +0000</created>
                <updated>Sun, 29 Oct 2023 22:28:38 +0000</updated>
                            <resolved>Tue, 11 Dec 2018 17:56:01 +0000</resolved>
                                                    <fixVersion>4.1.7</fixVersion>
                                    <component>Aggregation Framework</component>
                                        <votes>0</votes>
                                    <watches>13</watches>
                                                                                                                <comments>
                            <comment id="2087261" author="xgen-internal-githook" created="Tue, 11 Dec 2018 17:54:35 +0000"  >&lt;p&gt;Author:&lt;/p&gt;
{&apos;name&apos;: &apos;Bernard Gorman&apos;, &apos;email&apos;: &apos;bernard.gorman@gmail.com&apos;, &apos;username&apos;: &apos;gormanb&apos;}
&lt;p&gt;Message: &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-36871&quot; title=&quot;$sample can loop infinitely on orphaned data&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-36871&quot;&gt;&lt;del&gt;SERVER-36871&lt;/del&gt;&lt;/a&gt; $sample can loop infinitely on orphaned data&lt;br/&gt;
Branch: master&lt;br/&gt;
&lt;a href=&quot;https://github.com/mongodb/mongo/commit/afe80e6c70b5658f717a268f698c305c098fbc92&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/mongodb/mongo/commit/afe80e6c70b5658f717a268f698c305c098fbc92&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="2064952" author="bernard.gorman" created="Fri, 16 Nov 2018 18:42:45 +0000"  >&lt;p&gt;Hi &lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=stuart.hall%40masternaut.com&quot; class=&quot;user-hover&quot; rel=&quot;stuart.hall@masternaut.com&quot;&gt;stuart.hall@masternaut.com&lt;/a&gt; - thank you for providing this additional context. Below, I&apos;ve outlined our understanding of this bug following some further investigation.&lt;/p&gt;

&lt;p&gt;To briefly describe the core of the problem: we have an optimization in place for the &lt;tt&gt;$sample&lt;/tt&gt; aggregation stage which, under certain circumstances, allows it to replace the &lt;a href=&quot;https://github.com/mongodb/mongo/blob/1da5a8ac8ea43e1f704384238765fa5ca5b11af6/src/mongo/db/pipeline/document_source_sample.cpp&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;default &lt;tt&gt;$sample&lt;/tt&gt; implementation&lt;/a&gt; with an alternative that &lt;a href=&quot;https://github.com/mongodb/mongo/blob/1da5a8ac8ea43e1f704384238765fa5ca5b11af6/src/mongo/db/pipeline/document_source_sample_from_random_cursor.cpp&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;uses a random-read cursor to pull back results&lt;/a&gt;. Specifically, the optimized stage will be used if the requested &lt;tt&gt;$sample&lt;/tt&gt; size is 5% or less of the total number of documents in the collection &lt;b&gt;and&lt;/b&gt; that total is greater than 100 documents. The problem stems from the fact that, because we cannot tell &lt;em&gt;a priori&lt;/em&gt; how many orphans are present on the shard, this decision is based on a document count that includes &lt;b&gt;all&lt;/b&gt; documents in the collection, whether they are orphans or not. As with any query, we counteract this by adding a &lt;a href=&quot;https://github.com/mongodb/mongo/blob/1da5a8ac8ea43e1f704384238765fa5ca5b11af6/src/mongo/db/exec/shard_filter.cpp&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;&lt;tt&gt;ShardingFilter&lt;/tt&gt; stage&lt;/a&gt; to the query plan, which will filter out any orphans that we may encounter.&lt;/p&gt;

&lt;p&gt;The problem arises in cases where the ratio of orphans to legitimate documents is very high or, in the worst case, where there are &lt;b&gt;only&lt;/b&gt; orphans on the shard &lt;em&gt;but enough of them to trigger the optimized &lt;tt&gt;$sample&lt;/tt&gt; stage&lt;/em&gt;. Because the randomized cursor will continue returning documents indefinitely, including duplicates, for as long as more documents are requested, we have logic in the &lt;tt&gt;DocumentSourceSampleFromRandomCursor&lt;/tt&gt; class to track the &lt;tt&gt;ids&lt;/tt&gt; of returned documents and &lt;a href=&quot;https://github.com/mongodb/mongo/blob/9f2d9ce70ecf475386ead7374bf749e0f231c294/src/mongo/db/pipeline/document_source_sample_from_random_cursor.cpp#L141-L145&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;to abort if we see more than 100 duplicates before completing the &lt;tt&gt;$sample&lt;/tt&gt;&lt;/a&gt;. However, the &lt;tt&gt;ShardingFilter&lt;/tt&gt; stage of the underlying query plan has no such logic; it assumes that the input stream is always finite and deduplicated, and that no matter how many orphans it has to filter, it will eventually hit EOF. In the scenario discussed here, where there are &lt;em&gt;&lt;b&gt;only&lt;/b&gt;&lt;/em&gt;&#160;orphans on the shard, every document in the unbounded stream returned by the random cursor is an orphan that the &lt;tt&gt;ShardingFilter&lt;/tt&gt; discards. As a result, control never returns from the query system, and &lt;tt&gt;DSSampleFromRandomCursor&lt;/tt&gt; never gets the opportunity to either complete or terminate the aggregation. If there &lt;b&gt;are&lt;/b&gt; enough legitimate documents on the shard to satisfy the &lt;tt&gt;$sample&lt;/tt&gt; but they are dwarfed by the number of orphans, then the &lt;tt&gt;$sample&lt;/tt&gt; will eventually complete, but it may take a prohibitively long time.&lt;/p&gt;

&lt;p&gt;What made this particularly pathological in your case was the associated &lt;tt&gt;NO_YIELD&lt;/tt&gt; bug described in &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-37750&quot; title=&quot;Optimized $sample stage does not yield&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-37750&quot;&gt;&lt;del&gt;SERVER-37750&lt;/del&gt;&lt;/a&gt;. While it does not cause the problem described above, it meant that the &lt;tt&gt;$sample&lt;/tt&gt; operation could never yield, and consequently could not be halted. Fixing &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-37750&quot; title=&quot;Optimized $sample stage does not yield&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-37750&quot;&gt;&lt;del&gt;SERVER-37750&lt;/del&gt;&lt;/a&gt; ensures that, even in the event that&#160;&lt;tt&gt;$sample&lt;/tt&gt;&#160;hits the infinite-loop scenario outlined above, it can be &lt;tt&gt;killOp&apos;d&lt;/tt&gt; &lt;b&gt;without&lt;/b&gt; needing to &lt;tt&gt;kill -9&lt;/tt&gt; the &lt;tt&gt;mongod&lt;/tt&gt; itself, as was necessary in your case.&lt;/p&gt;

&lt;p&gt;As you can see from &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-37750&quot; title=&quot;Optimized $sample stage does not yield&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-37750&quot;&gt;&lt;del&gt;SERVER-37750&lt;/del&gt;&lt;/a&gt;, we recently fixed the latter issue, and we are currently evaluating it for backport to earlier branches. We also have a proposed solution to the bug described in this ticket, and will similarly consider backporting it once it has been fixed in master.&lt;/p&gt;

&lt;p&gt;I hope this helps to clarify the behaviour you observed, and the actions we are taking to address it. Thank you for bringing it to our attention!&lt;/p&gt;

&lt;p&gt;Regards,&lt;br/&gt;
 Bernard&lt;/p&gt;</comment>
                            <comment id="2041570" author="stuart.hall@masternaut.com" created="Thu, 25 Oct 2018 08:47:20 +0000"  >&lt;blockquote&gt;&lt;p&gt;Assigning to Charlie to figure out what changed in sharding and assess the likelihood of this happening in a real world scenario.&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;To add information to this, we have now experienced what we believe is this problem in our production environment when running $sample while there are orphaned documents on a particular shard. In our case, the code executed was as simple as can be:&lt;/p&gt;
&lt;pre&gt;db.collection.aggregate( [ { $sample : { size : 1 } } ] );&lt;/pre&gt;
&lt;p&gt;In our case, this query partially froze the mongod process with at least one core running 100% and all write queries blocked. We had to kill -9 the offending node to restore the shard to operation. We know that this particular shard has some orphaned data as we have evacuated all data-bearing chunks from this shard, but still have a number of documents remaining in the local collection in shardKey ranges that overlapped with chunks that should have been on different shards.&lt;/p&gt;

&lt;p&gt;Because this issue resulted in a loss of service to our production cluster, we have raised a support ticket (00526155) but I thought it worth also updating directly here for visibility.&lt;/p&gt;</comment>
                            <comment id="2024918" author="charlie.swanson" created="Fri, 5 Oct 2018 13:27:27 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=greg.mckeon&quot; class=&quot;user-hover&quot; rel=&quot;greg.mckeon&quot;&gt;greg.mckeon&lt;/a&gt; I have not started investigating this yet. I haven&apos;t had time to thus far.&lt;/p&gt;</comment>
                            <comment id="1996376" author="david.storch" created="Fri, 7 Sep 2018 15:17:21 +0000"  >&lt;p&gt;Assigning to Charlie to figure out what changed in sharding and assess the likelihood of this happening in a real world scenario.&lt;/p&gt;</comment>
                            <comment id="1987447" author="david.storch" created="Tue, 28 Aug 2018 13:05:51 +0000"  >&lt;p&gt;This work should include re-enabling the &lt;tt&gt;$sample&lt;/tt&gt; tests in testshard1.js.&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10420">
                    <name>Backports</name>
                                            <outwardlinks description="backported by">
                                                        </outwardlinks>
                                                        </issuelinktype>
                            <issuelinktype id="10012">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="594248">SERVER-36872</issuekey>
        </issuelink>
            <issuelink>
            <issuekey id="624237">SERVER-37750</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>6.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18555" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname># of Sprints</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>4.0</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_12450" key="com.atlassian.jira.plugin.system.customfieldtypes:multicheckboxes">
                        <customfieldname>Backport Requested</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="15640"><![CDATA[v4.0]]></customfieldvalue>
    <customfieldvalue key="15141"><![CDATA[v3.6]]></customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10011" key="com.atlassian.jira.plugin.system.customfieldtypes:radiobuttons">
                        <customfieldname>Backwards Compatibility</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10038"><![CDATA[Fully Compatible]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_13552" key="com.go2group.jira.plugin.crm:crm_generic_field">
                        <customfieldname>Case</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[[500A000000cQLAGIA4]]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Mon, 27 Aug 2018 14:09:29 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        5 years, 9 weeks, 1 day ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>luke.bonanomi@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            5 years, 9 weeks, 1 day ago
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>bernard.gorman@mongodb.com</customfieldvalue>
            <customfieldvalue>charlie.swanson@mongodb.com</customfieldvalue>
            <customfieldvalue>david.storch@mongodb.com</customfieldvalue>
            <customfieldvalue>xgen-internal-githook</customfieldvalue>
            <customfieldvalue>matthew.saltz@mongodb.com</customfieldvalue>
            <customfieldvalue>stuart.hall@masternaut.com</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hu6cn3:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|htxobr:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_23361" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Requested By</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_10557" key="com.pyxis.greenhopper.jira:gh-sprint">
                        <customfieldname>Sprint</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue id="2487">Query 2018-10-08</customfieldvalue>
    <customfieldvalue id="2535">Query 2018-11-19</customfieldvalue>
    <customfieldvalue id="2536">Query 2018-12-03</customfieldvalue>
    <customfieldvalue id="2572">Query 2018-12-17</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hu5ywf:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>