<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 05:09:40 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-45769] FSM workloads that run commands and expect them to fail cause infinite retry loops</title>
                <link>https://jira.mongodb.org/browse/SERVER-45769</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-45767&quot; title=&quot;Blacklist create_database.js from concurrency_replication_multi_stmt_txn&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-45767&quot;&gt;&lt;del&gt;SERVER-45767&lt;/del&gt;&lt;/a&gt; is an example of a situation where a workload can lead to an infinite transaction retry loop inside suites that use &lt;a href=&quot;https://github.com/mongodb/mongo/blob/a7aecc1ff0af7822c38b5f5da2bc0fd27e3f7778/jstests/concurrency/fsm_workload_helpers/auto_retry_transaction.js#L107&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;&lt;tt&gt;withTxnAndAutoRetry&lt;/tt&gt;&lt;/a&gt;. In order to avoid similar mysterious test time out scenarios in the future, this &lt;a href=&quot;https://github.com/mongodb/mongo/blob/a7aecc1ff0af7822c38b5f5da2bc0fd27e3f7778/jstests/concurrency/fsm_workload_helpers/auto_retry_transaction.js#L107&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;retry loop&lt;/a&gt; should only execute for a finite number of retries. &lt;/p&gt;</description>
                <environment></environment>
        <key id="1114302">SERVER-45769</key>
            <summary>FSM workloads that run commands and expect them to fail cause infinite retry loops</summary>
                <type id="1" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14703&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="13201">Fixed</resolution>
                                        <assignee username="ali.mir@mongodb.com">Ali Mir</assignee>
                                    <reporter username="maria.vankeulen@mongodb.com">Maria van Keulen</reporter>
                        <labels>
                            <label>former-quick-wins</label>
                            <label>undo-candidate</label>
                    </labels>
                <created>Fri, 24 Jan 2020 22:20:59 +0000</created>
                <updated>Sun, 29 Oct 2023 22:12:54 +0000</updated>
                            <resolved>Tue, 3 Nov 2020 18:21:09 +0000</resolved>
                                                    <fixVersion>4.9.0</fixVersion>
                                    <component>Testing Infrastructure</component>
                                        <votes>0</votes>
                                    <watches>8</watches>
                                                                                                                <comments>
                            <comment id="3474220" author="xgen-internal-githook" created="Tue, 3 Nov 2020 18:12:26 +0000"  >&lt;p&gt;Author:&lt;/p&gt;
{&apos;name&apos;: &apos;Ali Mir&apos;, &apos;email&apos;: &apos;ali.mir@mongodb.com&apos;, &apos;username&apos;: &apos;ali-mir&apos;}
&lt;p&gt;Message: &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-45769&quot; title=&quot;FSM workloads that run commands and expect them to fail cause infinite retry loops&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-45769&quot;&gt;&lt;del&gt;SERVER-45769&lt;/del&gt;&lt;/a&gt; Add additional logging about iterations in auto_retry_transaction.js&lt;br/&gt;
Branch: master&lt;br/&gt;
&lt;a href=&quot;https://github.com/mongodb/mongo/commit/4308e341038a3a0fbdbe7a278c38b395ceb83936&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/mongodb/mongo/commit/4308e341038a3a0fbdbe7a278c38b395ceb83936&lt;/a&gt;&lt;/p&gt;</comment>
                            <comment id="3035463" author="siyuan.zhou@10gen.com" created="Fri, 10 Apr 2020 19:41:12 +0000"  >&lt;p&gt;The original issue was that some tests expect the commands to fail, so they cannot run in the passthrough transactions; otherwise they will retry infinitely. One solution is to blacklist the test, as in &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-45767&quot; title=&quot;Blacklist create_database.js from concurrency_replication_multi_stmt_txn&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-45767&quot;&gt;&lt;del&gt;SERVER-45767&lt;/del&gt;&lt;/a&gt;, since it only gives us extra test coverage on failed commands. Another, more complex solution is to express that the command is expected to fail, so that it runs in its own transaction and is allowed to fail.&lt;/p&gt;

&lt;p&gt;I think it&apos;s fine to just blacklist them from transaction passthrough tests. Instead, we need to improve the debugging experience when this scenario happens. Since the system is busy looping, the core dump isn&apos;t helpful and could be misleading. The JS stacktrace (if available) isn&apos;t useful either. We can probably print out logs when a statement or a transaction is retried many times to help debug this kind of issue.&lt;/p&gt;</comment>
                            <comment id="2766871" author="ryan.timmons" created="Mon, 27 Jan 2020 22:25:01 +0000"  >&lt;p&gt;We had an idea during triage to simply print a message after the &lt;tt&gt;do/while&lt;/tt&gt; had iterated more than N times. This would help to gather data about how many iterations we actually want and the path to using N as an exit-condition (whether using &lt;tt&gt;do/while&lt;/tt&gt; or &lt;tt&gt;assert.soon&lt;/tt&gt;) becomes more obvious. Adding this print logic would hopefully be easy to do as part of any other work.&lt;/p&gt;</comment>
                            <comment id="2766387" author="maria.vankeulen" created="Mon, 27 Jan 2020 19:28:27 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=judah.schvimer&quot; class=&quot;user-hover&quot; rel=&quot;judah.schvimer&quot;&gt;judah.schvimer&lt;/a&gt; As we introduce more features into multi-document transactions, suites like &lt;tt&gt;concurrency_replication_multi_stmt_txn&lt;/tt&gt; which are primarily suited to idempotent operations will continue to uncover false-positive situations like the BF that resulted in &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-45767&quot; title=&quot;Blacklist create_database.js from concurrency_replication_multi_stmt_txn&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-45767&quot;&gt;&lt;del&gt;SERVER-45767&lt;/del&gt;&lt;/a&gt;. I believe it is important to better future-proof these suites somehow.&lt;br/&gt;
I maintain that it is less desirable to have false-positive time outs due to the way this suite is run than to have false-positive BFs that can at least be tied to the retries in this suite.&lt;br/&gt;
Whichever implementation path we decide on, I believe it is important to take into consideration and communicate to developers how many retries occur as a result of this retry loop.&lt;/p&gt;</comment>
                            <comment id="2766340" author="judah.schvimer" created="Mon, 27 Jan 2020 19:16:57 +0000"  >&lt;p&gt;If there is no &quot;safe&quot; number, I would advocate for an assert.soon timeout of 10 minutes like we do elsewhere so the test doesn&apos;t time out, but we still have an unbounded number of attempts. Any extra BFs to me seem more costly than BFs that are harder to diagnose.&lt;/p&gt;</comment>
                            <comment id="2766328" author="maria.vankeulen" created="Mon, 27 Jan 2020 19:12:40 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=judah.schvimer&quot; class=&quot;user-hover&quot; rel=&quot;judah.schvimer&quot;&gt;judah.schvimer&lt;/a&gt; I think the exact number will need some trial and error, depending on how many times in practice the loop generally needs to run to accommodate all of the TransientTransactionErrors.&lt;/p&gt;

&lt;p&gt;FWIW, I would argue that the BFs that would arise out of this change would be much quicker to diagnose than the potential seemingly malignant (i.e., time out) BFs that could occur with the existing structure.&lt;/p&gt;</comment>
                            <comment id="2766306" author="judah.schvimer" created="Mon, 27 Jan 2020 19:04:32 +0000"  >&lt;p&gt;After how many failed attempts can we guarantee there is a bug? I would only want to lower the number of attempts if it wouldn&apos;t lead to unnecessary BFs.&lt;/p&gt;</comment>
                            <comment id="2766072" author="maria.vankeulen" created="Mon, 27 Jan 2020 17:47:54 +0000"  >&lt;p&gt;Got it. In that case, perhaps we can have both the finite retry loop and the &lt;tt&gt;assert.soon&lt;/tt&gt;. The suite itself relies upon &lt;a href=&quot;https://github.com/mongodb/mongo/blob/666877a5da9a6b4c532df6c0c087bcf45123eed0/jstests/concurrency/fsm_workload_helpers/auto_retry_transaction.js#L165&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;a retry loop&lt;/a&gt; in order to handle true transient transaction errors. Perhaps the &lt;tt&gt;assert.soon&lt;/tt&gt; can be added when we actually perform the &lt;a href=&quot;https://github.com/mongodb/mongo/blob/666877a5da9a6b4c532df6c0c087bcf45123eed0/jstests/concurrency/fsm_workload_helpers/auto_retry_transaction.js#L130&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;contents of the transaction&lt;/a&gt;. This way, we reduce the number of cases where we have to wait for the test to time out for the hang analyzer to be called, but we can still distinguish between a test failing due to exceeding a set number of retries versus a test failing due to a true hang.&lt;br/&gt;
My reservation with replacing the entire retry framework with an &lt;tt&gt;assert.soon&lt;/tt&gt; is that it does not distinguish between an infinite number of retries and a single retry that takes an infinite amount of time.&lt;/p&gt;</comment>
                            <comment id="2765852" author="ryan.timmons" created="Mon, 27 Jan 2020 16:37:02 +0000"  >&lt;blockquote&gt;
&lt;p&gt;What is the disadvantage of having the retry loop be finite?&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;There is no real disadvantage, just that it&apos;s &quot;yet another&quot; place that is doing retries.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the event of a true hang, the timeout would trigger regardless of the number of retries.&lt;/p&gt;&lt;/blockquote&gt;

&lt;p&gt;The evergreen hang-analyzer kicks in as a last-resort. Relying on the assert.soon integration lets the failure happen sooner and can provide additional context in the logs since the invocation of assert.soon shows up in log backtraces.&lt;/p&gt;</comment>
                            <comment id="2765586" author="maria.vankeulen" created="Mon, 27 Jan 2020 14:59:39 +0000"  >&lt;p&gt;I don&apos;t think using &lt;tt&gt;assert.soon&lt;/tt&gt; here would address the issue of keeping benign test infrastructure-related timeouts separate from genuine timeouts. What is the disadvantage of having the retry loop be finite? In the event of a true hang, the timeout would trigger regardless of the number of retries.&lt;/p&gt;</comment>
                            <comment id="2765568" author="ryan.timmons" created="Mon, 27 Jan 2020 14:47:45 +0000"  >&lt;p&gt;If it used &lt;tt&gt;assert.soon&lt;/tt&gt; it would automatically trigger the hang-analyzer.&lt;/p&gt;

&lt;p&gt;That is in fact my recommendation here - change the &lt;tt&gt;do/while&lt;/tt&gt; to use &lt;tt&gt;assert.soon&lt;/tt&gt;. I&apos;m not 100% sure if the hang-analyzer output would be all that useful to be honest, but consolidating flow-control through a finite number of helpers (such as &lt;tt&gt;assert.soon&lt;/tt&gt;) lets us add more post-test logic in one place.&lt;/p&gt;

&lt;p&gt;Evergreen should also run the hang-analyzer after it exceeds the task timeout. Looks like that&apos;s indeed what happened in the underlying BF (BF-15948). imho that seems like a &quot;last-resort&quot;; doing things explicitly inside flow-control lets us be more careful about where/when/why the hang-analyzer/* and friends are called.&lt;/p&gt;</comment>
                            <comment id="2765557" author="maria.vankeulen" created="Mon, 27 Jan 2020 14:42:34 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=judah.schvimer&quot; class=&quot;user-hover&quot; rel=&quot;judah.schvimer&quot;&gt;judah.schvimer&lt;/a&gt; FWIW, the hang analyzer output was not particularly helpful in this case, since the time out was due to testing infrastructure rather than something in the code going wrong. It would be more useful to have something that this suite specifically outputs, such as a log message stating that the maximum number of retries for this suite was exceeded, in future cases like these. &lt;br/&gt;
Time outs are inherently some of the most difficult cases to debug, so separating out this comparatively benign case would be helpful.&lt;/p&gt;</comment>
                            <comment id="2765487" author="judah.schvimer" created="Mon, 27 Jan 2020 14:16:56 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=maria.vankeulen&quot; class=&quot;user-hover&quot; rel=&quot;maria.vankeulen&quot;&gt;maria.vankeulen&lt;/a&gt;, was the test timeout helpful for getting hang analyzer output? &lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=ryan.timmons&quot; class=&quot;user-hover&quot; rel=&quot;ryan.timmons&quot;&gt;ryan.timmons&lt;/a&gt;, would this be able to get hang analyzer output on failure if it didn&apos;t time out the task?&lt;/p&gt;</comment>
                    </comments>
                <issuelinks>
                            <issuelinktype id="10012">
                    <name>Related</name>
                                                                <inwardlinks description="is related to">
                                        <issuelink>
            <issuekey id="1114201">SERVER-45767</issuekey>
        </issuelink>
                            </inwardlinks>
                                    </issuelinktype>
                    </issuelinks>
                <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18555" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname># of Sprints</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>3.0</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10011" key="com.atlassian.jira.plugin.system.customfieldtypes:radiobuttons">
                        <customfieldname>Backwards Compatibility</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10038"><![CDATA[Fully Compatible]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                            <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Mon, 27 Jan 2020 14:16:56 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        3 years, 14 weeks, 1 day ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                        <customfield id="customfield_17050" key="com.atlassian.jira.plugin.system.customfieldtypes:radiobuttons">
                        <customfieldname>Downstream Team Attention</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="16941"><![CDATA[Not Needed]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10857" key="com.pyxis.greenhopper.jira:gh-epic-link">
                        <customfieldname>Epic Link</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>PM-1816</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>luke.bonanomi@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            3 years, 14 weeks, 1 day ago
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                    <customfield id="customfield_10032" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Operating System</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10026"><![CDATA[ALL]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>ali.mir@mongodb.com</customfieldvalue>
            <customfieldvalue>xgen-internal-githook</customfieldvalue>
            <customfieldvalue>judah.schvimer@mongodb.com</customfieldvalue>
            <customfieldvalue>maria.vankeulen@mongodb.com</customfieldvalue>
            <customfieldvalue>ryan.timmons@mongodb.com</customfieldvalue>
            <customfieldvalue>siyuan.zhou@mongodb.com</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                        <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hwl3w7:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hxw6af:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                <customfield id="customfield_10557" key="com.pyxis.greenhopper.jira:gh-sprint">
                        <customfieldname>Sprint</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue id="4311">Repl 2020-10-19</customfieldvalue>
    <customfieldvalue id="4312">Repl 2020-11-02</customfieldvalue>
    <customfieldvalue id="4372">Repl 2020-11-16</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                                                                                                        <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                                                                                                                                                                                                                                                                                                                                                                                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hwkq5j:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>