<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 08:59:39 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>
    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[JAVA-3457] Gracefully handle mongos nodes exiting via mongodb+srv:// </title>
                <link>https://jira.mongodb.org/browse/JAVA-3457</link>
                <project id="10006" key="JAVA">Java Driver</project>
                    <description>&lt;p&gt;We recently set up a shared cluster of MongoS servers in kubernetes via the fairly new mongodb+srv record support (&lt;a href=&quot;https://www.mongodb.com/blog/post/mongodb-3-6-here-to-SRV-you-with-easier-replica-set-connections&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://www.mongodb.com/blog/post/mongodb-3-6-here-to-SRV-you-with-easier-replica-set-connections&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;In Kubernetes, when nodes enter a terminating state, they are removed from the SRV record broadcast, and their DNS resolution will also no longer succeed. In some cases (depending on configuration), they may still be available to handle connections for some amount of time, until the pod has fully terminated.&lt;/p&gt;

&lt;p&gt;The Mongo Java driver currently scans SRV records every 60 seconds, &lt;a href=&quot;https://github.com/mongodb/mongo-java-driver/blob/f0124e36f5d7bbf8442570d1304f73ca6f5b16a1/driver-core/src/main/com/mongodb/internal/connection/DefaultDnsSrvRecordMonitorFactory.java#L28&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;which is hardcoded&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;When a mongos pod enters termination, that leaves an up-to-60-second gap in which, to my understanding, we can hit issues in the Java driver through the following path.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;The mongodb java driver selects a random host from known available hosts - assume it has chosen a recently terminated host&lt;/li&gt;
	&lt;li&gt;If the connection pool needs to spawn a new connection, the driver does a DNS lookup on the host. &lt;a href=&quot;https://github.com/mongodb/mongo-java-driver/blob/146c465c8be582a51b4763e2a0b8b0b93e8d072d/driver-core/src/main/com/mongodb/ServerAddress.java#L211&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;link&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;The DNS lookup fails for the recently shut down host. This throws an exception which invalidates all active connections to this host (including currently-functioning connections) &lt;a href=&quot;https://github.com/mongodb/mongo-java-driver/blob/3c19b93b111dd315a2ee2892bad4fa213ac4ea39/driver-core/src/main/com/mongodb/internal/connection/DefaultServer.java#L90&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;link&lt;/a&gt;&#160;&lt;/li&gt;
	&lt;li&gt;Until the SrvRecordMonitor refreshes its host pool, all queries have a 1/pool_size chance of failing because server selection is random. Operation retries don&apos;t fully handle the failure, but reduce the chance of query failure to (1/pool_size)^retry_count&lt;/li&gt;
&lt;/ol&gt;
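The failure odds in step 4 can be sketched numerically; a minimal illustration in plain Java (pure arithmetic, not driver code; the class and method names are hypothetical):

```java
// Rough model of step 4 above: during the gap, one of pool_size mongos
// hosts is dead, and each server selection picks a host uniformly at random.
public class SrvGapFailureOdds {
    // Probability that a single server selection lands on the dead mongos.
    static double singleAttempt(int poolSize) {
        return 1.0 / poolSize;
    }

    // Assuming independent random selection per attempt, an operation fails
    // only if every attempt (initial try plus retries) picks the dead host.
    static double withAttempts(int poolSize, int attempts) {
        return Math.pow(1.0 / poolSize, attempts);
    }

    public static void main(String[] args) {
        System.out.println(withAttempts(3, 1)); // no retry:  roughly 0.33
        System.out.println(withAttempts(3, 2)); // one retry: roughly 0.11
    }
}
```

This is only a back-of-the-envelope model; real server selection also weighs latency and connection state.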


&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;There seem to be a couple of potential mechanisms for improving this. When using mongodb+srv, I can imagine blacklisting hosts that have experienced DNS failures until the next refresh, but there are several reasonable options.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;We&apos;d be happy to contribute a patch here if there&apos;s an agreed upon handling strategy for us to pursue.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</description>
                <environment></environment>
        <key id="966471">JAVA-3457</key>
            <summary>Gracefully handle mongos nodes exiting via mongodb+srv:// </summary>
                <type id="2" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14711&amp;avatarType=issuetype">New Feature</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="13202">Works as Designed</resolution>
                                        <assignee username="jeff.yemin@mongodb.com">Jeffrey Yemin</assignee>
                                    <reporter username="bpicolo@squarespace.com">Ben Picolo</reporter>
                        <labels>
                    </labels>
                <created>Thu, 10 Oct 2019 13:23:13 +0000</created>
                <updated>Fri, 27 Oct 2023 13:21:02 +0000</updated>
                            <resolved>Wed, 27 Nov 2019 01:32:02 +0000</resolved>
                                                                    <component>Cluster Management</component>
                                        <votes>0</votes>
                                    <watches>7</watches>
                    <comments>
                            <comment id="2567592" author="jeff.yemin" created="Wed, 27 Nov 2019 01:32:02 +0000"  >&lt;p&gt;Closing this out as I believe we&apos;ve answered all the open questions, and demonstrated how to orchestrate a service such that there are no visible application effects.&lt;/p&gt;

&lt;p&gt;If you have further questions, please post them and we can re-open.&lt;/p&gt;</comment>
                            <comment id="2486512" author="jeff.yemin" created="Thu, 17 Oct 2019 13:27:22 +0000"  >&lt;ul&gt;
	&lt;li&gt;heartbeatFrequency: decreasing this value will allow the server monitors to determine that a server is unavailable faster.&#160; Note though that in an active application, application threads will fail more frequently than this and change the state to unavailable before the server monitor gets around to finding out&lt;/li&gt;
	&lt;li&gt;heartbeatConnectTimeout,&#160;heartbeatSocketTimeout: these control how fast the server monitor will fail in the face of network errors.&#160; More of an issue if your server doesn&apos;t come down cleanly though.&#160; If you bring the mongos process down in an orderly fashion, the server should promptly notify the client that the socket is no good, and the client doesn&apos;t have to wait for timeouts&lt;/li&gt;
	&lt;li&gt;connectTimeout, socketTimeout: similar to above, but applies to operations initiated by your application.&lt;/li&gt;
&lt;/ul&gt;
</comment>
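The knobs listed in the comment above map onto the driver's settings builder; a minimal sketch, assuming the 3.7+/4.x MongoClientSettings API (the host name and timeout values here are placeholders, not recommendations):

```java
import java.util.concurrent.TimeUnit;

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;

// Illustrative tuning of heartbeatFrequency, connectTimeout, and socketTimeout.
public class TunedClientSettings {
    public static MongoClientSettings build() {
        return MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString("mongodb+srv://cluster.example.com/"))
                // heartbeatFrequency: how often server monitors re-check each host
                .applyToServerSettings(b -> b.heartbeatFrequency(10, TimeUnit.SECONDS))
                // connectTimeout / readTimeout apply to application-initiated operations
                .applyToSocketSettings(b -> b.connectTimeout(5, TimeUnit.SECONDS)
                                             .readTimeout(5, TimeUnit.SECONDS))
                .build();
    }
}
```

The heartbeat-specific connect/socket timeouts mentioned above can also be set via connection string options; consult the driver documentation for the exact option names on your driver version.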
                            <comment id="2486461" author="bpicolo@squarespace.com" created="Thu, 17 Oct 2019 12:51:52 +0000"  >&lt;p&gt;Which timeouts and server monitor frequencies are adjustable that help out here?&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;The second part you mention may be the missing piece of the puzzle here, but we&apos;ll have to figure out whether there&apos;s a strategy for us to disallow new connections efficiently. I&apos;ll look into that path, and I appreciate the response on this.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Unfortunately, I don&apos;t believe we get tailored control over the timings for SRV records in kubernetes (that&apos;s a path we were looking into as well).&lt;/p&gt;</comment>
                            <comment id="2486059" author="jeff.yemin" created="Thu, 17 Oct 2019 01:41:09 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=bpicolo%40squarespace.com&quot; class=&quot;user-hover&quot; rel=&quot;bpicolo@squarespace.com&quot;&gt;bpicolo@squarespace.com&lt;/a&gt;, the driver does handle application shutdown. Though there is a window during which one or more application threads may get exceptions, the window is fairly short, and can be controlled by the client through the setting of various timeouts and server monitor frequencies.&lt;/p&gt;

&lt;p&gt;The problem you seem to be having is due to the host being removed from DNS entirely prior to shutting the mongos process down. I can think of a few things you could do to improve your situation:&lt;/p&gt;
&lt;ol&gt;
	&lt;li&gt;Delay DNS removal for 60 seconds after updating the SRV record to exclude the mongos.&#160; If you do that you won&apos;t get any application errors, and the driver will have time to update its list of mongos servers&lt;/li&gt;
	&lt;li&gt;Alternatively, shut down the mongos process before making any DNS changes. The driver will detect that the mongos process has closed its connections, and that mongos will no longer be selected for any operations.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="2483171" author="louis.plissonneau" created="Tue, 15 Oct 2019 16:41:25 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=andrey.belik&quot; class=&quot;user-hover&quot; rel=&quot;andrey.belik&quot;&gt;andrey.belik&lt;/a&gt; if you manually kill/remove the pod, it will spin up a new one almost immediately&lt;/p&gt;

&lt;p&gt;when mongos crashes on the pod, the automation agent will try to restart mongos processes&lt;/p&gt;

&lt;p&gt;the liveness (every 30 seconds) and readiness (every 5 seconds) probes will detect the loss but they have a failure rate (to prevent over-reacting), so it will take 3 minutes minimum for kubernetes to react (we need 6 liveness failures in a row, and it&apos;s longer for readiness probe)&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Thinking about this, it&apos;s about time we revisited the liveness probe&lt;/p&gt;</comment>
                            <comment id="2480174" author="bpicolo@squarespace.com" created="Mon, 14 Oct 2019 13:56:26 +0000"  >&lt;p&gt;@Andrey - worth clarifying, the driver currently handles neither case, as far as I can tell (clean or unclean application shutdown).&lt;/p&gt;</comment>
                            <comment id="2479870" author="andrey.belik" created="Mon, 14 Oct 2019 10:47:23 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=louis.plissonneau&quot; class=&quot;user-hover&quot; rel=&quot;louis.plissonneau&quot;&gt;louis.plissonneau&lt;/a&gt;&#160;please confirm if I am correct here. All mongos is fronted with Service that exposes SRV Records.&lt;/p&gt;

&lt;p&gt;When mongos is terminated, the K8S controller updates DNS pretty much immediately (but it is an eventual-consistency model)&#160;&lt;/p&gt;

&lt;p&gt;When mongos crashes, it will be detected by K8S, which could take longer (a few seconds); it will then be taken out of DNS and a new one provisioned.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="2475687" author="bpicolo@squarespace.com" created="Thu, 10 Oct 2019 14:28:10 +0000"  >&lt;p&gt;I&apos;ll check whether that would be a factor for us - I&apos;m not sure what sort of SLA we have in place. Let me consult some folk in my organization.&lt;/p&gt;</comment>
                            <comment id="2475681" author="jeff.yemin" created="Thu, 10 Oct 2019 14:24:51 +0000"  >&lt;p&gt;&lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=bpicolo%40squarespace.com&quot; class=&quot;user-hover&quot; rel=&quot;bpicolo@squarespace.com&quot;&gt;bpicolo@squarespace.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No problem with opening a ticket directly here, but just be advised that there is no SLA in place when you do it this way.&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;</comment>
                            <comment id="2475658" author="bpicolo@squarespace.com" created="Thu, 10 Oct 2019 14:19:27 +0000"  >&lt;p&gt;I am not - we thought that this board may be the best first point of discussion, but happy to redirect wherever would be best.&lt;/p&gt;</comment>
                            <comment id="2475655" author="jeff.yemin" created="Thu, 10 Oct 2019 14:18:29 +0000"  >&lt;p&gt;It was not a bot &lt;img class=&quot;emoticon&quot; src=&quot;https://jira.mongodb.org/images/icons/emoticons/smile.png&quot; height=&quot;16&quot; width=&quot;16&quot; align=&quot;absmiddle&quot; alt=&quot;&quot; border=&quot;0&quot;/&gt;.&#160;&#160;&lt;/p&gt;

&lt;p&gt;Changed it back to what you intended.&lt;/p&gt;

&lt;p&gt;Are you in contact with our technical support organization on this already by any chance?&lt;/p&gt;</comment>
                            <comment id="2475509" author="bpicolo@squarespace.com" created="Thu, 10 Oct 2019 13:29:51 +0000"  >&lt;p&gt;@jeff.yemin - I see you or a bot version of you tweaked some wording for me (thanks!) Want to note that &quot;shared&quot; was intentional, though. The sharding isn&apos;t new in this case, the shared MongoS fleet is.&lt;/p&gt;</comment>
                            <comment id="2475496" author="bpicolo@squarespace.com" created="Thu, 10 Oct 2019 13:24:41 +0000"  >&lt;p&gt;I don&apos;t appear to have permissions to edit my ticket, but here&apos;s the link I had intended for the DefaultSrvRecordMonitorFactory:&#160;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/mongodb/mongo-java-driver/blob/f0124e36f5d7bbf8442570d1304f73ca6f5b16a1/driver-core/src/main/com/mongodb/internal/connection/DefaultDnsSrvRecordMonitorFactory.java#L28&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;https://github.com/mongodb/mongo-java-driver/blob/f0124e36f5d7bbf8442570d1304f73ca6f5b16a1/driver-core/src/main/com/mongodb/internal/connection/DefaultDnsSrvRecordMonitorFactory.java#L28&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&#160;&lt;/p&gt;

&lt;p&gt;Also worth mentioning - we&apos;re currently using the latest &lt;b&gt;3.x&lt;/b&gt; driver.&lt;/p&gt;</comment>
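The hardcoded rescan that the comment above links to is conceptually a periodic SRV lookup. As a rough illustration of what each poll yields, here is a hypothetical parser for the standard SRV record text form ("priority weight port target"); this is not the driver's actual code:

```java
// Hypothetical sketch: turn one SRV record string, as returned by a DNS
// query for e.g. _mongodb._tcp.cluster.example.com, into a host:port pair.
public class SrvRecordParser {
    static String hostAndPort(String srvRecord) {
        // SRV record data fields: priority, weight, port, target.
        String[] parts = srvRecord.trim().split("\\s+");
        String port = parts[2];
        String target = parts[3];
        // DNS targets are often fully qualified with a trailing dot.
        if (target.endsWith(".")) {
            target = target.substring(0, target.length() - 1);
        }
        return target + ":" + port;
    }
}
```

The driver's monitor would repeat a lookup like this on its rescan interval and rebuild its host list from the targets; a blacklist-until-next-refresh scheme, as proposed in the description, would sit between the poll and that rebuild.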
                    </comments>
                    <attachments>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                    <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|hvlh1z:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                </customfields>
    </item>
</channel>
</rss>