<!-- 
RSS generated by JIRA (9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66) at Thu Feb 08 04:40:10 UTC 2024

It is possible to restrict the fields that are returned in this document by specifying the 'field' parameter in your request.
For example, to request only the issue key and summary append 'field=key&field=summary' to the URL of your request.
-->
<rss version="0.92" >
<channel>
    <title>MongoDB Jira</title>
    <link>https://jira.mongodb.org</link>
    <description>This file is an XML representation of an issue</description>
    <language>en-us</language>    <build-info>
        <version>9.7.1</version>
        <build-number>970001</build-number>
        <build-date>13-04-2023</build-date>
    </build-info>


<item>
            <title>[SERVER-35543] Secondary server got frozen with 100% CPU</title>
                <link>https://jira.mongodb.org/browse/SERVER-35543</link>
                <project id="10000" key="SERVER">Core Server</project>
                    <description>&lt;p&gt;We have a sharded cluster with 8 shards, each of them using two replica sets + arbiter. Today we had a problem on one of the secondary servers: it suddenly started to use 100% CPU and did not respond to any query. It remained in that state until it was restarted.&lt;/p&gt;

&lt;p&gt;I&apos;m attaching a stack trace from &quot;pstack&quot; in case it helps. It seems most threads are waiting for a lock, except for a few that might be holding the locks while consuming all the CPU (this server has 2 CPUs):&#160;threads 70, 73 and&#160;83.&lt;/p&gt;</description>
                <environment>CentOS 7</environment>
        <key id="557893">SERVER-35543</key>
            <summary>Secondary server got frozen with 100% CPU</summary>
                <type id="1" iconUrl="https://jira.mongodb.org/secure/viewavatar?size=xsmall&amp;avatarId=14703&amp;avatarType=issuetype">Bug</type>
                                            <priority id="3" iconUrl="https://jira.mongodb.org/images/icons/priorities/major.svg">Major - P3</priority>
                        <status id="6" iconUrl="https://jira.mongodb.org/images/icons/statuses/closed.png" description="The issue is considered finished, the resolution is correct. Issues which are closed can be reopened.">Closed</status>
                    <statusCategory id="3" key="done" colorName="success"/>
                                    <resolution id="13203">Gone away</resolution>
                                        <assignee username="bruce.lucas@mongodb.com">Bruce Lucas</assignee>
                                    <reporter username="icruz">Isaac Cruz</reporter>
                        <labels>
                    </labels>
                <created>Tue, 12 Jun 2018 09:13:06 +0000</created>
                <updated>Fri, 27 Oct 2023 20:43:23 +0000</updated>
                            <resolved>Fri, 26 Oct 2018 14:25:58 +0000</resolved>
                                    <version>3.6.2</version>
                                                                        <votes>0</votes>
                                    <watches>12</watches>
                                                                                                                <comments>
                            <comment id="2043282" author="thomas.schubert" created="Fri, 26 Oct 2018 14:25:58 +0000"  >&lt;p&gt;Thanks for the update, &lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=icruz&quot; class=&quot;user-hover&quot; rel=&quot;icruz&quot;&gt;icruz&lt;/a&gt;.&lt;/p&gt;</comment>
                            <comment id="2043046" author="icruz" created="Fri, 26 Oct 2018 09:16:27 +0000"  >&lt;p&gt;Hi Kelsey,&lt;/p&gt;

&lt;p&gt;We have upgraded to 4.0.x and have not seen this behavior again. So maybe we can resolve this ticket, and we will reopen it if it happens again.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Isaac&lt;/p&gt;</comment>
                            <comment id="2042476" author="thomas.schubert" created="Thu, 25 Oct 2018 19:20:29 +0000"  >&lt;p&gt;Hi &lt;a href=&quot;https://jira.mongodb.org/secure/ViewProfile.jspa?name=icruz&quot; class=&quot;user-hover&quot; rel=&quot;icruz&quot;&gt;icruz&lt;/a&gt;,&lt;/p&gt;

&lt;p&gt;Is this still an issue for you? If so, I would recommend upgrading to a more recent version of MongoDB 3.6, as there have been a number of fixes in this space since this ticket was originally opened. If after upgrading you&apos;re still encountering this issue, would you please provide the stack traces while the lag is building, as Bruce requested?&lt;/p&gt;

&lt;p&gt;Thank you,&lt;br/&gt;
Kelsey&lt;/p&gt;</comment>
                            <comment id="1920571" author="bruce.lucas@10gen.com" created="Thu, 14 Jun 2018 13:26:43 +0000"  >
&lt;blockquote&gt;&lt;p&gt;Regarding monitoring replication lag on the secondaries for future, is it using db.printSlaveReplicationInfo() / db.printReplicationInfo() or is there an easier way ?&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;Yes, that&apos;s the recommended way of monitoring it; details are described &lt;a href=&quot;https://docs.mongodb.com/manual/tutorial/troubleshoot-replica-sets/#check-the-replication-lag&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;here&lt;/a&gt;. Thanks for your help troubleshooting this problem.&lt;/p&gt;

&lt;p&gt;Bruce&lt;/p&gt;</comment>
                            <comment id="1920194" author="laxmanpv" created="Wed, 13 Jun 2018 22:38:36 +0000"  >&lt;p&gt;Uploaded mongos.log from 10.0.0.4 for that day, along with the output of some other commands that I had captured at the time (especially currentOp on 03a).&lt;br/&gt;
Regarding monitoring replication lag on the secondaries for future, is it using db.printSlaveReplicationInfo() / db.printReplicationInfo() or is there an easier way ? &lt;/p&gt;</comment>
                            <comment id="1919893" author="bruce.lucas@10gen.com" created="Wed, 13 Jun 2018 18:30:25 +0000"  >&lt;p&gt;Thanks for the detailed timeline. Can you upload that mongos log? I&apos;m not sure what that message is and would like to see the context.&lt;/p&gt;

&lt;p&gt;Here&apos;s what we&apos;re seeing in the diagnostic data:&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;From about 22:16 to 22:21, a very long-running transaction held a global intent lock without yielding for about 5 minutes, and this blocked replication, caused the node to lag behind the primary, and would also have prevented any read operations on the secondary from succeeding because of the stalled replication. However there is no slow operation logged that coincides with that long transaction, nor any other trace in the mongod log or diagnostic data that I have been able to find.&lt;/li&gt;
	&lt;li&gt;When this ended at 22:21, as a result of the large lag the server soon encountered severe cache pressure while trying to catch up, and went into high-CPU mode as you noted, due to &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-34938&quot; title=&quot;Secondary slowdown or hang due to content pinned in cache by single oplog batch&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-34938&quot;&gt;&lt;del&gt;SERVER-34938&lt;/del&gt;&lt;/a&gt;. This is a known issue; however, rather than closing this ticket as a duplicate, I will use it to investigate the preceding triggering issue.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;If you are in a position to monitor replication lag on secondaries, then that may give you an indication that such an incident is in progress. If you see lag building, please capture stack traces and db.currentOp if possible while lag is building. The hope is that we can capture this information during the triggering incident mentioned in the first bullet point above, before it enters high-CPU mode.&lt;/p&gt;

&lt;p&gt;In any case if there is a recurrence of this issue, whether on a primary or a secondary, please upload diagnostic.data and mongod log for our analysis.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Bruce&lt;/p&gt;</comment>
                            <comment id="1919236" author="laxmanpv" created="Wed, 13 Jun 2018 06:03:31 +0000"  >&lt;p&gt;On 10.0.0.4 and a few other servers, we have mongos running.&lt;br/&gt;
Around that timeframe in the mongos.log on 10.0.0.4, I only see several of these messages, from 22:23 to 22:33:&lt;br/&gt;
&quot;mongos collstats doesn&apos;t know about: ...&quot;&lt;/p&gt;</comment>
                            <comment id="1919230" author="laxmanpv" created="Wed, 13 Jun 2018 05:44:08 +0000"  >&lt;p&gt;Hi Bruce - Here&apos;s some additional information on the sequence of steps&lt;/p&gt;

&lt;p&gt;Note all times UTC&lt;/p&gt;

&lt;ol&gt;
	&lt;li&gt;03b was secondary at that time&lt;/li&gt;
	&lt;li&gt;CPU on 03b was 96%+ from 22:23:00 to 23:17:00&lt;/li&gt;
	&lt;li&gt;pstack on 03b was captured at Jun 11 23:13&lt;/li&gt;
	&lt;li&gt;restarted mongod on 03b a few seconds after capturing pstack&lt;/li&gt;
	&lt;li&gt;03a, which was primary, also had high CPU (75%+) from 22:23 until June 12th 02:28&lt;/li&gt;
	&lt;li&gt;Tried to run rs.stepDown() on 03a after the mongod restart was complete on 03b, but it failed with this error: &quot;No electable secondaries caught up&quot;&lt;/li&gt;
	&lt;li&gt;Captured pstack on 03a at 23:48&lt;/li&gt;
&lt;/ol&gt;
</comment>
                            <comment id="1918719" author="bruce.lucas@10gen.com" created="Tue, 12 Jun 2018 18:36:42 +0000"  >&lt;p&gt;Hi Isaac,&lt;/p&gt;

&lt;p&gt;Thanks, it would certainly be helpful to understand whether there was anything unusual happening from an application perspective around that time.&lt;/p&gt;

&lt;p&gt;One of the possible factors contributing to the incident was a very long-running storage-engine transaction, from about&#160;22:15:51 to about 22:21:39, but I have not been able to identify its cause (it may or may not have been related to specific application activity). Coinciding with the start of that transaction was a connection from 10.0.0.4; what mongod or mongos processes are running on that machine? If you have the log files for those mongod or mongos processes covering the incident, we may be able to get more information about the cause of the long-running transaction.&lt;/p&gt;

&lt;p&gt;Regarding the time when the stack traces were captured, perhaps if you have the original pstack-03b.txt file still on the machine where you collected the stack traces the file creation date would tell us when they were collected? As I mentioned this would be very helpful in understanding the incident as we see a couple of different distinct phases to the incident in the diagnostic.data that you uploaded, and without the timestamp I&apos;m not sure which phase the stack traces correspond to.&lt;/p&gt;

&lt;p&gt;Beyond that, if the incident occurs again, can you please upload diagnostic.data and logs and, if possible, stack traces with timestamps? This will help us identify common elements across the incidents.&lt;/p&gt;

&lt;p&gt;Also, is there any possibility to upgrade to the most recent version of 3.6 in order to pick up the fix to &lt;a href=&quot;https://jira.mongodb.org/browse/SERVER-32876&quot; title=&quot;Don&amp;#39;t stall ftdc due to WT cache full&quot; class=&quot;issue-link&quot; data-issue-key=&quot;SERVER-32876&quot;&gt;&lt;del&gt;SERVER-32876&lt;/del&gt;&lt;/a&gt;? In the data you uploaded the diagnostic information was not collected during a substantial portion of the incident due to that issue, so upgrading would give us more complete information.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
 Bruce&lt;/p&gt;</comment>
                            <comment id="1918583" author="icruz" created="Tue, 12 Jun 2018 17:20:51 +0000"  >&lt;p&gt;Hi Bruce,&lt;br/&gt;
basically, our application continuously inserts data into a single collection (that was the main load at that time), and we create a new collection every day. Sometimes we also read all data for one lineId from the previous day&apos;s collection (that read operation is done on secondaries).&lt;br/&gt;
I don&apos;t think there was anything special happening at that time, but I will look into it more thoroughly and will update the ticket if I find anything more.&lt;/p&gt;</comment>
                            <comment id="1918568" author="bruce.lucas@10gen.com" created="Tue, 12 Jun 2018 17:07:03 +0000"  >&lt;p&gt;Thanks Isaac. What information do you have about exactly when the stack traces were captured, or a time range? There were a couple of different phases to the incident, and it would help to get a complete picture of the incident if we had information about when the stack traces were collected.&lt;/p&gt;</comment>
                            <comment id="1918344" author="icruz" created="Tue, 12 Jun 2018 14:39:32 +0000"  >&lt;p&gt;Hi Bruce,&lt;/p&gt;

&lt;p&gt;I have uploaded both files. The incident happened on Jun 11, between 22:20 and 23:18, when the server was restarted (all times UTC). Unfortunately, data from previous incidents has already been removed.&lt;/p&gt;

&lt;p&gt;Please let me know if you need anything else.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
Isaac&lt;/p&gt;

</comment>
                            <comment id="1918228" author="bruce.lucas@10gen.com" created="Tue, 12 Jun 2018 13:09:12 +0000"  >&lt;p&gt;Hi Isaac,&lt;/p&gt;

&lt;p&gt;Can you please&lt;/p&gt;
&lt;ul&gt;
	&lt;li&gt;archive and upload $dbpath/diagnostic.data from the node where the most recent incident occurred? You can upload it to &lt;a href=&quot;https://10gen-httpsupload.s3.amazonaws.com/upload_forms/6815db62-2794-44ff-a5ee-2158fdbf6074.html&quot; class=&quot;external-link&quot; target=&quot;_blank&quot; rel=&quot;nofollow noopener&quot;&gt;this private secure portal&lt;/a&gt;&lt;/li&gt;
	&lt;li&gt;upload mongod log files covering the incident&lt;/li&gt;
	&lt;li&gt;provide a specific timeline for the incident (including timezone) so we can find the relevant data&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;If diagnostic.data for the earlier incidents is still available, can you please do the same for those incidents? You can check whether it is still available by looking at the files in diagnostic.data; the name of each file reflects the beginning of the time range covered by that file.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;br/&gt;
 Bruce&lt;/p&gt;</comment>
                    </comments>
                    <attachments>
                            <attachment id="189387" name="incident.png" size="477771" author="bruce.lucas@mongodb.com" created="Wed, 13 Jun 2018 18:40:53 +0000"/>
                            <attachment id="189322" name="pstack-03a.txt" size="531206" author="laxmanpv" created="Wed, 13 Jun 2018 05:47:02 +0000"/>
                            <attachment id="189202" name="pstack-03b.txt" size="260676" author="icruz" created="Tue, 12 Jun 2018 09:10:35 +0000"/>
                    </attachments>
                <subtasks>
                    </subtasks>
                <customfields>
                                                <customfield id="customfield_10050" key="com.atlassian.jira.toolkit:comments">
                        <customfieldname># Replies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>13.0</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10055" key="com.atlassian.jira.ext.charting:firstresponsedate">
                        <customfieldname>Date of 1st Reply</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>Tue, 12 Jun 2018 13:09:12 +0000</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10052" key="com.atlassian.jira.toolkit:dayslastcommented">
                        <customfieldname>Days since reply</customfieldname>
                        <customfieldvalues>
                                        5 years, 15 weeks, 5 days ago
    
                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_18254" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Dependencies</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue><![CDATA[]]></customfieldvalue>


                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_15850" key="com.atlassian.jira.plugins.jira-development-integration-plugin:devsummary">
                        <customfieldname>Development</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10057" key="com.atlassian.jira.toolkit:lastusercommented">
                        <customfieldname>Last comment by Customer</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>true</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_10056" key="com.atlassian.jira.toolkit:lastupdaterorcommenter">
                        <customfieldname>Last commenter</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>luke.bonanomi@mongodb.com</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_11151" key="com.atlassian.jira.toolkit:LastCommentDate">
                        <customfieldname>Last public comment date</customfieldname>
                        <customfieldvalues>
                            5 years, 15 weeks, 5 days ago
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10032" key="com.atlassian.jira.plugin.system.customfieldtypes:select">
                        <customfieldname>Operating System</customfieldname>
                        <customfieldvalues>
                                <customfieldvalue key="10026"><![CDATA[ALL]]></customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10051" key="com.atlassian.jira.toolkit:participants">
                        <customfieldname>Participants</customfieldname>
                        <customfieldvalues>
                                        <customfieldvalue>bruce.lucas@mongodb.com</customfieldvalue>
            <customfieldvalue>icruz</customfieldvalue>
            <customfieldvalue>kelsey.schubert@mongodb.com</customfieldvalue>
            <customfieldvalue>laxmanpv</customfieldvalue>
    
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_14254" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Product Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|hu0czj:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_12550" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>Rank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>2|htr8wn:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                                                <customfield id="customfield_10558" key="com.pyxis.greenhopper.jira:gh-global-rank">
                        <customfieldname>Rank (Obsolete)</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>9223372036854775807</customfieldvalue>
                        </customfieldvalues>
                    </customfield>
                                                                                            <customfield id="customfield_23361" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Requested By</customfieldname>
                        <customfieldvalues>
                                

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10750" key="com.atlassian.jira.plugin.system.customfieldtypes:textarea">
                        <customfieldname>Steps To Reproduce</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>&lt;p&gt;It has happened randomly, around 3 times in the last 3 weeks: the first two a few hours apart and on different servers (I think they were primaries at that time), and now again on a secondary.&lt;/p&gt;</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_10053" key="com.atlassian.jira.ext.charting:timeinstatus">
                        <customfieldname>Time In Status</customfieldname>
                        <customfieldvalues>
                            
                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_22870" key="com.onresolve.jira.groovy.groovyrunner:scripted-field">
                        <customfieldname>Triagers</customfieldname>
                        <customfieldvalues>
                                    <customfieldvalue><![CDATA[bruce.lucas@mongodb.com]]></customfieldvalue>
    

                        </customfieldvalues>
                    </customfield>
                    <customfield id="customfield_14350" key="com.pyxis.greenhopper.jira:gh-lexo-rank">
                        <customfieldname>serverRank</customfieldname>
                        <customfieldvalues>
                            <customfieldvalue>1|htzz8v:</customfieldvalue>

                        </customfieldvalues>
                    </customfield>
                                    </customfields>
    </item>
</channel>
</rss>