[SERVER-75881] mongod 4.4 CPU hogs/stalls leading to client queries timedout Created: 08/Apr/23 Updated: 31/May/23 Resolved: 31/May/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | venkataramans rama | Assignee: | Yuan Fang |
| Resolution: | Done | Votes: | 24 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Hi Team, Since migrating to MongoDB 4.4, we have been encountering a recurring issue of client queries timing out frequently. Upon monitoring the CPU usage during these occurrences, we have noticed that mongod's CPU usage is unusually high, despite there not being a significant increase in operations.
This erratic CPU usage has been observed intermittently, and our application has a monitoring thread that checks the replicStatus and reports NO_PRIMARY during such times.
Mongo- Server Version: 4.4.18 Mongo Client Java Driver version: 3.12.9 Standalone or Replica : 4 data bearing + 1 arbiter Server Option: root 52039 161 2.5 12649240 7229872 ? S<Ll Apr06 4415:45 /usr/bin/mongod --ipv6 --slowms 500 --storageEngine wiredTiger --wiredTigerCacheSizeGB 22 --enableMajorityReadConcern false --bind_ip_all --keyFile /root/.dbkey --port 27717 --dbpath=/var/data/sessions.1/h --replSet set01h --fork --pidfilepath /var/run/sessionmgr-27717.pid --oplogSize 5120 --logpath /var/log/mongodb-27717.log --logappend --quiet Storage Engine : WT
Application Monitors the replicaStatus periodically reporting, TYPE=UNKNOWN for PRIMARY intermittently. 2023-04-07 23:02:30,944 [pool-2-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=795436, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414267859844}, ServerDescription{address=nd2bwa3psm15va:27717, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1152849, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414649210810}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=943462, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414619113856}]}}Immediately in the next sec, the replicaSet detects PRIMARY 2023-04-07 23:02:31,260 [pool-2-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]} , minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=772997, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116414767866623}, ServerDescription{address=nd2bwa3psm15va:27717, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=16803136, setName='set01h', canonicalAddress=nd2bwa3psm15va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=7fffffff0000000000000080, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116414870305431}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1152849, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414649210810}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=943462, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414619113856}]}}
But again moves to UKNOWN and later to a proper state 2023-04-07 23:02:31,544 [pool-2-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=729934, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415267757469}, ServerDescription{address=nd2bwa3psm15va:27717, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1059673, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415149889939}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=941532, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415120136591}]}}2023-04-07 23:02:31,764 [pool-3-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]} , minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=729934, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415267757469}, ServerDescription{address=nd2bwa3psm15va:27717, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=13549634, setName='set01h', canonicalAddress=nd2bwa3psm15va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=7fffffff0000000000000080, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415370985974}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1059673, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415149889939}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=941532, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415120136591}]}} top - 23:02:27 up 1 day, 7:17, 0 users, load average: 21.07, 23.62, 26.89 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND – PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND – PID : 52039 mongostats
The response time goes to minutes ns total read write 2023-04-07T23:02:27Z ns total read write 2023-04-07T23:02:29Z ns total read write 2023-04-07T23:02:30Z ns total read write 2023-04-07T23:02:31Z ns total read write 2023-04-07T23:02:32Z ns total read write 2023-04-07T23:02:33Z ns total read write 2023-04-07T23:02:34Z
CPU idle% goes to 0.0 11:02:09 PM CPU %user %nice %system %iowait %steal %idle Linux 4.18.0-372.26.1.el8_6.x86_64 (nd2bwa3psm15va) 04/07/2023 x86_64 (10 CPU) 11:02:09 PM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15 blocked
SM16va: SECONDARY member mongostat insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
15vb:SECONDARY member mongostat insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
16vb:SECONDARY member mongostat nsert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
Since the PRIMARY seems unavailable, infact NO primaries are seen for 1-2 sec, the application(client) queries are getting timedout resulting in bad customer experience. We tried to reduce the load(query) to the mongo but it looks the same. We use /etc/hosts for name resolution and do not use DNS at all. We need to know why time to time, mongod hogs.
Please provide a secure portal, I can share the diagnostics.data and complete mongodb log during this period. |
| Comments |
| Comment by Yuan Fang [ 31/May/23 ] |
|
Closing this ticket as further investigation has been moved to another thread. |
| Comment by venkataramans rama [ 11/Apr/23 ] |
|
Thanks a lot Yuan for the analysis. Yes the date was April 7. Sorry for the typo. Yes, we also see primary insert/update reduced during this time but we believe its due to Mongo high CPU and not reachable to clients. Also with the same setup if we run with mongo 4.2/4.0 we are not observing this issue. So we believe its something unique to mongo 4.4 and hence we feel a regression issue. Also this is not just happening in one setup but multiple setups. So we don't feel its related to network issue. Also mongod primary reports IDHACK queries (optimized queries) also as Slow query because, its doing some heavy operations (which we don't know what) which is causing the HIGH CPU and Slow query response for even IDHACK queries. Please let me know if you need any additional details to find the cause of this problem.
|
| Comment by Yuan Fang [ 10/Apr/23 ] |
|
Hi veramasu@hcl.com, Thank you for providing the diagnostic data,
I noticed that in your comment you referred to the timeline April 9th, but I believe you meant to refer to April 7th instead, since the FTDC data does not cover April 9th.
Based on my analysis of the FTDC data from the primary node, I observed some slow queries and increased latencies around point B (2023-04-07T23:02:29.747Z), which appear to be caused by network issues. This is supported by the decrease in ss network bytesIn and ss network bytesOut between points A and B. The increase in CPU usage is likely due to the server attempting to handle these slow queries. Additionally, while the status of the primary node's self state remained healthy, it was seen as down from secondary node 4 during a period after point B, which I suspect may also be related to network issues.
I'm unsure if you meant to suggest a regression after upgrading. If so, could you please clarify and provide more information? Regards, |
| Comment by venkataramans rama [ 10/Apr/23 ] |
|
Thanks Team for the portal link. I have uploaded all the 4 members mongod.log and diagnostics.data Please refer to the timeline April 9th from 23:02:10 to 23:02:38. |
| Comment by Yuan Fang [ 10/Apr/23 ] |
|
Hi veramasu@hcl.com, I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time. For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:
Regards, |