[SERVER-82398] Excessive memory and CPU consumption during normal operation Created: 24/Oct/23 Updated: 01/Jan/24 |
|
| Status: | Investigating |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vladimir Beliakov | Assignee: | Yuan Fang |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 18.04.6 LTS |
| Attachments: | |
| Assigned Teams: | Server Triage |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Hi! On one of our shards, the memory consumption of the primary replica started to grow rapidly at some point, accompanied by high CPU usage. That replica then became unresponsive, and another replica was elected primary. Right after that, the same thing happened to the new primary. The incident timeline:
Unfortunately, we couldn't get to the root of the problem, but here are some things we could observe:
Could you help us identify the cause of the problem? I'm attaching the diagnostic data of the aforementioned replicas (the files are named replica-1, replica-2, and replica-3; these names correspond to the replica numbers mentioned above). |
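For illustration only (this is not part of the attached diagnostics), here is a minimal sketch of how the signals discussed in this ticket, such as resident memory, WiredTiger cache fill ratio, and pinned cursors, could be polled live from serverStatus. The connection string, the directConnection option, and the polling interval are assumptions, not values taken from the reporter's deployment:

```python
# A minimal monitoring sketch (assumptions: pymongo installed, mongod reachable
# on localhost:27017). It polls serverStatus and prints resident memory,
# WiredTiger cache fill ratio, and the number of pinned cursors.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017", directConnection=True)

while True:
    s = client.admin.command("serverStatus")
    cache = s["wiredTiger"]["cache"]
    fill_ratio = cache["bytes currently in the cache"] / cache["maximum bytes configured"]
    print(
        f"resident={s['mem']['resident']} MB  "
        f"cache_fill={fill_ratio:.0%}  "
        f"cursors_pinned={s['metrics']['cursor']['open']['pinned']}"
    )
    time.sleep(5)
```

FTDC already captures these serverStatus counters at higher resolution, so a poller like this is only a convenience for ad-hoc observation during an incident.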
| Comments |
| Comment by Yuan Fang [ 14/Dec/23 ] |
|
Thank you for your report and for providing us with the FTDC. Looking at the FTDC of replica-1 (the primary), starting from A (2023-10-23T07:40:06.497Z): there is an increase in CPU utilization, escalating cache pressure (cache fill ratio exceeding 80%), and a rise in cache evictions (ss wt cache pages queued for eviction, ss wt cache unmodified pages evicted). The read tickets become exhausted, and it appears readers are waiting for the cache to become available (ss wt thread-yield page acquire time sleeping). Readers are queued up, and, as you also mentioned, ss metrics cursor open pinned increased to 415, indicating active use by running queries.
I suspect there may have been a read workload change at A (even though no clear event is visible in ss opcounters query); however, there are many spikes in ss metrics queryExecutor scanned after A. Do you know which queries are responsible for this? I recommend gaining a better understanding of the workload during the incident. Could you please review the logs and check for any interesting slow queries on replica-1 in the period A-B?
Regards, |
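As an illustration of the log review suggested above, here is a hedged Python sketch that surfaces the slowest operations recorded in a mongod 5.0 structured (JSON) log within a time window. The log path and the window end time (standing in for B, which is not given in this ticket) are placeholders:

```python
# A minimal sketch, not an official workflow: rank "Slow query" entries from a
# mongod structured (JSON) log within a time window.
import json
from datetime import datetime, timezone

LOG_PATH = "/var/log/mongodb/mongod.log"  # assumed default path; adjust as needed
WINDOW_START = datetime(2023, 10, 23, 7, 40, 6, tzinfo=timezone.utc)  # A
WINDOW_END = datetime(2023, 10, 23, 8, 40, 0, tzinfo=timezone.utc)    # placeholder for B

def parse_ts(raw):
    # Structured-log timestamps look like "2023-10-23T07:40:06.497+00:00".
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))

slow = []
with open(LOG_PATH) as fh:
    for line in fh:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        if entry.get("msg") != "Slow query":
            continue
        ts = parse_ts(entry["t"]["$date"])
        if not (WINDOW_START <= ts <= WINDOW_END):
            continue
        attr = entry.get("attr", {})
        slow.append((
            attr.get("durationMillis", 0),
            attr.get("docsExamined", 0),
            attr.get("planSummary", ""),
            attr.get("ns", ""),
            ts.isoformat(),
        ))

# Longest-running operations first; large docsExamined values line up with the
# spikes in "ss metrics queryExecutor scanned" noted in the comment above.
for duration, examined, plan, ns, ts in sorted(slow, reverse=True)[:20]:
    print(f"{ts}  {duration:>8} ms  examined={examined:<10} plan={plan:<15} {ns}")
```

Sorting by durationMillis together with docsExamined is one way to correlate individual slow operations with the queryExecutor scanned spikes observed in the FTDC.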
| Comment by Vladimir Beliakov [ 25/Oct/23 ] |
|
I forgot to mention that our MongoDB version is 5.0.14 |