[SERVER-82398] Excessive memory and CPU consumption during normal operation Created: 24/Oct/23  Updated: 01/Jan/24

Status: Investigating
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Vladimir Beliakov Assignee: Yuan Fang
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 18.04.6 LTS
XFS
Kernel - 5.4.0-1088-aws #96~18.04.1-Ubuntu SMP Mon Oct 17 02:57:48 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Transparent Huge Pages (THP) disabled
AWS m5.2xlarge
SSD gp3, 450 GB


Attachments: Zip Archive 1 - replica-1 (was primary before the incident happened).zip     Zip Archive 2 - replica-2 (became primary during the incident).zip     Zip Archive 3 - replica-3 (didn't notice that it was primary, but had same problems).zip     PNG File SERVER-82398.png    
Assigned Teams:
Server Triage
Operating System: ALL
Participants:

 Description   

Hi!

On one of our shards, at some point the memory consumed by the primary replica started to grow rapidly, along with high CPU consumption. That replica then became unresponsive, and consequently another replica became the primary. Right after that, the same thing happened to the new primary.

The incident timeline:

  1. 10/24/23 7:40 - beginning (peak in CPU and memory consumption)
  2. 10/24/23 8:20-8:26 (we can't pin down the exact time) - the primary (replica-1) becomes unresponsive, another replica (replica-2) becomes the new primary, and we see a peak in CPU and memory consumption again
  3. 10/24/23 8:38 - the new primary (replica-2) becomes unresponsive, and another replica (replica-1) becomes the new primary
  4. 10/24/23 8:43 - the replica (replica-3) that never appeared to assume the primary role starts experiencing the same CPU and memory problems
  5. 10/24/23 9:20 - we manually restart replica-3 and the incident ends

Unfortunately, we couldn't get to the root of the problem, but here are some things we did observe:

  • we noticed that the number of open cursors jumped to 500 on the replicas mentioned above (we use Change Streams, so this might be related)
  • on replica-3 there were dozens of "hanging" aggregation commands (secs_running showed fairly large values, around 2000 seconds); a sketch of one way to inspect such operations follows this list
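
For reference, here is a minimal sketch (not part of the original report) of how the hanging aggregations and open-cursor counts could be inspected with PyMongo; the host name and the 1000-second threshold are illustrative assumptions:

```python
# Minimal sketch: inspect long-running aggregations and open-cursor counts.
# The host name and the 1000-second threshold are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://replica-3.example.net:27017")  # hypothetical host

# $currentOp lists in-progress operations, including "hanging" aggregations.
pipeline = [
    {"$currentOp": {"allUsers": True, "idleCursors": True}},
    {"$match": {"secs_running": {"$gte": 1000},
                "command.aggregate": {"$exists": True}}},
    {"$project": {"opid": 1, "secs_running": 1, "ns": 1, "command.pipeline": 1}},
]
for op in client.admin.aggregate(pipeline):
    print(op["opid"], op["secs_running"], op.get("ns"))

# Open/pinned cursor counts come from serverStatus (metrics.cursor.open).
cursors = client.admin.command("serverStatus")["metrics"]["cursor"]["open"]
print("open cursors:", cursors["total"], "pinned:", cursors["pinned"])
```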

Could you help us identify the cause of the problem?

I'm attaching the diagnostic data of the aforementioned replicas (the files are named replica-1, replica-2 and replica-3, corresponding to the replica numbers mentioned above).



 Comments   
Comment by Yuan Fang [ 14/Dec/23 ]

Hi vladimirred456@gmail.com,

Thank you for your report and for providing us with the FTDC. When looking at the FTDC of replica-1 (primary), starting from A (2023-10-23T07:40:06.497Z):

There is an increase in CPU utilization, escalating cache pressure (the cache fill ratio exceeds 80%), and a rise in cache eviction (ss wt cache pages queued for eviction, ss wt cache unmodified pages evicted). The read tickets become exhausted, and it appears the readers are waiting for cache space to become available (ss wt thread-yield page acquire time sleeping).
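
For context, background eviction kicks in once the cache fills past its eviction target (80% by default), so a fill ratio above 80% together with queued evictions usually means the eviction threads can't keep up. A minimal sketch of how these figures can be checked live with PyMongo, assuming the 5.0-era serverStatus layout (the host name is a placeholder):

```python
# Minimal sketch (assumes the MongoDB 5.0 serverStatus layout): check the
# WiredTiger cache fill ratio and ticket availability behind the FTDC metrics
# discussed above. The host name is a placeholder.
from pymongo import MongoClient

client = MongoClient("mongodb://replica-1.example.net:27017")  # hypothetical host
ss = client.admin.command("serverStatus")

cache = ss["wiredTiger"]["cache"]
fill_ratio = cache["bytes currently in the cache"] / cache["maximum bytes configured"]
dirty_ratio = cache["tracked dirty bytes in the cache"] / cache["maximum bytes configured"]
print(f"cache fill: {fill_ratio:.1%}  dirty: {dirty_ratio:.1%}")

# Exhausted read tickets show up here as read.available == 0.
tickets = ss["wiredTiger"]["concurrentTransactions"]
print("read tickets available:", tickets["read"]["available"],
      "| write tickets available:", tickets["write"]["available"])
```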

The readers are queued up and, as you also mentioned, ss metrics cursor open pinned increased to 415, indicating active use by running queries. I suspect there may have been a read workload change at A (even though there is no clear event in ss opcounters query); however, there are many spikes in ss metrics queryExecutor scanned after A. Do you know which queries are responsible for this?
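
One way to narrow this down (a sketch, not how the FTDC itself is read): sample ss metrics queryExecutor scanned over a short interval and list the in-progress operations doing collection scans, which are the usual source of such spikes. Host and interval below are placeholders.

```python
# Minimal sketch (assumption: sampling serverStatus live rather than replaying
# the FTDC): watch metrics.queryExecutor.scanned and list running collection
# scans. Host and sampling interval are placeholders.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://replica-1.example.net:27017")  # hypothetical host

prev = None
for _ in range(5):
    scanned = client.admin.command("serverStatus")["metrics"]["queryExecutor"]["scanned"]
    if prev is not None:
        print("index keys examined in the last second:", scanned - prev)
    prev = scanned
    time.sleep(1)

# Operations whose plan is a collection scan show up with planSummary COLLSCAN.
for op in client.admin.aggregate([
    {"$currentOp": {"allUsers": True}},
    {"$match": {"planSummary": "COLLSCAN"}},
    {"$project": {"opid": 1, "ns": 1, "secs_running": 1}},
]):
    print(op["opid"], op.get("ns"), op.get("secs_running"))
```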

I recommend gaining a better understanding of the workload during the incident. Could you please review the logs and check for any interesting slow queries on replica-1 in the period of A-B?
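
With the structured JSON logging used since MongoDB 4.4, one possible way to pull the slow-query entries out of mongod.log for that window is sketched below; the log path and the window bounds are assumptions (taking A as 07:40 on 24 Oct per the timeline).

```python
# Minimal sketch: filter "Slow query" entries from a mongod JSON log within a
# time window. LOG_PATH and the window bounds are assumptions.
import json
from datetime import datetime, timezone

LOG_PATH = "/var/log/mongodb/mongod.log"                    # hypothetical path
START = datetime(2023, 10, 24, 7, 40, tzinfo=timezone.utc)  # point A (assumed)
END = datetime(2023, 10, 24, 8, 30, tzinfo=timezone.utc)    # end of window (assumed)

with open(LOG_PATH) as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("msg") != "Slow query":
            continue
        ts = datetime.fromisoformat(entry["t"]["$date"].replace("Z", "+00:00"))
        if not (START <= ts <= END):
            continue
        attr = entry.get("attr", {})
        print(ts, attr.get("durationMillis"), attr.get("ns"),
              attr.get("planSummary"), str(attr.get("command"))[:120])
```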

Regards,
Yuan

Comment by Vladimir Beliakov [ 25/Oct/23 ]

I forgot to mention that our MongoDB version is 5.0.14
