[SERVER-51684] Every 4 hours CPU usage is going high varying from 100% to 200% because of mongo SECONDARY / replica node reciding on server Created: 16/Oct/20  Updated: 28/Oct/20  Resolved: 28/Oct/20

Status: Closed
Project: Core Server
Component/s: Networking, Performance, Replication
Affects Version/s: 4.0.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Prakhar Makhija Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2020-10-23 at 11.55.14 AM.png     File mongo-open-connections.out     File mongod-2020-10-19.log     File mongod-2020-10-20.log    
Operating System: ALL
Participants:

 Description   

 
cat /var/log/mongodb/mongod.log | grep "Successfully connected to mongo1.production.internal:27017" | sort | uniq

2020-10-16T00:06:57.310+0000 I NETWORK [LogicalSessionCacheRefresh] Successfully connected to mongo1.production.internal:27017 (527 connections now open to mongo1.production.internal:27017 with a 0 second timeout)
2020-10-16T04:06:57.422+0000 I NETWORK [LogicalSessionCacheRefresh] Successfully connected to mongo1.production.internal:27017 (528 connections now open to mongo1.production.internal:27017 with a 0 second timeout)
2020-10-16T08:06:56.913+0000 I NETWORK [LogicalSessionCacheRefresh] Successfully connected to mongo1.production.internal:27017 (529 connections now open to mongo1.production.internal:27017 with a 0 second timeout)

 
mongod --version

db version v4.0.5
git version: 3739429dd92b92d1b0ab120911a23d50bf03c412
OpenSSL version: OpenSSL 1.0.2n 7 Dec 2017
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1404
distarch: x86_64
target_arch: x86_64

mongo --version

git version: 3739429dd92b92d1b0ab120911a23d50bf03c412
OpenSSL version: OpenSSL 1.0.2n 7 Dec 2017
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1404
distarch: x86_64
target_arch: x86_64

 

 



 Comments   
Comment by Prakhar Makhija [ 27/Oct/20 ]

Hi @edwin.zhou@mongodb.com

Thanks for sharing the analysis, and guiding me in the right direction

Will look into it

Regards,
Prakhar

Comment by Edwin Zhou [ 23/Oct/20 ]

Hi pmakhija@ask-fast.com,

Thank you for sharing your diagnostics with us. After some investigating, we suspect that an external process is increasing your system's reported CPU load.

Here is an analysis of the timeline as evidence of this:

Point A

System memory and CPU usage begins to increase at exactly 00:00 indicated by system memory anon. The mongod does not perform scheduled tasks like this. As system memory increases, memory allocated by the mongod stays unchanged as indicated by current_allocated_bytes. There is also an increase in CPU iowait, which isn't exactly CPU usage. Instead, this means that the CPU is needing to wait for i/o, as confirmed by the disk usage. However, looking at block-manager read/write, the mongod isn't doing more i/o.

Point B

Resident memory used by the mongod begins to decrease but allocated memory remains unchanged. We suspect an external process has allocated memory which caused paging and memory used by the mongod is forced out.

Point C
Memory used by the mongod returns to normal and it appears the external process has completed.

The mongod does not perform any tasks that occur at an exact time, i.e. exactly at midnight and every 4 hours. Since this appears to be a scheduled task, you should check to see if there are any running cron jobs that may be affecting your system. You can also run ps or top during the times that CPU iowait is high to identify running processes that increase load.

Best,
Edwin

Comment by Prakhar Makhija [ 20/Oct/20 ]

Hi @Bruce Lucas @Eric Sedor

My bad, the link was not public then, it is now

diagnostic.data.tar.gz

_____________________________________________

Regarding TimeZones

I fall in IST (GMT+5:30)

The SNMP alert email specifies CEST (GMT+2:00)

The mongod server is residing in Europe (Ireland) - Amazon Region (eu-west-1)
The mongod server replica node logs are being generated in GMT

_____________________________________________

SNMP Email Alerts

Date/Time: Tue Oct 20 14:09:03 CEST 2020
CRITICAL - CPU Load (5 min average): 128 %

Date/Time: Tue Oct 20 10:10:51 CEST 2020
CRITICAL - CPU Load (5 min average): 162 %

Date/Time: Tue Oct 20 02:13:28 CEST 2020
CRITICAL - CPU Load (5 min average): 138 %

Date/Time: Mon Oct 19 22:10:16 CEST 2020
CRITICAL - CPU Load (5 min average): 154 %

Date/Time: Mon Oct 19 18:12:04 CEST 2020
CRITICAL - CPU Load (5 min average): 182 %

Date/Time: Mon Oct 19 14:08:52 CEST 2020
CRITICAL - CPU Load (5 min average): 130 %

Date/Time: Mon Oct 19 10:10:42 CEST 2020
CRITICAL - CPU Load (5 min average): 176 %

Date/Time: Mon Oct 19 06:12:26 CEST 2020
CRITICAL - CPU Load (5 min average): 170 %

Date/Time: Mon Oct 19 02:14:10 CEST 2020
CRITICAL - CPU Load (5 min average): 145 %

Date/Time: Sun Oct 18 22:10:59 CEST 2020
CRITICAL - CPU Load (5 min average): 180 %

Date/Time: Sun Oct 18 18:12:44 CEST 2020
CRITICAL - CPU Load (5 min average): 208 %

Date/Time: Sun Oct 18 14:09:30 CEST 2020
CRITICAL - CPU Load (5 min average): 122 %

Date/Time: Sun Oct 18 10:11:19 CEST 2020
CRITICAL - CPU Load (5 min average): 206 %

Date/Time: Sun Oct 18 06:13:05 CEST 2020
CRITICAL - CPU Load (5 min average): 164 %

_____________________________________________

Server Mongo Logs

mongo-open-connections.out

mongod-2020-10-20.log

mongod-2020-10-19.log

Comment by Bruce Lucas (Inactive) [ 20/Oct/20 ]

Hi pmakhija@ask-fast.com, thanks for collecting the data. Unfortunately the link you provided produces an access denied error when we try to use it.

Can you please attach the data to this ticket? If it is too large to attach, you can upload it to this secure private upload portal.

Also, can you please provide a timeline of your issue: which node is seeing the problem at what specific time (including timezone)?

Comment by Prakhar Makhija [ 20/Oct/20 ]

Hi @eric.sedor , please find the tarball attached

 diagnostic.data.tar.gz

Feel free to let me know if anything else is needed

Thanks,

Prakhar

Comment by Eric Sedor [ 19/Oct/20 ]

Hi pmakhija@ask-fast.com, can you please provide more information about the issue you are reporting? High CPU use could have many causes.

As well, would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket?

Thanks,
Eric

Generated at Thu Feb 08 05:26:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.