[SERVER-64321] MongoDB instance stops accepting connections after some time of operation Created: 08/Mar/22  Updated: 19/Oct/22  Resolved: 17/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Yorman Mirabal Assignee: Chris Kelly
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 03-28 Full-1.png     PNG File 03-28 Full.png     PNG File 308.png     PNG File 314.png     PNG File 320.png     PNG File Capture Big Picture.PNG     PNG File Capture Close Up.PNG     PNG File cyclic-1.png     PNG File cyclic.png     PNG File image-2022-03-22-15-57-16-139.png     PNG File image-2022-03-22-16-02-37-962.png     PNG File image-2022-03-22-16-03-35-435.png     PNG File image-2022-03-22-16-04-07-759.png     PNG File image-2022-03-22-16-56-08-952.png     PNG File image-2022-03-22-17-01-40-505.png     PNG File image-2022-03-23-07-26-46-029.png     PNG File image-2022-03-23-07-27-31-029.png     PNG File image-2022-03-23-07-31-41-195.png     PNG File image-2022-04-13-19-37-18-051.png    
Issue Links:
Related
Operating System: ALL
Participants:

 Description   

We have a MongoDB replica set running on 3 AWS EC2 instances. The servers are running version 4.0. The issue is that after a certain period of time (7-14 days) operating flawlessly, the servers stop accepting connections and we need to restart the instances to make them work again.



 Comments   
Comment by Yorman Mirabal [ 19/Oct/22 ]

Hello! We continue to experience the same issue, but now it happens daily... sometimes twice a day. Our database has been growing steadily, and with it the unresponsiveness started to happen more frequently. To work around it we had to develop a service that checks the connection to the servers every 7 minutes and, in case of failure, issues a restart.
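Conceptually, the check is roughly equivalent to the sketch below (simplified for illustration only; the host names, log path, and restart mechanism are placeholders, not our actual setup):

#!/bin/bash
# Simplified sketch of the connectivity watchdog (placeholders only, not our production code)
HOSTS="mongo-1.example.internal mongo-2.example.internal mongo-3.example.internal"

for HOST in $HOSTS; do
    # The ping command needs no authentication; a refused or timed-out connection makes the shell exit non-zero
    if ! mongo --host "$HOST" --quiet --eval 'db.adminCommand({ ping: 1 })' > /dev/null 2>&1; then
        echo "$(date -u) $HOST not responding, restarting mongod" >> /var/log/mongo-watchdog.log
        ssh "$HOST" 'sudo systemctl restart mongod'
    fi
done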

Our cluster runs on AWS EC2 Instances

Comment by Chris Kelly [ 17/May/22 ]

Hi Yorman,

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Christopher

Comment by Chris Kelly [ 09/May/22 ]

Hi Yorman,

We still need additional information to diagnose your problem. If this is still an issue for you, we would ask you to provide the information previously requested. Specifically:

Download the mdiag.sh script from our github repository. You must run this script on each node of your replica set.

Note: This script should be run from the command line/OS shell with root/sudo privileges as follows:

sudo bash mdiag.sh SERVER-XXXXX

It will take about 5 minutes to run in each case. Once complete, please attach the resulting /tmp/mdiag-$HOSTNAME.txt from the host(s) to the ticket.

Answering the following:

  • Do ALL of your nodes exhibit the same connection issues as the node you've been submitting log data for?
  • Are you running mongod in docker by any chance?

Regards,
Christopher

Comment by Chris Kelly [ 18/Apr/22 ]

Hi Yorman,

After some further discussion about your situation, we have some suspicions we'd like to look into specific to your OS/HW configuration.

Specifically, the pattern of high disk utilization, page faults, iowait, and needing to restart due to unresponsiveness are characteristic of OS paging due to memory pressure. The high iowait in these instances means that nothing else can run while waiting for IO, likely because the data was paged out.

  • mongod memory usage ("resident") increases to about 7 GB simply as a result of the WT cache filling ("bytes currently in the cache", "cache fill ratio") to about 50%. This looks normal.
  • Total system heap usage ("memory anon total") is also only about 7 GB, i.e. there isn't another process on the system using a lot of memory.
  • The o/s thinks there is plenty of free memory ("system memory free"), confirming that there isn't anything using a lot of memory.
  • However "system memory file total", "system memory cached" show the o/s disk cache first filling, then decreasing, which possibly means that the o/s has begun thinking there's memory pressure.

There is an OS setting "vm.swappiness" that causes the OS to proactively swap data out even if there isn't a lot of memory pressure. Normally, this is for swapping out unused heap data in favor of useful file cache data, but file cache is decreasing here.
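For reference, the current value can be inspected, and a lower value tried, with standard sysctl commands; the value of 1 below is only an example, not a recommendation at this stage:

# Check the current setting on each node
sysctl vm.swappiness

# Temporarily apply a lower value (takes effect immediately, not persisted across reboots)
sudo sysctl -w vm.swappiness=1

# Persist the setting across reboots
echo 'vm.swappiness = 1' | sudo tee -a /etc/sysctl.conf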

We would need you to run a diagnostic script to gather system and hardware information so we can look into this further. Please download the mdiag.sh script from our github repository. You must run this script on each node of your replica set.

Note: This script should be run from the command line/OS shell with root/sudo privileges as follows:

sudo bash mdiag.sh SERVER-XXXXX

It will take about 5 minutes to run in each case. Once complete, please attach the resulting /tmp/mdiag-$HOSTNAME.txt from the host(s) to the ticket.

Also, if you could clarify:

  • Do ALL of your nodes exhibit the same connection issues as the node you've been submitting log data for?
  • Are you running mongod in docker by any chance?

Your mdiag results and this additional information will help us narrow down exactly what's going on.

Regards,
Christopher

 

Comment by Chris Kelly [ 13/Apr/22 ]

Hi Yorman,

Thanks for uploading the additional log files. I was able to see additional information in your mongo.log this time, leading up to another incident in the logs, specifically at about 2022-03-26T20:15:04.677Z UTC. This timestamp follows the same pattern seen on the other dates you mentioned.

However, after looking through your FTDC metrics and the logs for 3/28, I wasn't able to find the same poor performance in the metrics as we found in the other incidents on 3/8 and 3/14. On 3/26, I saw the repeating pattern we observed previously:

  • A large number of WiredTiger transactions begin at 2022-03-26T20:15:04.677Z (when a large number of aggregate queries come in)
  • This is followed immediately by high disk utilization
  • Ping times between the nodes increase, along with a stark increase in page faults (which typically implies memory overcommitment)
  • The primary node fails over to a secondary and rolls back as a result of severely degraded performance at 2022-03-26T20:17:59.091Z. Disk utilization temporarily goes down.
  • High disk utilization resumes up until FTDC metrics stop being recorded at point G (2022-03-26T20:24:37.642Z)

The main symptom I observe is a pattern of high disk utilization and network lag in the incidents I'm seeing, but I'm still missing information on the exact time you notice the nodes rejecting connections: the logs simply stop on 3/26, and I only see them return briefly on 3/28 with no similar symptoms. In the logs, I saw some connections being closed due to "Broken Pipe" exceptions during the lag, but this is on 3/26, not on 3/28 as you mention.

We do not observe high memory or WiredTiger cache usage by this mongod at any point, even though a number of queries come in exactly when the issue starts. We do see disk utilization quickly increase at the points where a large number of queries arrive:

  • at Point C (2022-03-26T20:15:04.677Z), dozens of aggregates all at once, followed by lag
  • at Point F (2022-03-26T20:21:45.437Z), findAndModify command followed by lag again

 

At this point, it is not clear what the specific correlation between disk usage, network lag, and the eventual failure to accept connections is in your case. The queries mentioned seem to line up with the load, but there doesn't appear to be significant cache or memory usage by mongod or by the system as a whole.

In order to get more information, can you clarify:

  • How are you connecting to the nodes to observe this behavior (connections being refused)?
  • What exactly is happening at the point the nodes are rejecting connections? Specifically, what is the output you see? 
  • Do you have more specific times that you can share? Do they align with the timestamps that have been provided so far showing the high disk utilization and network lag?
  • Was the node accessible at all on 3/27? There are no metrics for this time period. There is a gap from Point G (2022-03-26T20:24:37.642Z) to 3/28 (2022-03-28T11:45:43.631Z).

 

Regards,

Christopher

 

Comment by Yorman Mirabal [ 10/Apr/22 ]

OK, I used the link provided and uploaded the requested files covering the last incident, from March 28th.

Thank you very much Chris.

Comment by Chris Kelly [ 05/Apr/22 ]

Hi Yorman,

 
It appears that the logs aren't present in the upload box, possibly because they have the same names as files already in the box. I've gone ahead and generated a new link for you to submit the logs to, which should fix that.

If there were any additional incidents in the interim, please feel free to also add them and point out any new behaviors if they arise. Just to make sure we get all the information, please also double-check that your uploaded mongo.log file covers the time period of the incident. 
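For example, a quick way to confirm the range a log file covers is to look at its first and last entries (the path below is illustrative; substitute the actual location of your mongo.log):

# Print the first and last log lines to see the covered time range
head -n 1 /path/to/mongo.log
tail -n 1 /path/to/mongo.log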

 

Regards,

Christopher

Comment by Yorman Mirabal [ 28/Mar/22 ]

Hi Christopher, thank you for such a detailed report about the issues. As far as I can tell, all the incidents have the same behavior: the server stops accepting connections, and in order to get it back into operation we restart it.

We do have a job that runs daily which creates a database dump and uploads the file to an S3 bucket. Just that.

This is the content of the dbBackup file

#!/bin/bash

#Force file synchronization and lock writes
mongo admin --username smb --password "<<password>>" --eval "printjson(db.fsyncLock())"

MONGODUMP_PATH="/usr/bin/mongodump"
MONGO_DATABASE="<<dbName>>"

TIMESTAMP=`date +%F-%H%M`
S3_BUCKET_NAME="<<bucket-name>>"

# Create backup
$MONGODUMP_PATH --username=smb --password="<<password>>" --db=$MONGO_DATABASE --authenticationDatabase=admin --archive=./mongodb-$MONGO_DATABASE-$TIMESTAMP.gz --gzip

 

# Upload to S3
export AWS_ACCESS_KEY_ID=<<Access Key>>
export AWS_SECRET_ACCESS_KEY=<<Secret Key>>
export AWS_DEFAULT_REGION=us-west-2
s3cmd put mongodb-$MONGO_DATABASE-$TIMESTAMP.gz s3://$S3_BUCKET_NAME/mongodb-$MONGO_DATABASE-$TIMESTAMP.gz

#Unlock database writes
mongo admin --username smb --password "<<password>>" --eval "printjson(db.fsyncUnlock())"
#Delete local files
rm -rf mongodb*
#restart
reboot

......
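For reference, a check like the following (same placeholders as above) could be added after the unlock step to confirm the lock was released; db.currentOp() only reports fsyncLock while the lock is still held:

# Prints true while the fsync lock is held, undefined once it has been released
mongo admin --username smb --password "<<password>>" --quiet --eval "printjson(db.currentOp().fsyncLock)"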

Also, I'm providing logs for today and yesterday; we had another incident.

Thanks for all the help Chris

Comment by Chris Kelly [ 24/Mar/22 ]

Hi yorman.m@smartbarrel.io,

Thank you for your additional submission. After reviewing your metrics, we were able to notice a pattern in the incidents on 03-08 and 03-14. The first two incidents show significant disk utilization on nvme0n1; however, it is unclear what the source of this is. The block-manager bytes-read metric seems to increase during the period of high disk utilization, but the gaps in the metrics make it difficult to tell.

 

03-08

At point A (2022-03-08T11:03:58.899Z), we have a sudden spike of disk usage and iowait, followed by lag in the nodes. A failover occurs at point B (2022-03-08T11:05:58.803Z), and disk utilization drops. The node is re-elected as primary and operates normally for about 38 seconds until disk utilization climbs again.

03-14

Lag starts at point D (2022-03-14T18:23:41.399Z) and resolves after a failover at point E (2022-03-14T18:24:42.290Z). The cluster appears to operate normally for about 2 minutes after the original primary is re-elected, before the lag spike/high iowait/high disk utilization return at point F (2022-03-14T18:27:50.721Z).

 

03-20

The previously observed metrics for nvme0n1 don’t exist at the 03-20 timestamp. However, we also don’t observe high disk utilization or gaps in the FTDC as in the other incidents, and the only data we see is a slight uptick in user process CPU usage. It’s worth noting that this restart occurred after only 5.7 days, instead of 6.2 on 03-14, and 6.7 on 03-08.

The fact that this is happening on a near-weekly basis is also rather suspect; it's possible there is a regularly scheduled external job of some sort running around this time, since the load doesn't appear to be strongly related to the mongod at first glance. The 03-20 incident is also strange because the restart occurred before any significant disk usage was observed in the metrics.
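If it helps narrow that down, a few standard Linux commands will list what is scheduled on each node (shown only as a suggestion; adjust for your distribution):

# Per-user and system-wide cron entries
crontab -l
sudo cat /etc/crontab
sudo ls /etc/cron.d /etc/cron.daily /etc/cron.weekly

# systemd timers, if the hosts use them
systemctl list-timers --all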

Because of the gaps in the FTDC, and the lack of mongo.log data covering any of these incidents (logs were rotated before each incident took place), it is difficult to reach a conclusion. In order to gather more information, we will need the mongo.log that covers the next incident when it occurs.

Could you please help us with the following:

  • What behavior occurred on 03-20 that triggered the restart? Was it different from the behavior on 03-08 and 03-14?
  • Could you send us another FTDC and mongo.log covering the timeframe of the incident when it occurs again?

 

Regards,

Christopher

Comment by Yorman Mirabal [ 21/Mar/22 ]

Hi Chris, today we had another incident; I uploaded the FTDC and log files.

Thank you

Comment by Chris Kelly [ 17/Mar/22 ]

Hi Yorman,

I took a look at the diagnostic.data you provided and found a restart that occurred around the timestamp 2022-03-14 18:23:00 UTC. Is this the restart you had to do because the node was unresponsive?

Unfortunately, I do not see data in your mongo.log after 2022-03-14 03:22:05 UTC. If possible, can you supply the mongo.log which covers the above timestamp so we can investigate further?
Or, if the incident occurs again, can you reupload the FTDC and log files for the new incident?

You can reuse the provided support upload link here.

Thanks!
Christopher

Comment by Yorman Mirabal [ 15/Mar/22 ]

Hello Chris, thank you for lending us a hand. Using the provided snippet, I just uploaded the files containing the diagnostic.data from the primary server of the replica set, along with the mongo.log file from the date of our last connection-issues event.

Thank you very much Chris, I hope the provided information helps to find a solution to our problem.

Comment by Chris Kelly [ 11/Mar/22 ]

Hi Yorman,

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?
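For example, something along these lines will produce a single archive per node (the log and dbpath locations below are illustrative; please substitute your actual paths):

# Bundle the mongod logs and the diagnostic.data directory into one archive
tar czvf mongodb-diagnostics-$(hostname).tar.gz \
    /var/log/mongodb/mongod.log* \
    /var/lib/mongodb/diagnostic.data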

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. 

Regards,
Christopher
