[SERVER-64321] MongoDB instance stops accepting connections after some time running Created: 08/Mar/22 Updated: 19/Oct/22 Resolved: 17/May/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Yorman Mirabal | Assignee: | Chris Kelly |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | | |
| Remaining Estimate: | Not Specified | | |
| Time Spent: | Not Specified | | |
| Original Estimate: | Not Specified | | |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
We have a MongoDB replica set running on 3 AWS EC2 instances. The servers are running version 4.0. The issue is that, after a certain period of time (7-14 days) of operating flawlessly, the servers stop accepting connections and we need to restart the instances to make them work again. |
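Not part of the original report, but a quick way to check, while the problem is happening, whether the node is simply running out of available incoming connections or file descriptors. The host, port, and use of the legacy mongo shell are assumptions for a 4.0 deployment:

    #!/bin/bash
    # Hypothetical spot-check on an affected replica set member (host/port assumed).
    HOST="localhost"
    PORT=27017

    # Current vs. available incoming connections as reported by serverStatus.
    mongo --host "$HOST" --port "$PORT" --quiet --eval '
        var c = db.serverStatus().connections;
        print("current:   " + c.current);
        print("available: " + c.available);
    '

    # Open-file limit of the running mongod process, which caps connections.
    MONGOD_PID=$(pgrep -x mongod | head -n 1)
    grep "Max open files" "/proc/$MONGOD_PID/limits"

If "available" drops toward zero or the open-file limit is low, new connections are refused, which would match the symptom described above.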
| Comments |
| Comment by Yorman Mirabal [ 19/Oct/22 ] | |
|
Hello! We continue experiencing the same issue, but now it happens daily... sometimes twice a day. Our database has been growing steadily, and with it the unresponsiveness started to happen more frequently. To work around it, we had to develop a service that checks the connection to the servers every 7 minutes and, in case of failure, issues a restart. Our cluster runs on AWS EC2 instances. | |
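A minimal sketch of the kind of watchdog described above, assuming a systemd-managed mongod and the legacy mongo shell; the script path, timeout, and cron schedule are illustrative, not details from the ticket:

    #!/bin/bash
    # Hypothetical watchdog, run from cron every 7 minutes, e.g.:
    #   */7 * * * * /usr/local/bin/mongo-watchdog.sh
    # Pings the local mongod and restarts it if the ping does not answer in time.
    if ! timeout 20 mongo --quiet --eval 'db.adminCommand({ping: 1})' >/dev/null 2>&1; then
        logger -t mongo-watchdog "mongod not answering ping, restarting"
        systemctl restart mongod    # assumption: mongod runs under systemd
    fi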
| Comment by Chris Kelly [ 17/May/22 ] | |
|
Hi Yorman, We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards, | |
| Comment by Chris Kelly [ 09/May/22 ] | |
|
Hi Yorman, We still need additional information to diagnose your problem. If this is still an issue for you, we would ask you to provide the information previously requested. Specifically: Download the mdiag.sh script from our GitHub repository. You must run this script on each node of your replica set. Note: This script should be run from the command line/OS shell with root/sudo privileges as follows:
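The command itself is not preserved in this export; a typical invocation of the support script, assuming it was downloaded into the current directory, would be:

    sudo bash mdiag.sh SERVER-64321    # writes /tmp/mdiag-$HOSTNAME.txt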
It will take about 5 minutes to run in each case. Once complete, please attach the resulting /tmp/mdiag-$HOSTNAME.txt from the host(s) to the ticket, answering the following:
Regards, | |
| Comment by Chris Kelly [ 18/Apr/22 ] | |
|
Hi Yorman, After some further discussion about your situation, we have some suspicions, specific to your OS/HW configuration, that we'd like to look into. Specifically, the pattern of high disk utilization, page faults, iowait, and needing to restart due to unresponsiveness is characteristic of OS paging due to memory pressure. The high iowait in these instances means that nothing else can run while waiting for IO, likely because the data was paged out.
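Not something requested in the ticket, but a quick way to confirm from a shell on each EC2 host whether the node is actually dipping into swap; these are standard Linux tools and no deployment-specific values are assumed:

    # How much swap is configured and in use?
    free -m

    # Swap-in/swap-out activity per second (si/so columns) over ten seconds.
    vmstat 1 10

    # Current kernel swappiness setting (the distribution default is usually 60).
    cat /proc/sys/vm/swappiness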
There is an OS setting, "vm.swappiness", that causes the OS to proactively swap data out even if there isn't a lot of memory pressure. Normally this is for swapping out unused heap data in favor of useful file cache data, but file cache is decreasing here. We need you to run a diagnostic script to gather system and hardware information so we can look into this further. Please download the mdiag.sh script from our GitHub repository. You must run this script on each node of your replica set. Note: This script should be run from the command line/OS shell with root/sudo privileges as follows:
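The exact command is again not preserved in this export; a typical invocation, assuming the script sits in the current directory, is:

    sudo bash mdiag.sh SERVER-64321    # writes /tmp/mdiag-$HOSTNAME.txt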
It will take about 5 minutes to run in each case. Once complete, please attach the resulting /tmp/mdiag-$HOSTNAME.txt from the host(s) to the ticket. Also, if you could clarify:
Your mdiag results and this additional information will help us narrow down exactly what's going on. Regards,
| |
| Comment by Chris Kelly [ 13/Apr/22 ] | |
|
Hi Yorman, Thanks for uploading the additional log files. This time I was able to see additional information in your mongo.log leading up to another incident. However, after looking through your FTDC metrics and the logs for 3/28, I wasn't able to find the same poor performance in the metrics as we found in the other occurrences on 3/8 and 3/14. On 3/26, I saw the repeating pattern we had previously observed.
The main symptom I observe is a pattern of high disk utilization and network lag in the instances I'm seeing, but I'm still missing information on the exact time you notice the nodes rejecting connections, as the logs simply stop on 3/26 and only return briefly on 3/28 with no similar symptoms. From the logs, I saw some connections being closed due to "Broken Pipe" exceptions during the lag, but this is on 3/26, not on 3/28 as you mention. We do not observe high memory or WiredTiger cache usage by this mongod at any point, even though there are a number of queries coming in exactly when the issue starts. We see disk utilization quickly increase at points where a large number of queries come in.
At this point, it is not clear what the specific correlation between disk usage, network lag, and the eventual failure to accept connections is in your case. The queries mentioned seem to line up with the load, but there doesn't appear to be significant cache or memory usage by mongod or the system as a whole. In order to get more information, can you clarify:
Regards, Christopher
| |
| Comment by Yorman Mirabal [ 10/Apr/22 ] | |
|
OK, I used the link provided and uploaded the requested files covering the last incident, from March 28th. Thank you very much, Chris. | |
| Comment by Chris Kelly [ 05/Apr/22 ] | |
|
Hi Yorman, If there were any additional incidents in the interim, please feel free to also add them and point out any new behaviors if they arise. Just to make sure we get all the information, please also double-check that your uploaded mongo.log file covers the time period of the incident.
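One simple way to verify that an uploaded log covers a given incident window, assuming the default log path and the ISO-8601 timestamp a 4.0 mongod prefixes to each line (the date used here is only an example):

    LOG=/var/log/mongodb/mongod.log    # assumption: default package log path

    # First and last timestamps present in the file.
    head -n 1 "$LOG" | cut -c1-23
    tail -n 1 "$LOG" | cut -c1-23

    # Count entries from the day of the incident (date is illustrative).
    grep -c '^2022-03-28T' "$LOG"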
Regards, Christopher | |
| Comment by Yorman Mirabal [ 28/Mar/22 ] | |
|
Hi Christopher, thank you for such a detailed report about the issues. As far as I can tell, all the incidents have the same behavior: the server stops accepting connections and, in order to get it back into operation, we restart it. We do have a job that runs daily which takes care of creating a database dump and uploading the file to an S3 bucket. Just that. This is the content of the dbBackup file:

    #!/bin/bash
    # Force file synchronization and lock writes
    MONGODUMP_PATH="/usr/bin/mongodump"
    TIMESTAMP=`date +%F-%H%M`
    # Create backup
    # Upload to S3
    # Unlock database writes
    ......

Also, I'm providing logs for today and yesterday; we had another incident. Thanks for all the help, Chris |
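The middle of the script is elided above; purely for illustration, a backup job of that shape (lock, dump, upload, unlock) might look like the sketch below. The bucket name, paths, and the use of fsyncLock on this particular node are assumptions, not details taken from the ticket:

    #!/bin/bash
    # Illustrative sketch only - not the reporter's actual script.
    MONGODUMP_PATH="/usr/bin/mongodump"
    TIMESTAMP=$(date +%F-%H%M)
    BACKUP_DIR="/backup/dump-$TIMESTAMP"            # assumed path
    S3_BUCKET="s3://example-bucket/mongo-backups"   # assumed bucket

    # Force file synchronization and lock writes
    mongo --quiet --eval 'db.fsyncLock()'

    # Create backup
    "$MONGODUMP_PATH" --out "$BACKUP_DIR"

    # Unlock database writes as soon as the dump finishes
    mongo --quiet --eval 'db.fsyncUnlock()'

    # Upload to S3
    tar czf "$BACKUP_DIR.tar.gz" -C "$(dirname "$BACKUP_DIR")" "$(basename "$BACKUP_DIR")"
    aws s3 cp "$BACKUP_DIR.tar.gz" "$S3_BUCKET/"

While db.fsyncLock() is held the node blocks writes, so in a sketch like this the unlock is placed immediately after the dump rather than after the (potentially slow) upload.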
| Comment by Chris Kelly [ 24/Mar/22 ] | |
|
Thank you for your additional submission. After reviewing your metrics, we noticed a pattern in the incidents on 03-08 and 03-14. The first two incidents show significant disk utilization on nvme0n1; however, it is unclear what the source of this is. The block-manager bytes-read metric seems to increase during the period of high disk utilization, but the gaps in the metrics make it difficult to tell.
03-08: At point A (2022-03-08T11:03:58.899Z), we have a sudden spike of disk usage and iowait, followed by lag in the nodes. A failover occurs at point B (2022-03-08T11:05:58.803Z), and disk utilization drops. The node is re-elected as primary and operates normally for about 38 seconds until disk utilization climbs again.
03-14: Lag starts at point D (2022-03-14T18:23:41.399Z), resuming normally after a failover at point E (2022-03-14T18:24:42.290Z). It appears to operate normally after returning to the original primary node for about 2 minutes before the lag spike/high iowait/disk utilization return at point F (2022-03-14T18:27:50.721Z).
03-20: The previously observed metrics for nvme0n1 don't exist at the 03-20 timestamp. However, we also don't observe high disk utilization or gaps in the FTDC as in the other incidents, and the only data we see is a slight uptick in user-process CPU usage. It's worth noting that this restart occurred after only 5.7 days, instead of 6.2 on 03-14 and 6.7 on 03-08. The fact that this is happening on a near-weekly basis is also rather suspect - it's possible there is a regularly scheduled external job of some sort running around this time, since it doesn't appear to be strongly related to the mongod at first glance. The 03-20 incident is also strange because the restart occurred before any significant disk usage was observed in the metrics.
Because of gaps in the FTDC, and the lack of mongo.log data covering any of these incidents (logs were rotated before each incident took place), it is difficult to reach a conclusion. In order for us to gather any more information, we will need the mongo.log that covers the incident the next time it occurs. Could you please help us with the following:
Regards, Christopher | |
| Comment by Yorman Mirabal [ 21/Mar/22 ] | |
|
Hi Chris, today we had another incident; I uploaded the FTDC and log files. Thank you |
| Comment by Chris Kelly [ 17/Mar/22 ] | |
|
Hi Yorman, I took a look at your provided diagnostic.data and found a restart that occurred around the timestamp 2022-03-14 18:23:00 UTC. Is this the restart you had to do because it was unresponsive? Unfortunately, I do not see data in your mongo.log after 2022-03-14 03:22:05 UTC. If possible, can you supply the mongo.log which covers the above timestamp so we can investigate further? You can reuse the provided support upload link here. Thanks! | |
| Comment by Yorman Mirabal [ 15/Mar/22 ] | |
|
Hello Chris, thank you for lending us a hand. Using the provided snippet, I just uploaded the files containing the diagnostic.data from the primary server of the replica set, along with the mongo.log file from the date of our last connection-issues event. Thank you very much, Chris; I hope the provided information helps to find a solution to our problem. |
| Comment by Chris Kelly [ 11/Mar/22 ] | |
|
Hi Yorman, Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Regards, |
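A sketch of the kind of archive being requested, assuming the default Linux package locations; check systemLog.path and storage.dbPath in /etc/mongod.conf for the real paths:

    # Paths below are assumptions based on a default install.
    tar czvf "mongodb-diagnostics-$(hostname).tar.gz" \
        /var/log/mongodb/mongod.log* \
        /var/lib/mongodb/diagnostic.data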