[SERVER-45034] [ftdc] serverStatus was very slow Created: 09/Dec/19 Updated: 11/Feb/20 Resolved: 11/Feb/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 4.0.9 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Afonso Rodrigues | Assignee: | Dmitry Agranat |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Backwards Compatibility: | Fully Compatible |
| Participants: |
| Description |

I receive this message on my MongoDB server: "[ftdc] serverStatus was very slow", but I cannot identify the root cause.
| Comments |
| Comment by Dmitry Agranat [ 11/Feb/20 ] |

Hi afonso.rodrigues@maxmilhas.com.br, We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards,
| Comment by Dmitry Agranat [ 27/Jan/20 ] |

Hi afonso.rodrigues@maxmilhas.com.br, We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide the requested information? Thanks,
| Comment by Dmitry Agranat [ 14/Jan/20 ] |

Hi afonso.rodrigues@maxmilhas.com.br, we will need to collect a fresh set of data. After investigating the latest set, I've noticed that you've increased the number of read and write WiredTiger tickets from the default 128 to 5000. In general, we do not recommend changing the number of tickets, as it does not remove any bottleneck. In this specific workload, I suspect this change might actually be causing additional bottlenecks we had not seen before. Please set these two back to their default (128) on the server where data and journal are separated, let it run for a couple of days, and re-upload the usual set of data. Thanks,
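For illustration, a minimal sketch of how the two ticket settings could be reset at runtime, assuming they were changed via setParameter; if they are set in mongod.conf instead, adjust them there and restart. A local mongod on the default port is assumed.

```
# Reset the WiredTiger concurrency tickets to their default of 128 on the running node
mongo --eval 'db.adminCommand({setParameter: 1, wiredTigerConcurrentReadTransactions: 128})'
mongo --eval 'db.adminCommand({setParameter: 1, wiredTigerConcurrentWriteTransactions: 128})'
```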
| Comment by Dmitry Agranat [ 13/Jan/20 ] |

Thanks afonso.rodrigues@maxmilhas.com.br, I am going to look into the latest data soon.
| Comment by Afonso Rodrigues [ 09/Jan/20 ] |

Hi Dima, thanks for your question. After changing the server, I received new incidents. I switched the server as a troubleshooting measure, but it is not ideal. I have three servers that the application switches between, but I will send the files only for the server with separate devices for log, journal, and data.
| Comment by Dmitry Agranat [ 07/Jan/20 ] |

Hi afonso.rodrigues@maxmilhas.com.br, did you notice new occurrences of "serverStatus was very slow" events after separating data and journal? Could you upload the fresh set of diagnostic.data for us to review? The graphs here are being generated by our internal tools.
| Comment by Afonso Rodrigues [ 03/Jan/20 ] |

Hi Dima, I modified another instance to separate the data and journal storage. I mounted the filesystems on separate devices, like this:
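For illustration only, a sketch of what such a layout might look like in /etc/fstab; the device names and mount points are placeholders, not the actual values from this server.

```
# data, journal and log directories on separate devices
/dev/nvme1n1  /data/db           xfs  defaults,noatime  0 0
/dev/nvme2n1  /data/db/journal   xfs  defaults,noatime  0 0
/dev/nvme3n1  /var/log/mongodb   xfs  defaults,noatime  0 0
```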
I will observe my workload for a few days. Also, how do you create the graphics below?
| Comment by Dmitry Agranat [ 02/Jan/20 ] |

Hi afonso.rodrigues@maxmilhas.com.br, thanks for the update. After profiling was disabled, we are now able to see the next bottleneck. This bottleneck is related to co-locating data and journal on the same device, which we do not recommend for your access pattern.
Separating data and journal into different storage devices should address this bottleneck. Thanks,
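For reference, one common way to do this separation is to mount a dedicated device and symlink the journal directory to it, since mongod keeps its journal in a "journal" subdirectory of the dbpath. The sketch below assumes hypothetical paths (/data/db as the dbpath, /journal as the new mount) and a systemd-managed mongod.

```
# Stop mongod, move the journal onto the dedicated device, and symlink it back
sudo systemctl stop mongod
sudo mv /data/db/journal/* /journal/ 2>/dev/null || true
sudo rm -rf /data/db/journal
sudo ln -s /journal /data/db/journal
sudo chown -R mongod:mongod /journal
sudo chown -h mongod:mongod /data/db/journal
sudo systemctl start mongod
```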
| Comment by Afonso Rodrigues [ 02/Jan/20 ] |

Hi Dima, I uploaded the files with the information. Thanks so much.
| Comment by Afonso Rodrigues [ 02/Jan/20 ] |

Hi Dima, happy new year. I will upload the diagnostic.data.
| Comment by Dmitry Agranat [ 02/Jan/20 ] |

Hi afonso.rodrigues@maxmilhas.com.br, can you update us on the progress and upload the diagnostic.data for us to review the current status?
| Comment by Afonso Rodrigues [ 24/Dec/19 ] |

Hi Dima, thanks for your update. I will wait a few days and then send the new metrics. Thanks so much.
| Comment by Dmitry Agranat [ 24/Dec/19 ] |

Hi afonso.rodrigues@maxmilhas.com.br, after checking the uploaded data, the profiling was still on during both mentioned incidents up until the process restart. After the restart, the profiling shows as off and no incidents were observed. Please let it run for a few days and upload the fresh set of data.
| Comment by Afonso Rodrigues [ 23/Dec/19 ] |

Hi Dima, unfortunately there were two incidents today. I am sending the current metrics, but I don't have the perf metrics. Thanks so much.
| Comment by Afonso Rodrigues [ 23/Dec/19 ] |

Hi Dima, thanks for your update. I changed the value on 23/12/2019 at 10:55 UTC.
I will wait a few days and then send the new metrics. Thanks so much.
| Comment by Dmitry Agranat [ 23/Dec/19 ] |

Hi afonso.rodrigues@maxmilhas.com.br, to disable profiling, use the following helper in the mongo shell:
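A minimal sketch using the standard db.setProfilingLevel() shell helper; "mydb" is a placeholder database name and a local mongod on the default port is assumed.

```
# Disable the profiler on each database where it is enabled
mongo mydb --eval 'db.setProfilingLevel(0)'
```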
To view the profiling level, issue the following from the mongo shell:
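Likewise a sketch, using the standard db.getProfilingStatus() helper; "mydb" is again a placeholder.

```
# Show the current profiling level, slowms and sampleRate
mongo mydb --quiet --eval 'printjson(db.getProfilingStatus())'
```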
Example output:
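With profiling disabled, typical output looks like the following (the exact values depend on your settings):

```
{ "was" : 0, "slowms" : 100, "sampleRate" : 1 }
```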
The was field indicates the current profiling level, so if it shows 0, profiling is disabled. You can review our documentation to see how to change other parameters, such as slowms and sampleRate. Thanks,
| Comment by Afonso Rodrigues [ 22/Dec/19 ] |

Hi Dima, I am testing the configuration to stop the profiling on another instance. This is the current configuration on my server:
Below is the test using a server with the same version, 4.0.9:
The same query is logged in the log file whether the operationProfiling mode is slowOp or off:
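A hypothetical mongod.conf excerpt of the operationProfiling section being discussed; the values are placeholders. Note that even with mode: off, mongod still writes operations slower than slowOpThresholdMs to its log, which is consistent with seeing the same query in the log file in both modes.

```
operationProfiling:
  mode: off
  slowOpThresholdMs: 100
```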
| Comment by Dmitry Agranat [ 22/Dec/19 ] |

Thanks afonso.rodrigues@maxmilhas.com.br for uploading all the collected perf data. We suspect the erratic behavior you've reported is related to the fact that the profiler is enabled all the time in your system. Specifically, this happens when the system.profile collection needs to be truncated (since it's a capped collection). Can you disable profiling, let the workload run for a few days and upload a fresh set of diagnostic.data? Thanks,
| Comment by Afonso Rodrigues [ 20/Dec/19 ] |

Hi Dima, I uploaded the file collect-20191219.tar.gz with the files from 20191219. Thanks so much.
| Comment by Afonso Rodrigues [ 20/Dec/19 ] |

Hi Dima, fortunately I have not had a critical incident in the last few days. As my latest attempt to work around the problem, I scheduled a restart of the server every 4 hours; this was not effective. But on 17/12/2019 I added an anti-bot WAF to my application layer. I will upload the files from 19/12/2019, including the perf files and the mongod files. Thanks
| Comment by Afonso Rodrigues [ 18/Dec/19 ] |

Hi Dima, I started collecting the information from perf, but in my session I received these messages:
The collection is active. I am waiting for the next incident and will then send the files to you.
| Comment by Afonso Rodrigues [ 18/Dec/19 ] |

Hi Dima, thanks for your answer. I will collect the information.
| Comment by Dmitry Agranat [ 18/Dec/19 ] |

Thanks afonso.rodrigues@maxmilhas.com.br, the information provided was very helpful. In order for us to better understand what's going on, we'd like to collect some perf call stacks. This will require installing the Linux perf tool. Execute this during your typical workload for an hour or two. It is important to catch at least several events when "serverStatus" is reported as slow. This command will capture call stack samples in separate files of 60 seconds each:
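A sketch of the kind of capture loop being described; the sampling frequency, total duration, and file naming below are assumptions, not the exact command from this comment.

```
# Capture system-wide call stacks in consecutive 60-second files (~2 hours total)
for i in $(seq 1 120); do
    sudo perf record -a -g -F 99 -o perf.data.$(date +%Y%m%d-%H%M%S) -- sleep 60
done
```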
Then run perf script on the subset of files of interest:
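Sketch only; the file pattern is a placeholder for the chunks that overlap the slow serverStatus events.

```
# Symbolize the interesting samples into .txt files on the same host
for f in perf.data.20191218-10*; do
    sudo perf script -i "$f" > "$f.txt"
done
```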
Once completed, please upload all of the generated .txt files and a fresh archive of the diagnostic.data and mongod log. Note that it is important to run perf script on the same node where perf.data was generated so that it can be correctly symbolized using the addresses on that machine. Thanks,
| Comment by Afonso Rodrigues [ 17/Dec/19 ] |

Hi Dima, the instance type is an i3.16xlarge.
| Comment by Dmitry Agranat [ 17/Dec/19 ] |

Thanks afonso.rodrigues@maxmilhas.com.br, I'll have a look at the latest information and will report back with my findings. Could you clarify what AWS instance type you are using for the mm-mongodb-availability-02-2 node?
| Comment by Afonso Rodrigues [ 16/Dec/19 ] |

Hi Dima, is the file collect-20191212.tar.gz the correct one? My apologies for the last message; I did not mention when I changed the TCP memory setting. Thanks
| Comment by Afonso Rodrigues [ 13/Dec/19 ] |

Hi Dima, my environment is a single instance. I uploaded the file collect-20191212.tar.gz with the latest metrics and files. My timezone is UTC-03, but the timezone on my server is UTC. I'm not using a replica set because my server persists only ephemeral data with a TTL of 1 hour.
| Comment by Dmitry Agranat [ 12/Dec/19 ] |

Thanks for the update afonso.rodrigues@maxmilhas.com.br. No need to manually collect serverStatus. Please upload when possible:
Thanks,
| Comment by Afonso Rodrigues [ 12/Dec/19 ] |

Hi Dima, thanks for your answer. I changed the value of the parameter "net.ipv4.tcp_mem" from "65536 131072 262144" to "5888262 7851019 11776524". I took the value from another machine running Amazon Linux with the same instance type (comparison attached). After changing the value, the error "TCP: out of memory – consider tuning tcp_mem" has not been received again.
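For reference, a sketch of how such a change is typically applied and persisted; the file name under /etc/sysctl.d is a placeholder.

```
# Apply the new tcp_mem values at runtime and persist them across reboots
sudo sysctl -w net.ipv4.tcp_mem="5888262 7851019 11776524"
echo 'net.ipv4.tcp_mem = 5888262 7851019 11776524' | sudo tee /etc/sysctl.d/90-tcp-mem.conf
```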
However, the workaround for the problem was to restart all the applications; there was no need to restart the mongod process. I will attach the log files later in the day. Would it be useful for me to set up a process that collects serverStatus and lock/connection information? It would be these commands:
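A sketch of the kind of collection loop being asked about; the interval, output paths, and connection details are placeholders.

```
# Dump serverStatus and currentOp (including idle connections) every 30 seconds
while true; do
    ts=$(date +%Y%m%d-%H%M%S)
    mongo --quiet --eval 'printjson(db.serverStatus())' > serverStatus.$ts.json
    mongo --quiet --eval 'printjson(db.currentOp(true))' > currentOp.$ts.json
    sleep 30
done
```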
Thank you so much.
| Comment by Dmitry Agranat [ 11/Dec/19 ] |

Hi afonso.rodrigues@maxmilhas.com.br, thank you for uploading all the requested information, it was very helpful. I've correlated the occurrences of "serverStatus was very slow" messages with the events on the server. There is a moderately high rate of packet loss: up to 20-30 RetransSegs, TCPLossProbes, and DelayedACKLost per second, so I suspect that the network packet loss could be the reason for this issue.
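For reference, a sketch of how these counters can be watched on the host; exact counter names can differ slightly between kernel versions.

```
# Print the retransmission-related counters once per second
watch -n 1 "nstat -az | egrep 'RetransSegs|TCPLossProbes|DelayedACKLost'"
```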
I am not sure whether you are simply hitting the limit of your network bandwidth, whether those "TCP: out of memory" errors are involved (unfortunately we do not have the human-readable dmesg.log after Dec 10 03:24 to correlate with the mongod.log, which only starts at Dec 10 10:50), or whether both are related to this issue. Could you investigate these network issues and come back with your findings? I am eager to re-evaluate these slow messages once the network issue is addressed. Thanks,
| Comment by Afonso Rodrigues [ 11/Dec/19 ] |

Thank you so much for waiting.
| Comment by Afonso Rodrigues [ 10/Dec/19 ] |

I will upload the diagnostic.data file later.
| Comment by Afonso Rodrigues [ 10/Dec/19 ] |

I am collecting the data and uploading it to the portal 10gen-httpsupload.s3.amazonaws.com.
| Comment by Afonso Rodrigues [ 10/Dec/19 ] |

Hi Dima, I cannot tell whether my MongoDB problem is a bug or a performance issue. The symptoms I see: at the moment of the incident, the workload on the database does not change, and in mongostat I see the columns "insert query update delete" showing *0. I attached the graphs for the incident on 08/12 between 10:00 and 10:44 UTC-03.
| Comment by Dmitry Agranat [ 10/Dec/19 ] |

Hi afonso.rodrigues@maxmilhas.com.br, In order to better understand what's going on we'll need to collect some additional information. Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? In addition, please upload the dmesg, syslog and messages logs. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Thanks,
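A sketch of one way to package the requested files before uploading; the paths below are placeholders for your deployment.

```
# Archive mongod logs, diagnostic.data and the relevant system logs
tar czf mongodb-SERVER-45034-$(hostname)-$(date +%Y%m%d).tar.gz \
    /var/log/mongodb/mongod.log* \
    /data/db/diagnostic.data \
    /var/log/dmesg /var/log/syslog /var/log/messages
```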
| Comment by Dmitry Agranat [ 10/Dec/19 ] |

Hi afonso.rodrigues@maxmilhas.com.br, As per our documentation, this is the time, in microseconds, since the database last started and created the globalLock. It is roughly equivalent to total server uptime. Could you please elaborate on what MongoDB bug you might be pointing to in this ticket? Thanks,
| Comment by Afonso Rodrigues [ 10/Dec/19 ] |

Hi Dima, after many operations produce the message "serverStatus was very slow", will operations accumulate on my server and cause them to stack up? This example is the first "serverStatus was very slow" message after restarting mongodb:
Does the number in "after globalLock: 7420" refer to the number of documents impacted or locked? My version:
| Comment by Dmitry Agranat [ 10/Dec/19 ] |

Hi afonso.rodrigues@maxmilhas.com.br, This is not an error; the serverStatus command is run at a regular interval to collect statistics about the instance. If the response to this command is slow, we log the output listing all the important MongoDB sub-components and the time each took. In your example, the first non-zero component is globalLock.
You can correlate operations inside the mongod.log executed at this time to understand which operations were responsible for taking a globalLock. I would definitely suggest addressing this device-driver-level error:
I am not sure there is a relationship between the "TCP: out of memory" errors and the outputs of serverStatus, as the timestamps of the two do not match. Thanks,
| Comment by Afonso Rodrigues [ 09/Dec/19 ] |

What are the important files for the analysis? How can I find the root cause of "[ftdc] serverStatus was very slow"?
Thanks
| Comment by Afonso Rodrigues [ 09/Dec/19 ] |

My server is a single instance with disks in RAID 0 for performance.
In my mongod.log I receive these messages:
After restarting the server, the problem is solved temporarily.