[SERVER-49816] mongod server crashed while mongodump was running Created: 22/Jul/20 Updated: 27/Oct/23 Resolved: 28/Jul/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability |
| Affects Version/s: | 4.0.13 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Dmitry Agranat |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
|
We are running a replicaSet consisting of 3 bare metal servers. The primary went down while mongodump was executing (daily backup).
We have saved mongod logs and diagnostic data from all 3 replSet members for your investigation. Here are the last lines of the mongod log, which do not seem very helpful, at least to us:
Both mongod and mongodump are of version 4.0.13. |
| Comments |
| Comment by Kay Agahd [ 03/Aug/20 ] | |||||||||||||||
|
Hi dmitry.agranat, stating "works as designed" is quite surprising when mongodump impacts performance so badly that the server is marked as DEAD by itself during the dump. | |||||||||||||||
| Comment by Dmitry Agranat [ 03/Aug/20 ] | |||||||||||||||
|
Hi kay.agahd@idealo.de, this ticket is in status "Closed" as we've determined the incident was related to your OS/HW issues. As for the latest issue, mongodump can impact the performance of your running database and this is expected. As mentioned earlier, there are other backup methods (including different approaches with mongodump) which you might find more suitable for your cluster. The SERVER project is for bugs and feature suggestions for the MongoDB server. If you need further assistance troubleshooting, I encourage you to ask our community by posting on the MongoDB Community Forums or on Stack Overflow with the mongodb tag. | |||||||||||||||
| Comment by Kay Agahd [ 03/Aug/20 ] | |||||||||||||||
|
Hi dmitry.agranat, we had the same issue on Sunday, 2020-08-02. We limited mongodump by adding the parameters:
However, mongodump produced so much load that we couldn't even connect to the primary. It failed with:
As soon as we were able to connect to the Primary, we stepped it down:
Database clients could then recover, but mongodump was interrupted and terminated, so we don't have a complete backup for that day. I've again uploaded log and diagnostic files for your investigation. | |||||||||||||||
| Comment by Kay Agahd [ 01/Aug/20 ] | |||||||||||||||
|
Hi dmitry.agranat, the same server again had severe problems while mongodump was running. This time, mongod did not crash, but it was so slow that it was declared DEAD and stepped down to Secondary. Unlike last time, there were no mongodump-related error messages in the syslog:
However, this time we have more verbose mongod logs than last time, but all I can see is that many queries were very slow when mongodump started (2020-08-01 10:05:01 Germany/Berlin). Did mongodump block them all? Other MongoDB replica sets with the same hardware and configuration don't have this problem when mongodump is running. mongodump was connected to mongo-india01-03.db00.pro07. I've uploaded all mongod log and diagnostic files of this server to the upload location associated with this Jira ticket. I'll upload the same for the other 2 MongoDB servers belonging to the same replSet. If you would prefer that I open a new ticket to follow up on this issue, please just tell me. Thanks for your investigation! | |||||||||||||||
| Comment by Dmitry Agranat [ 28/Jul/20 ] | |||||||||||||||
|
Sounds good kay.agahd@idealo.de, I will go ahead and close this case as "Works as expected", as we did not find any fault in MongoDB and you mentioned faulty infrastructure. If this happens again, apart from the usual data, please also grab the messages, syslog, and dmesg logs. Regards, | |||||||||||||||
| Comment by Kay Agahd [ 27/Jul/20 ] | |||||||||||||||
|
Hi Dima, thanks for your suggestions (all known already though). | |||||||||||||||
| Comment by Dmitry Agranat [ 27/Jul/20 ] | |||||||||||||||
|
For the sake of this investigation, let's start by not suppressing MongoDB logs (by removing the quiet:true option); we might need to increase the log level during this investigation based on the data we see in the unsuppressed logs. For the fresh data collected, please mention the timestamps of the mongodump start. As for the best backup approach for your cluster/workload, I invite you to ask our community by posting on the MongoDB Community Forums or on Stack Overflow with the mongodb tag. Just to mention a few points about mongodump which you might want to consider:
Thanks, | |||||||||||||||
| Comment by Kay Agahd [ 26/Jul/20 ] | |||||||||||||||
|
Hi Dima, our maximum network throughput is 100 Gbit/s. In practice, our backup server writes more than 600 MiB/sec to disk. Thanks for pointing out the quiet:true option. I think this was set because the logs grew too fast, especially due to connection attempts. We know that MongoDB allows setting different log levels for different components, but if we decrease some of them, those may turn out to be exactly the ones required for bug analysis. Can you suggest well-balanced settings for logging? Concerning mongodump, what else do you suggest for doing daily backups? Do you suggest only paid (cloud) services such as MongoDB Atlas, Cloud Manager or Ops Manager? If it is limited to these, MongoDB is developing in the wrong direction, as Oracle did at the time. Time to look for other alternatives then. | |||||||||||||||
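For readers following the point about per-component log levels: the sketch below only builds the command document for MongoDB's `logComponentVerbosity` server parameter, which can be passed to `db.adminCommand(...)` at runtime. It is not a recommendation from this ticket. The function name `build_verbosity_command` and the choice of which components to raise are assumptions made for illustration.

```python
import json

# Top-level component names documented for logComponentVerbosity.
KNOWN_COMPONENTS = {
    "accessControl", "command", "control", "ftdc", "geo", "index",
    "network", "query", "replication", "sharding", "storage", "write",
}


def build_verbosity_command(default_level=0, overrides=None):
    """Build a setParameter command document for logComponentVerbosity.

    Hypothetical helper: it only assembles the document and does not
    talk to a server. Levels are restricted here to 0..5.
    """
    overrides = overrides or {}
    unknown = set(overrides) - KNOWN_COMPONENTS
    if unknown:
        raise ValueError(f"unknown log components: {sorted(unknown)}")
    settings = {"verbosity": default_level}
    for component, level in sorted(overrides.items()):
        if not 0 <= level <= 5:
            raise ValueError(f"verbosity for {component} must be in 0..5")
        settings[component] = {"verbosity": level}
    # The command name has to be the first key; dicts keep insertion order.
    return {"setParameter": 1, "logComponentVerbosity": settings}


if __name__ == "__main__":
    # Assumed example: keep the default quiet-ish, raise only two components.
    cmd = build_verbosity_command(0, {"replication": 1, "command": 1})
    print(json.dumps(cmd, indent=2))
```

The printed document can be pasted into `db.adminCommand(...)` in the mongo shell. Components that are not listed in the overrides inherit the default level.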
| Comment by Dmitry Agranat [ 26/Jul/20 ] | |||||||||||||||
|
I had a look at the uploaded data. It looks like you have experienced some sort of a network issue during this event, for example:
Could you clarify what your maximum network throughput is? However, since you use the quiet:true option, it is not really possible to track down this issue. As per our documentation:
In addition, could you also explain why you are running mongodump as your backup method? By design, mongodump will push the working set out of memory, the same working set you will need to read back right away. | |||||||||||||||
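One way to keep the dump's reads away from the primary's cache is to point mongodump at the replica set and ask for a secondary read preference, with fewer collections dumped in parallel. The sketch below only assembles such a command line; it is an illustration under assumptions, not the approach used in this ticket. The replica set name `rs0`, the host names, and the output path are hypothetical, and the dump will still load whichever secondary serves it.

```python
import shlex


def build_mongodump_args(replset, hosts, out_dir, parallel_collections=1):
    """Assemble a mongodump command line that reads from a secondary.

    Hypothetical helper: it returns the argument list and runs nothing.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    if parallel_collections < 1:
        raise ValueError("parallel_collections must be >= 1")
    return [
        "mongodump",
        # Replica-set seed list in the form <replSetName>/<host1>,<host2>,...
        "--host", f"{replset}/{','.join(hosts)}",
        # Prefer a secondary so the primary's working set is not evicted.
        "--readPreference", "secondary",
        # mongodump dumps 4 collections in parallel by default.
        "--numParallelCollections", str(parallel_collections),
        "--out", out_dir,
    ]


if __name__ == "__main__":
    # All values below are placeholders, not taken from the ticket.
    args = build_mongodump_args(
        "rs0",
        ["db1.example.net:27017", "db2.example.net:27017"],
        "/backup/dump",
    )
    print(" ".join(shlex.quote(a) for a in args))
```

The returned list can be handed to `subprocess.run(args, check=True)` once the placeholders are replaced with real values.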
| Comment by Kay Agahd [ 23/Jul/20 ] | |||||||||||||||
|
Hi Dima, I uploaded mongod log and diagnostic files of all 3 servers to the upload location.
The hostname of the member where mongodump was running is mongo-india01-03. It was primary at the time. After the crash, mongo-india01-02 took over and is still primary. Thanks for your investigation. | |||||||||||||||
| Comment by Dmitry Agranat [ 23/Jul/20 ] | |||||||||||||||
|
I've created a secure upload portal for you. I have one question at this point: on which member was mongodump running? Please also mention the start time of mongodump and the timezone. Thanks, | |||||||||||||||
| Comment by Kay Agahd [ 22/Jul/20 ] | |||||||||||||||
|
It could be related to our backup server, which is an NFS-mounted device. I found this:
Source: https://helpful.knobs-dials.com/index.php/INFO:_task_blocked_for_more_than_120_seconds. However, this should not crash the server, should it? |