[SERVER-49816] mongod server crashed while mongodump was running Created: 22/Jul/20  Updated: 27/Oct/23  Resolved: 28/Jul/20

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 4.0.13
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Dmitry Agranat
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

We are running a replica set consisting of 3 bare-metal servers. The primary went down while mongodump was executing (our daily backup).
The mongodump command was as follows, where $HOSTNAME stands for the FQDN of the primary:

mongodump -h $HOSTNAME --port $PORT -u${USER} -p${PASS} --authenticationDatabase=$AUTHDATABASE -o $BACKUPDIR 

We have saved mongod logs and diagnostic data from all 3 replSet members for your investigation.
Where can we upload them?

Here are the last lines of the mongod log, which seem not very helpful, at least to us:

2020-07-22T18:53:55.193+0200 I COMMAND  [conn56968] command ipc-catalog.productOfferInfo command: find { find: "productOfferInfo", filter: { _id: { _id: 3127816, locale: "de_DE" } }, limit: 1, singleBatch: true, $db: "ipc-catalog", $clusterTime: { clusterTime: Timestamp(1595436345, 1), signature: { hash: BinData(0, C48B0FBAAB6D9151CD788358E80382121C11A9E5), keyId: 6849703745916239873 } }, lsid: { id: UUID("629864ef-f24d-4147-89dd-a6c24254de42") }, $readPreference: { mode: "nearest" } } planSummary: IDHACK numYields:0 ok:0 errMsg:"Executor error during find command :: caused by :: operation was interrupted" errName:InterruptedDueToReplStateChange errCode:11602 reslen:308 locks:{ Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_msg 460023ms
2020-07-22T18:53:55.193+0200 I REPL     [replexec-1968] Member mongo-india01-02.db00.pro06.eu.idealo.com:27017 is now in state RS_DOWN
2020-07-22T22:06:29.715+0200 I CONTROL  [main] ***** SERVER RESTARTED *****

Both mongod and mongodump are of version 4.0.13.
Each server is running Debian Stretch and has 56 CPUs, 384 GB RAM, and 5.8 TB SSD in RAID 10.



 Comments   
Comment by Kay Agahd [ 03/Aug/20 ]

Hi dmitry.agranat,

stating "works as designed" is quite surprising when mongodump impacts performance so badly that the server marks itself as DEAD during the dump.
But if mongodb.org is not willing to fix it, or even to investigate it (because you probably want to sell your expensive backup solution), I don't see a reason to insist any longer.

Comment by Dmitry Agranat [ 03/Aug/20 ]

Hi kay.agahd@idealo.de, this ticket is in status "Closed" as we've determined the incident was related to OS/HW issues on your side. As for the latest issue, mongodump can impact the performance of your running database, and this is expected. As mentioned earlier, there are other backup methods (including different approaches with mongodump) which you might find more suitable for your cluster.

The SERVER project is for bugs and feature suggestions for the MongoDB server. If you need further assistance troubleshooting, I encourage you to ask our community by posting on the MongoDB Community Forums or on Stack Overflow with the mongodb tag.

Comment by Kay Agahd [ 03/Aug/20 ]

Hi dmitry.agranat,

we had the same issue on Sunday, 2020-08-02. We limited mongodump's impact by adding the parameters:

  • --numParallelCollections 1
  • --forceTableScan
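
Combined with the invocation from the description, the throttled command would have looked roughly like this (a sketch using the same placeholder variables as above, not a verbatim copy from the thread):

```shell
# Sketch of the throttled dump. --numParallelCollections 1 dumps one
# collection at a time instead of four; --forceTableScan reads documents
# in natural (storage) order rather than scanning by _id.
mongodump -h "$HOSTNAME" --port "$PORT" \
  -u "${USER}" -p "${PASS}" \
  --authenticationDatabase="$AUTHDATABASE" \
  --numParallelCollections 1 \
  --forceTableScan \
  -o "$BACKUPDIR"
```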

However, mongodump produced so much load that we couldn't even connect to the primary. Connection attempts failed with:

 PRO [10:59][root@mongo-india01-02.db00.pro06:/home/enrico.rabeneck]# ✘ mongo
MongoDB shell version v4.0.13
connecting to: mongodb://127.0.0.1:27017/?gssapiServiceName=mongodb
2020-08-02T10:59:37.278+0200 E QUERY    [js] Error: couldn't connect to server 127.0.0.1:27017, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27017 :: caused by :: Connection timed out :
connect@src/mongo/shell/mongo.js:344:17
@(connect):2:6
exception: connect failed

As soon as we were able to connect to the Primary, we stepped it down:

india01:PRIMARY> rs.printSlaveReplicationInfo()
source: mongo-india01-01.db00.pro05.eu.idealo.com:27017
	syncedTo: Sun Aug 02 2020 11:06:48 GMT+0200 (CEST)
	1 secs (0 hrs) behind the primary 
source: mongo-india01-03.db00.pro07.eu.idealo.com:27017
	syncedTo: Sun Aug 02 2020 11:06:49 GMT+0200 (CEST)
	0 secs (0 hrs) behind the primary 
india01:PRIMARY> rs.stepDown()


Database clients could then recover, but mongodump was interrupted and terminated, so we don't have a complete backup for that day.

I've uploaded again log and diagnostic files for your investigation.
Due to these performance issues, we are unable to back up this replica set.
Any help is greatly appreciated.

Comment by Kay Agahd [ 01/Aug/20 ]

Hi dmitry.agranat,

the same server had severe problems again while mongodump was running. This time, mongod did not crash, but it was so slow that it was declared DEAD and stepped down to secondary. There were no mongodump-related error messages in the syslog this time, unlike last time:

Jul 22 18:56:19 mongo-india01-03 kernel: [23647675.428924] INFO: task mongodump:274200 blocked for more than 120 seconds.
Jul 22 18:56:19 mongo-india01-03 kernel: [23647675.428945]       Not tainted 4.9.0-11-amd64 #1 Debian 4.9.189-3
Jul 22 18:56:19 mongo-india01-03 kernel: [23647675.428959] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 22 18:56:19 mongo-india01-03 kernel: [23647675.428976] mongodump       D    0 274200 273730 0x00000080


However, this time we have more verbose mongod logs than last time, but all I see is that many queries became very slow when mongodump started (2020-08-01 10:05:01, Germany/Berlin). Did mongodump block them all? Other MongoDB replica sets with the same hardware and configuration don't have this problem while mongodump is running. mongodump was connected to mongo-india01-03.db00.pro07.

I've uploaded all mongod log and diagnostic files from this server to the upload location associated with this Jira ticket. I'll upload the same for the other two MongoDB servers belonging to the same replica set.

If you prefer that I open a new ticket to follow up on this issue, please just tell me.

Thanks for your investigation!

Comment by Dmitry Agranat [ 28/Jul/20 ]

Sounds good kay.agahd@idealo.de, I will go ahead and close this case as "Works as expected", as we did not find any fault in MongoDB and you mentioned faulty infrastructure.

If this happens again, apart from the usual data, please also grab the messages, syslog, and dmesg logs.

Regards,
Dima

Comment by Kay Agahd [ 27/Jul/20 ]

Hi Dima,

thanks for your suggestions (all already known to us, though).
So I guess your investigation ends here due to the low log level. I understand; it's not a problem.
I think the reason for the crash was our NFS-mounted backup server, which did not respond for more than 120 seconds. However, this should not crash mongod, nor the server itself. Let's assume this was an exception that shouldn't have happened.
If it happens again, we'll have more verbose logs to let you investigate the case.

Comment by Dmitry Agranat [ 27/Jul/20 ]

Hi kay.agahd@idealo.de,

For the sake of this investigation, let's start by not suppressing MongoDB logs (remove the quiet:true option). We might need to increase the log level during this investigation, based on what we see in the unsuppressed logs. For the freshly collected data, please mention the timestamps at which mongodump started.
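
If restarting mongod to change the config file is undesirable, the quiet flag can also be toggled at runtime via setParameter (a sketch, not from the thread; credentials reuse the placeholders from the mongodump command in the description):

```shell
# Sketch: disable log suppression at runtime, without a mongod restart.
# "quiet" is a documented runtime parameter in MongoDB 4.0.
mongo --port "$PORT" -u "${USER}" -p "${PASS}" \
  --authenticationDatabase "$AUTHDATABASE" \
  --eval 'db.adminCommand({ setParameter: 1, quiet: false })'
```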

As for the best backup approach for your cluster/workload, I invite you to ask our community by posting on the MongoDB Community Forums or on Stack Overflow with the mongodb tag. Just to mention a few points about mongodump that you might want to consider:

  • mongodump will be efficient for backing up small data sets and is not ideal for capturing backups of larger systems (wherein data and indexes do not fit entirely into memory).
  • mongodump and mongorestore operate by interacting with a running mongod instance, hence, they can impact the performance of your running database.
  • mongodump only captures the documents in the database in its backup data and does not include index data. mongorestore requires that mongod then rebuild the indexes after restoring data.
  • If there is any activity on the database between when the mongodump and the mongorestore are performed, the new updates to the mongod will need to be accounted for on the new setup.
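
One variation along these lines (a sketch, not a recommendation from this thread; $SECONDARY_HOST is a hypothetical placeholder for a secondary member's FQDN) is to point mongodump at a secondary so that the working-set eviction hits a non-primary node:

```shell
# Sketch: dump from a secondary instead of the primary, so cache
# pressure from the dump does not affect the node serving writes.
# $SECONDARY_HOST is illustrative; other placeholders follow the
# original command from the description.
mongodump -h "$SECONDARY_HOST" --port "$PORT" \
  -u "${USER}" -p "${PASS}" \
  --authenticationDatabase="$AUTHDATABASE" \
  -o "$BACKUPDIR"
```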

Thanks,
Dima

Comment by Kay Agahd [ 26/Jul/20 ]

Hi Dima,

our network's maximum throughput is 100 Gbit/s. In practice, our backup server writes more than 600 MiB/s to disk.

Thanks for pointing out the quiet:true option. I think it was set because the logs grew too fast, especially due to connection attempts. We know that MongoDB allows setting different log levels for different components, but if we decrease some of them, it may be exactly those that are required for bug analysis. Can you suggest well-balanced logging settings?
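
One possible balance (a sketch; the component choices here are illustrative, not a recommendation from this thread) is to leave the global verbosity at its default of 0 and raise only selected components at runtime from the mongo shell:

```shell
# Sketch: raise verbosity to Debug-1 for the components most relevant
# to this investigation, leaving everything else at the default level 0.
# db.setLogLevel(level, component) takes standard component names such
# as "replication", "network", or "command".
mongo --port "$PORT" -u "${USER}" -p "${PASS}" \
  --authenticationDatabase "$AUTHDATABASE" \
  --eval 'db.setLogLevel(1, "replication"); db.setLogLevel(1, "network");'
```

The per-component levels can be reset to 0 the same way once enough data has been collected, keeping the log growth bounded.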

Concerning mongodump: what else do you suggest for daily backups? Do you suggest only paid (cloud) services such as MongoDB Atlas, Cloud Manager, or Ops Manager? If it comes down to these, MongoDB is developing in the wrong direction, as Oracle did back in the day. Time to look for alternatives then.

Comment by Dmitry Agranat [ 26/Jul/20 ]

Hi kay.agahd@idealo.de,

I had a look at the uploaded data. It looks like you have experienced some sort of a network issue during this event, for example:

Error sending response to client: SocketException: Broken pipe. Ending connection from

Could you clarify what is your network maximum throughput?

However, since you use the quiet:true option, it is not really possible to track down this issue. As per our documentation:

systemLog.quiet is not recommended for production systems as it may make tracking problems during particular connections much more difficult.

In addition, could you also elaborate on why you are running mongodump as your backup method? By design, mongodump will push the working set out of memory, the very working set you will need to read back right away.

Comment by Kay Agahd [ 23/Jul/20 ]

Hi Dima,

I uploaded the mongod log and diagnostic files of all 3 servers to the upload location.
mongodump started at 18:45, timezone Berlin/Germany. I think this is the first log entry related to mongodump:

2020-07-22T18:45:02.401+0200 I COMMAND  [conn57730] command ipc-catalog.product command: getMore { getMore: 139689230911, collection: "product", $db: "ipc-catalog" } originatingCommand: { find: "product", hint: { _id: 1 }, skip: 0, $readPreference: { mode: "secondaryPreferred" }, $db: "ipc-catalog" } planSummary: IXSCAN { _id: 1 } cursorid:139689230911 keysExamined:8946 docsExamined:8946 numYields:69 nreturned:8945 reslen:16776504 locks:{ Global: { acquireCount: { r: 70 } }, Database: { acquireCount: { r: 70 } }, Collection: { acquireCount: { r: 70 } } } storage:{ data: { bytesRead: 107712716, timeReadingMicros: 202868 } } protocol:op_query 254ms


The hostname of the member where mongodump was running is mongo-india01-03. It was the primary at the time.
After the crash, mongo-india01-02 took over and is still primary.
Thanks for your investigation.

Comment by Dmitry Agranat [ 23/Jul/20 ]

Hi kay.agahd@idealo.de,

I've created a secure upload portal for you.

I have one question at this point: on which member was mongodump running? Please also mention the start time of mongodump, including the timezone.

Thanks,
Dima

Comment by Kay Agahd [ 22/Jul/20 ]

It could be related to our backup server, which is an NFS-mounted device. I found this:

NFS seems to be a common culprit, probably because it's good at filling the writeback cache, something which implies blocking while writeback happens - which is likely to block various things related to the same filesystem.

Source: https://helpful.knobs-dials.com/index.php/INFO:_task_blocked_for_more_than_120_seconds.

However, this should not crash the server, should it?
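
If the writeback cache is indeed the culprit, one common mitigation (an editorial sketch, not a suggestion from this thread) is to cap how much dirty page cache can accumulate before the kernel starts flushing, so writeback to the NFS mount happens in smaller, earlier bursts rather than one long stall:

```shell
# Sketch: cap dirty page cache with absolute byte limits. On a 384 GB
# machine, the default percentage-based limits allow tens of gigabytes
# of dirty data to pile up before a flush. Values are illustrative and
# need tuning per machine; run as root.
sysctl -w vm.dirty_background_bytes=$((256 * 1024 * 1024))  # start async writeback at 256 MiB
sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024))            # throttle writers at 1 GiB dirty
```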

Generated at Thu Feb 08 05:20:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.