[SERVER-25760] Mongodump taking extraordinarily long, utilizing almost zero resources, yet slowing down server Created: 23/Aug/16  Updated: 01/Feb/17  Resolved: 01/Feb/17

Status: Closed
Project: Core Server
Component/s: Performance, Tools
Affects Version/s: 3.2.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chad Kreimendahl Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File MongoDumpCausingMongodMemoryConsumption.png    
Issue Links:
Duplicate
duplicates SERVER-25070 Significant delays when large collect... Closed
Operating System: ALL
Steps To Reproduce:

MongoDB in 3x replicaset, with standard configuration.

From the 3rd system (SECONDARY):
mongodump -v -j 24 -o /backups/mongoRepl1

 Description   

When performing a basic mongodump on our database (invocation shown in Steps To Reproduce) on high-performance hardware, we're seeing very long delays between writes and almost no resource utilization on the system, yet significantly increased response times for even the simplest commands.

CPU utilization remains the same as other systems. iostat shows nearly no activity, similar to other systems. Network load remains the same.

Dumping across NFS takes just as long and shows the prolonged gaps in activity even more clearly: you will see random peaks of full bandwidth usage lasting a few seconds, followed by several seconds, sometimes minutes, of zero activity.
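
For illustration only (not part of the original report), the stall pattern described above could be observed by sampling the secondary once per second while the dump runs; the host name below is a placeholder:

mongostat --host secondary3.example.local 1    # server-side queues, flushes, network traffic
iostat -x 1                                    # per-device I/O utilization
sar -n DEV 1                                   # network throughput over time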



 Comments   
Comment by Kelsey Schubert [ 01/Feb/17 ]

Hi sallgeud,

Thanks for confirming you haven't encountered SERVER-25070 since upgrading to MongoDB 3.2.10. My understanding is that this issue has been resolved as well, so I'm updating the ticket. Please let me know if that isn't the case.

Thank you,
Thomas

Comment by Kelsey Schubert [ 04/Oct/16 ]

Hi sallgeud,

Thanks for the additional information. I'm going to put this ticket on hold, pending our investigation of SERVER-25070 since it is likely the root cause. Please note that MongoDB 3.2.10 is available and ready for download.

Let us know if it resolves the issues you have been observing.

Thank you,
Thomas

Comment by Chad Kreimendahl [ 21/Sep/16 ]

1. We have a monitoring tool that outputs it all to a pretty dashboard for us.
2. I've not yet tried this, but can.
3. We actually already do mongodump from two places, so that we have not only redundant data centers, but redundant recovery in each. We see the same issue in both. One is a regular secondary, the other is a hidden secondary. The hidden has half the memory, and is the one that runs into trouble more often. Since updating the hidden one to 3.2.9, we haven't seen it crash. However, we also went away from "-j 8" and just left it as a regular backup with no extra arguments.
4. For this specific deployment: approximately 44 million objects across 92 databases, totaling about 90GB of uncompressed data plus indexes.
a) We typically try to keep the databases sized so that we can fit the entirety of customer data + indexes in memory. So depending on the database, that's between 100GB and 500GB of data.

Comment by Kelsey Schubert [ 06/Sep/16 ]

Hi sallgeud,

Thank you for the information, I have some follow up questions to help us continue to investigate:

  1. How are you measuring memory usage?
  2. Do you observe the same issue if you dump a single database?
  3. Are you able to execute mongodump against the secondary from a different machine?
  4. Do you have a rough estimate of total number of documents, and total size of your documents?
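
For reference, and purely as illustration (the database and host names below are placeholders, not from this ticket), questions 2 and 3 could be tested with invocations along these lines:

mongodump -v -d someCustomerDb -o /backups/singleDbTest                              # single database
mongodump -v --host secondary3.example.local --port 27017 -o /backups/remoteTest     # from a different machine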

Thanks again,
Thomas

Comment by Chad Kreimendahl [ 28/Aug/16 ]

I can use mongodump to create the scenario you see in SERVER-25663.

Of note: the system on which mongodump is running starts out using a normal amount of memory (~55% of the 64GB total), and although it continues to report only 55% usage and no other processes are running, more than 100% of physical memory ends up needing to be allocated, resulting in swap usage.

I'm curious whether this may be the underlying issue. Ultimately we're swapping on ultra-fast NVMe SSD, but I'm not sure how exactly we ended up swapping.
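
(Illustrative only, not from the original comment: the OS-level and mongod-level memory figures discussed above could be sampled during a dump with commands like the following.)

free -m                                                   # overall memory and swap usage
vmstat 5                                                  # si/so columns show swap-in/swap-out activity
mongo --quiet --eval 'printjson(db.serverStatus().mem)'   # mongod resident/virtual memory in MB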

I've got a great chart from our monitoring tools that shows it happening on the first backup post-restart.

Comment by Kelsey Schubert [ 26/Aug/16 ]

Hi sallgeud,

Would you please clarify whether this is the same issue that we are investigating in SERVER-25663?

Thank you,
Thomas

Comment by Chad Kreimendahl [ 23/Aug/16 ]

I forgot to mention in my description:

While the backup is running, simple commands such as "show dbs" take 10-40 times longer to respond.

Regardless of the threading setting, the same results are seen; we tested -j 1, -j 4, -j 8, -j 16, and -j 24.
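
(Illustrative only, not from the original comment: the slowdown on simple commands could be quantified by timing the equivalent listDatabases command while a dump is running.)

time mongo --quiet --eval 'printjson(db.adminCommand({listDatabases: 1}))'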
