[SERVER-27153] Mongodump on secondaries causes performance issue with w:majority Created: 22/Nov/16 Updated: 21/Feb/17 Resolved: 21/Feb/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.2.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Luke Morfitt | Assignee: | Kelsey Schubert |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Production |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Hello,

We are experiencing performance issues on the master when mongodump backs up on the secondaries. Our applications all perform reads/writes against the master; the slaves are only used for replication and backups.

Our backups run at 8am / 8pm every day and can take a considerable amount of time on the slaves. Backups are performed with the mongodump command, one invocation for each DB on the cluster. Stopping the mongodump process on the secondaries resolves the master's performance issues.

So far we have seen this issue on two of our Mongo clusters; the details below are for one of them. We usually run a larger master than the slaves to reduce cost, as follows:

gsp-aeu001-mo04 (M) r3.2xlarge

At a high level we can see that at 8am/pm the number of commands/queries/updates etc. drops off for a period of time whilst the backups run. This then causes the Java applications connecting to the master to experience queuing threads. At the same time as the command counts fluctuate, the performance of all three servers looks as follows. It appears that the performance of the secondaries running backups causes a degradation in performance on the master.

We have gathered the "/var/lib/mongodb/diagnostic.data/" data for each of the servers. Do you have a non-public place for us to upload this data please?

Increasing the size of the Mongo secondaries to match the master has improved the performance; however, we would like to understand why backups on the secondaries cause performance issues on the master.

Kind regards |
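Editorial note: a backup driven this way is typically a per-database loop around mongodump run from cron on each secondary. The sketch below is an assumption pieced together from the command quoted later in the comments; the schedule, hostnames and output path are placeholders, not the reporter's actual script.

    # crontab entry on each secondary: run the backup at 08:00 and 20:00 (illustrative)
    # 0 8,20 * * * /usr/local/bin/mongo-backup.sh
    #
    # mongo-backup.sh - dump every database on this node, one mongodump invocation per DB
    OUT=/backups/$(date +%F-%H)
    for DB in $(mongo --quiet --host localhost --eval 'db.getMongo().getDBNames().join("\n")'); do
        /usr/bin/mongodump --host=localhost --gzip --db="$DB" --out="$OUT"
    done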
| Comments |
| Comment by Kelsey Schubert [ 21/Feb/17 ] |
|
Hi lukemorfitt,

I'm closing this ticket in favor of our investigation on the related ticket (this issue was resolved as a duplicate).

Kind regards, |
| Comment by Luke Morfitt [ 06/Dec/16 ] |
|
Correct. We split the databases we are hosting across the available secondaries and perform backups at 8am / 8pm, aiming to get a backup of every database every 12 hours. To do that we need to back up on both secondaries at the same time, as our backups are rather time consuming and would not complete on a single secondary within a 12-hour period.

Whilst this may not be the usual use case, it does not explain the global-level locking. I would appreciate understanding the global-level locking during the cache eviction process.

Thanks, |
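Editorial note: to make the split described above concrete, here is a minimal sketch of driving two halves of the database list against the two secondaries in parallel. The hostnames and paths are hypothetical placeholders, not the reporter's actual backup script.

    # Hypothetical sketch only: divide the database list in half and dump each half
    # against a different secondary so both halves finish inside the 12-hour window.
    SEC_A=gsp-aeu001-mo05   # placeholder secondary hostnames
    SEC_B=gsp-aeu001-mo06
    DBS=($(mongo --quiet --host "$SEC_A" --eval 'db.getMongo().getDBNames().join("\n")'))
    HALF=$(( (${#DBS[@]} + 1) / 2 ))
    ( for DB in "${DBS[@]:0:$HALF}"; do
          /usr/bin/mongodump --host="$SEC_A" --gzip --db="$DB" --out=/backups/"$SEC_A"
      done ) &
    ( for DB in "${DBS[@]:$HALF}"; do
          /usr/bin/mongodump --host="$SEC_B" --gzip --db="$DB" --out=/backups/"$SEC_B"
      done ) &
    wait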
| Comment by Asya Kamsky [ 05/Dec/16 ] |
|
You say "both of the secondaries are performing backups" - more common would be to use one of them for backups only and let the other one be the "hot standby" for the primary (and be the one available to acknowledge w:majority writes). |
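Editorial note: this point can be made concrete with a quick probe (a sketch only; the database/collection name, host and 5-second timeout are illustrative). In a three-node set, a w:majority write cannot be acknowledged until at least one secondary has replicated it, so a stalled secondary shows up directly as write latency, or as a wtimeout error, on the primary.

    # Probe write: blocks until a majority (primary + one secondary) has acknowledged it.
    # If both secondaries are saturated by mongodump, this either stalls or returns a
    # writeConcernError once the 5-second wtimeout expires.
    mongo --host gsp-aeu001-mo04 --eval '
        var res = db.getSiblingDB("test").latency_probe.insert(
            { ts: new Date() },
            { writeConcern: { w: "majority", wtimeout: 5000 } });
        printjson(res);'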
| Comment by Luke Morfitt [ 05/Dec/16 ] |
|
Hello,

The current theory I have is as follows.

Increasing the size of the secondaries (providing more cache space) does not fix the issue, as they still hit more than 85% cache usage and start to lock. The impact is reduced, but still apparent.

Any thoughts on this matter would be greatly appreciated.

Thanks, |
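Editorial note: the cache-pressure part of this theory is straightforward to observe while a dump is running. Below is a minimal sketch (the hostname is a placeholder) that reads the WiredTiger cache fill and dirty ratios from serverStatus on a secondary; as the fill ratio approaches the configured maximum, eviction work can be pushed onto application threads, which is when operations begin to stall.

    # Read the WiredTiger cache fill and dirty ratios on a secondary while the dump runs
    # (hostname is a placeholder; repeat in a loop to see the trend over time).
    mongo --quiet --host gsp-aeu001-mo05 --eval '
        var c     = db.serverStatus().wiredTiger.cache;
        var used  = c["bytes currently in the cache"];
        var max   = c["maximum bytes configured"];
        var dirty = c["tracked dirty bytes in the cache"];
        print("cache used: " + (100 * used / max).toFixed(1) + "%, " +
              "dirty: "      + (100 * dirty / max).toFixed(1) + "%");'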
| Comment by Luke Morfitt [ 05/Dec/16 ] |
|
I've noticed the following whilst trying to diagnose this issue.

Upon running mongodump with the following command, we see locks appear on the secondary:

"/usr/bin/mongodump --host=localhost --gzip --db=DB_NAME --excludeCollection=requestsByUserState --out=./DB_NAME-vv"

At the same time, the replication commands on the master drop. I would like to understand the cause of the locking taking place. Is it caused by loading data into the WiredTiger cache (as the cache gets to 95% and then loads more from disk)? Or is it caused by a requirement to lock the collection during a dump?

Thanks |
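Editorial note: one way to narrow this down (a sketch of the diagnostics, not a definitive answer) is to watch queued readers/writers and cache activity with mongostat, and to check currentOp for operations actually blocked on a lock, on the secondary while the dump runs.

    # 1-second samples on the secondary during the dump: watch the qr|qw (queued
    # readers/writers) and dirty/used cache columns.
    mongostat --host localhost 1

    # Operations currently blocked waiting for a lock, plus the global-lock counters.
    mongo --host localhost --eval '
        printjson(db.currentOp({ waitingForLock: true }));
        printjson(db.serverStatus().globalLock);'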
| Comment by Luke Morfitt [ 23/Nov/16 ] |
|
Hi,

I've uploaded the files from the servers. Whilst we kick off the backups at 8am each day, I think the issue starts to appear when we begin backing up the larger databases on the cluster, so the symptom shows up about an hour after the backups start. Stopping the backups did resolve the performance issue, which is why the issue appears (though may not be) to occur while mongodump is running. On another cluster we see the issue appear at exactly 8am/8pm.

Thanks |
| Comment by Ramon Fernandez Marina [ 23/Nov/16 ] |
|
Thanks for uploading the diagnostic.data lukemorfitt, we're looking at it. Can you please also upload the compressed mongod logs for all three nodes? The commands graph shows pattern changes at 9pm and 9am, not 8am/8pm; not sure if this is an issue with cacti, a typo in the description, or just a coincidence – but the logs should be able to tell us more about what's going on. |
| Comment by Luke Morfitt [ 22/Nov/16 ] |
|
Hello Thomas,

Thank you for the quick response. I've uploaded the files via curl. To confirm: mongodump runs on both secondaries, and never on the master.

Thanks, |
| Comment by Kelsey Schubert [ 22/Nov/16 ] |
|
Hi lukemorfitt,

Thanks for reporting this issue. I've created a secure portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating the issue and are routinely deleted after some time. Please upload the diagnostic.data for the primary and the secondary running mongodump.

Thank you, |