[SERVER-40433] Mongodump extremely slow on 4.0.8 Created: 02/Apr/19 Updated: 20/Aug/19 Resolved: 20/Aug/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.0.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Paul Henderson | Assignee: | Danny Hatcher (Inactive) |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Operating System: | ALL |
| Sprint: | Storage Engines 2019-07-01 |
| Participants: | |
| Story Points: | 3 |
| Description |
|
Since upgrading from MongoDB 3.4.6 to 4.0.8, our mongodumps have become extremely slow. The command we run for backups is `mongodump --archive=${DIR}.gz`. Since the upgrade we are seeing roughly 10x slower backups. |
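For context, a minimal sketch of the kind of backup invocation described above; the host, port, and `${DIR}` values are illustrative placeholders, not taken from this ticket:

```sh
#!/bin/sh
# Illustrative backup script matching the reported workload.
# DIR and the target host/port are placeholders; adjust for your environment.
DIR=/backups/backup-$(date +%Y-%m-%dT%H-%M-%S)

# Dump the whole deployment into a single archive file, as in the report;
# add --gzip if on-the-fly compression is wanted.
mongodump --host mongodb4.example.net --port 27017 \
          --archive="${DIR}.gz"
```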
| Comments |
| Comment by Danny Hatcher (Inactive) [ 20/Aug/19 ] |
|
Ah, I apologize for the confusion! Thanks for letting us know and for your patience while we fixed the problem. |
| Comment by Devon Yang [ 19/Aug/19 ] |
|
Sorry if it wasn't clear, I can confirm the issue is resolved for us in 4.0.12 on behalf of Paul. |
| Comment by Danny Hatcher (Inactive) [ 19/Aug/19 ] |
|
paulblend, as Devon mentioned, we implemented the fix we were waiting for in 4.0.12. Could you please try upgrading a Secondary and see if performance is improved? devony, I can't speak to |
| Comment by Devon Yang [ 19/Aug/19 ] |
|
Just wanted to let you know this is resolved in 4.0.12.
I have a question related to eviction that can maybe be answered quickly here. Is there a reason that the WiredTiger file_manager close_idle_time is still set to 100000, despite https://jira.mongodb.org/browse/WT-1930 having merged to support a value of 0 (https://jira.mongodb.org/browse/SERVER-18286)? In https://jira.mongodb.org/browse/SERVER-40773 we were seeing large dhandle sweeps at this timeout, which we think might be mitigated if the 0 setting is safe to use. |
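For reference, a hedged sketch of how that file_manager setting can be overridden at mongod startup, assuming the standard `--wiredTigerEngineConfigString` passthrough; the value shown is purely illustrative, and whether 0 is actually safe is exactly the open question above:

```sh
# Illustrative only: override WiredTiger's file_manager settings when
# starting mongod. close_idle_time is in seconds; 100000 is the mongod
# default discussed above, and 0 would stop idle data handles from
# being closed at all.
mongod --config /etc/mongod.conf \
       --wiredTigerEngineConfigString="file_manager=(close_idle_time=600)"
```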
| Comment by Danny Hatcher (Inactive) [ 26/Jun/19 ] |
|
paulblend, I apologize for the confusion surrounding the status of this ticket. The workload that you've shown is concerning and I'd like to ensure that we've taken concrete steps to address it before closing this ticket. We've recently made some changes to the WiredTiger storage engine that may help alleviate the problem. I'm following up internally to see about getting those changes backported to a 4.0 build so that you can see if they help your issue. |
| Comment by Haribabu Kommi [ 24/Jun/19 ] |
|
The problem is happening both before and after the mongodump. Due to the read-heavy load, the cache eviction server is not able to clean the cache, which leads to stalling. |
| Comment by Paul Henderson [ 20/Jun/19 ] |
|
The deployment has 271 databases in it, all of which are hurt by the issue. I took db.stats() from 2 larger ones and 2 smaller ones. |
| Comment by Dmitry Agranat [ 20/Jun/19 ] |
|
Hi devony, Could you please provide the output of dbStats() for the relevant database? This will give us more information about how much data the mongodump is trying to access. Thank you, |
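A minimal sketch of one way to capture this for every database at once, assuming the legacy `mongo` shell is available; host, port, and output path are placeholders:

```sh
# Illustrative: print db.stats() for every database so the output can be
# attached to the ticket. Host/port and the output file are placeholders.
mongo --host localhost --port 27017 --quiet --eval '
  db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
    printjson(db.getSiblingDB(d.name).stats());
  });
' > dbstats.json
```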
| Comment by Dmitry Agranat [ 30/May/19 ] |
|
Hi devony, thank you for uploading new data; we will analyze it and report back with next steps. In the meantime, regarding your latest question:
As long as this is not a sharded cluster and the compatibility version was not set to 4.0, there should not be any side effects from having one node on 3.6 until we sort this out. Thank you, |
| Comment by Devon Yang [ 30/May/19 ] |
|
Hi @Dima, I've uploaded the relevant diagnostics files through the secure portal. The data from the affected timeframe should be in metrics.2019-05-29T18-55-39Z-00000. I also want to double-check that there are no known side effects to running a primary and a replica on 4.0.9 while there is also a secondary on 3.6.6. Obviously we will not set compatibility mode to 4.0 until this is resolved. |
| Comment by Dmitry Agranat [ 30/May/19 ] |
|
Hi paulblend, Thank you for the update. Could you please upload the diagnostic data covering the latest test you did after upgrading your node to 3.6.10? This information will help us determine the next steps. Leaving one node on 3.6 until we sort this out should not be an issue. Thank you, |
| Comment by Paul Henderson [ 29/May/19 ] |
|
Would there be any problems upgrading 2 nodes to 4.0.x but leaving one as 3.6.x while the backup issue is sorted out?
As for the above, I more or less went with option 1: I upgraded our 3.6.6 machine to 3.6.10, and the backup problem came back. I downgraded the server back to 3.6.6 and the problem went away again. |
| Comment by Dmitry Agranat [ 28/May/19 ] |
|
Hi paulblend, We are still investigating this issue. Currently, one of the suspects is SERVER-40925 and/or the change that was made in
Please choose one of the above options, whichever is more convenient for you. Once you have repeated the same workload after the proposed change, please upload the new set of requested data to the secure upload portal. Thank you, |
| Comment by Paul Henderson [ 16/May/19 ] |
|
sent |
| Comment by Danny Hatcher (Inactive) [ 16/May/19 ] |
|
Just 4.0.9 should be sufficient. |
| Comment by Paul Henderson [ 16/May/19 ] |
|
Haven't had a chance, will get to it today. Do you need the 3.6.12 versions again or just the 4.0.9 ones? |
| Comment by Danny Hatcher (Inactive) [ 16/May/19 ] |
|
Hello Paul, Have you had a chance to send clean logs? Having the comparison would be extremely helpful. |
| Comment by Paul Henderson [ 09/May/19 ] |
|
Actually disregard and delete those files if possible. I'll resend new logs in a bit. |
| Comment by Paul Henderson [ 09/May/19 ] |
|
Timestamps for when the backups were run against 4.0.9: 10:45am-noon PST on 5/9. |
| Comment by Paul Henderson [ 09/May/19 ] |
|
I uploaded backup logs and diagnostic data from 4.0.9 and re-uploaded the 3.6.6 and 3.6.12 data. |
| Comment by Paul Henderson [ 08/May/19 ] |
|
I'll get around to that sometime in the next day or so and upload the files. |
| Comment by Danny Hatcher (Inactive) [ 08/May/19 ] |
|
Paul, That would be a big help! Danny |
| Comment by Paul Henderson [ 08/May/19 ] |
|
Do you need me to re-upgrade a node to 4.0 to test the backups there and get the logs/diagnostic data again? |
| Comment by Sulabh Mahajan [ 07/May/19 ] |
|
I had a quick look at the FTDC data from 3.6.12-bad. Here are my observations:
Further investigation is needed to figure out what operation caused this 10-second stall. I could not locate FTDC from 4.0; it would be worth looking at that to make sure 4.0 suffered from the same issue. |
| Comment by Danny Hatcher (Inactive) [ 30/Apr/19 ] |
|
Hello Paul, It does appear that this is an issue with eviction but I'm not sure yet why one node is worse. The info you've provided should be sufficient for me to dig into for now. Thanks, Danny |
| Comment by Paul Henderson [ 25/Apr/19 ] |
|
Having issues getting exact logs. The bad runs happened between 12:30 and 13:30 PST yesterday; the good runs happened after 13:30 PST. I'm having trouble finding better timestamps. If you need them, I can resend files from today, this time with a mongo log, since I'm failing to provide better timestamps. |
| Comment by Danny Hatcher (Inactive) [ 25/Apr/19 ] |
|
Hello Paul, Thanks for the logs. Do you have timestamps for your runs that I can use to match with the logs and diagnostics? The output you provided in the server40433* files was very helpful before, and it would be great to have the equivalent here. Danny |
| Comment by Paul Henderson [ 23/Apr/19 ] |
|
I'll rerun later today or tomorrow; right now one is on 3.6.12 and the other is on 3.6.6.
|
| Comment by Danny Hatcher (Inactive) [ 23/Apr/19 ] |
|
Hello Paul, Actually, if both of your secondaries are the same size that is an ideal comparison. What versions are they at this point? If one is pre-3.6.12 and one is after, could you run the tests again? Danny |
| Comment by Paul Henderson [ 23/Apr/19 ] |
|
Thanks for looking. We have had a couple of wrinkles come up. The only reason we have a secondary that is so much larger is that the issue was occurring when all the machines were the same size. Increasing the secondary's size was a bit of a Hail Mary, which I guess might have made things worse. We have since moved the large machine to be primary and have 2 same-sized small machines as secondaries. I can decrease the wiredTigerCacheSize on the primary to test, but more likely we will take a few days to upgrade the small machines to all be the same size as the larger one so all caches will be the same large value. It turns out we also have this same issue when upgrading from 3.6.6 to 3.6.12: our backups run fine on 3.6.6 but have the same slowness issue on 3.6.12 that we were seeing on 4.0.8. |
| Comment by Danny Hatcher (Inactive) [ 22/Apr/19 ] |
|
Hello Paul, Apologies for the delay. Looking into the situation for your 4.0 test, I can see that within 3 minutes of the mongodump starting your WiredTiger cache reached 95% capacity. When the cache takes up that much space, it starts to use application threads that would otherwise be servicing queries to help evict. That dramatically slows down your server, which is why the 4.0 mongodump takes so long. The question is why your cache rose from 80% (normal) to 95% (very high) so quickly on 4.0 but not on 3.6. I see that your 4.0 server is configured with twice the amount of RAM compared to your 3.6 server, and thus twice the WiredTiger cache size. We have seen issues in the past with eviction when the available cache for a node is very large. Would it be possible for you to set the wiredTigerCacheSize parameter for the larger secondary down to match the default value (61GB) of the other secondary? This may negatively impact the performance of your cluster if you are running a large amount of secondary reads, so you should do this during a maintenance period or one of low load. Thanks, Danny |
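A minimal sketch of one way to apply this, assuming the standard `--wiredTigerCacheSizeGB` startup option (equivalent to `storage.wiredTiger.engineConfig.cacheSizeGB` in mongod.conf); the config path is a placeholder and 61 is the target value mentioned above:

```sh
# Illustrative: restart the larger secondary with its WiredTiger cache
# capped at 61GB so it matches the other secondary. A restart is needed
# for this startup option, so schedule it for a maintenance window or a
# period of low load.
mongod --config /etc/mongod.conf --wiredTigerCacheSizeGB 61
```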
| Comment by Paul Henderson [ 17/Apr/19 ] |
|
Any updates? |
| Comment by Paul Henderson [ 05/Apr/19 ] |
|
Let me know if there is anything else I can provide that would be helpful in debugging. |
| Comment by Paul Henderson [ 02/Apr/19 ] |
|
Thanks for looking. I uploaded the metrics files from diagnostic.data and logs for mongodb3 and mongodb4. I prefixed the files with server40433. |
| Comment by Danny Hatcher (Inactive) [ 02/Apr/19 ] |
|
Hello Paul, Thanks for your report. In order for us to diagnose, please attempt the mongodump against both the 4.0 and the 3.6 instances and provide the resulting mongod logs plus the "diagnostic.data" folders under each node's `$dbpath`. You can upload the files to our Secure Upload Portal. Please rest assured that only MongoDB engineers will be able to see the files, and they will be deleted automatically after 90 days. Thanks, Danny |
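A hedged sketch of how that comparison and collection might look; hostnames, log path, and dbpath are placeholders, not values from this ticket:

```sh
# Illustrative: time the same dump against the 4.0 and 3.6 nodes, then
# bundle each node's mongod log and diagnostic.data directory for upload.
time mongodump --host mongodb4.example.net --archive=dump-4.0.archive
time mongodump --host mongodb3.example.net --archive=dump-3.6.archive

# Run on each node, adjusting the log path and dbpath as appropriate.
tar czf server40433-diagnostics.tar.gz \
    /var/log/mongodb/mongod.log \
    /data/db/diagnostic.data
```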
| Comment by Paul Henderson [ 02/Apr/19 ] |
|
Another note: only one of our mongo instances is on 4.0, whereas the others are on 3.6. When we point the backups at the 4.0 instance they run slow; repointing them at a secondary on 3.6 speeds them back up. |