[SERVER-35958] Big CPU load increase (×4) on secondary by upgrading 3.4.15 → 3.6.5 Created: 04/Jul/18 Updated: 25/Mar/19 Resolved: 25/Mar/19
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, WiredTiger |
| Affects Version/s: | 3.6.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Adrien Jarthon | Assignee: | Geert Bosch |
| Resolution: | Done | Votes: | 0 |
| Labels: | nonnyc, st-triage, storage-engines |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | Upgrade RS secondary from 3.4.15 to 3.6.5 and watch CPU usage |
| Sprint: | Storage NYC 2019-02-25 |
| Participants: | |
| Description |
|
I just tried upgrading one of my RS from 3.4.15 to 3.6.5, thinking that since 4.0 is now released, 3.6 might be mature/stable enough by now, but I had a very bad surprise: the CPU load increased about 4 times on the secondary (I didn't try the primary, of course, as I don't want to lose my cluster yet). As you can see on the chart below, my mongo usage is very stable and the secondary usually stays at 5% of total server CPU on 3.4.15; as soon as I upgrade to 3.6.5 it jumps to 15-25%. The load stayed exactly the same on the primary, and so did its CPU usage.
When I saw that I thought maybe I should try re-syncing the secondary to benefit from the improvements of 3.6, so I did. The first time it filled the disk during the initial sync and crashed mongod, so I tried again; the second time it worked, but the load is exactly the same. I also tried putting back 3.6.0 and 3.6.1 to see where the regression is, but neither booted, for reasons I haven't figured out yet. Is this expected? Are you aware of any change that would cause this huge regression? This is my MongoDB Cloud Manager account monitoring the RS, if it helps: https://cloud.mongodb.com/v2/5012a0ac87d1d86fa8c22e64#metrics/replicaSet/5414cfabe4b0ce23e21b4b3b/overview |
| Comments |
| Comment by Geert Bosch [ 25/Mar/19 ] | |||
|
I went through all changes from 3.6.7 to 3.6.8, and cannot see any that might explain the performance improvement in this case. Anyway, glad to see that the issue has been resolved for you. | |||
| Comment by Adrien Jarthon [ 10/Nov/18 ] | |||
|
Hello, I just tried the new version (3.6.8) again on my secondary to see if it would improve things for this issue, and I was glad to see it did: it's not obvious on this chart when I changed from 3.4.15 to 3.6.8, and that's a good sign (I changed on Nov 8 @ 9am). My 11am spike load is a bit higher, but nowhere near where it was in 3.6.5 to 3.6.7, and the load during the rest of the day is now slightly lower, so this is now totally acceptable to me. Good job! The question now is how did you manage to fix this? Because there was no update in this ticket nor in | |||
| Comment by Adrien Jarthon [ 24/Sep/18 ] | |||
|
Haha, I'm not sure I understand how there can be such huge regressions in major versions more than 6 months old, and how it could sound OK to release a "stable" version knowing about them, but oh well... At least it's tracked now. As for the CPU load on the secondary, I would love to try 4.0.2, but I can't until my cluster is in 3.6 compatibility mode, can I? And I can't really switch to 3.6 mode until it's stable, so I'll have to wait until you fix 3.6 first. Or until it's backported, but I can't see BACKPORT-3280, so I don't know if it's deleted or private. | |||
| Comment by Bruce Lucas (Inactive) [ 24/Sep/18 ] | |||
|
Thanks for the update Adrien. Regarding the increased i/o you are seeing, this is an issue that we are aware of and are working on, tracked by
The flushes are occurring as much as 1,000 times per second, about once per write operation, resulting in a comparable number of separate i/o operations at the disk level, whereas in 3.4 flushes during comparable periods were occurring at about 30-50 times per second. You can follow

Regarding the CPU utilization issue you reported, we speculated previously that it could be partly due to

Thanks, | |||
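For reference, the flush rate described above can be approximated from the WiredTiger section of `serverStatus`. A minimal mongo shell sketch, assuming the `wiredTiger.log` counters are present (stat names come from WiredTiger statistics and may differ between versions):

```javascript
// Sample the WiredTiger "log sync operations" counter twice and report
// the approximate number of journal syncs per second over the interval.
function logSyncsPerSecond(sampleSeconds) {
    var before = db.serverStatus().wiredTiger.log["log sync operations"];
    sleep(sampleSeconds * 1000);
    var after = db.serverStatus().wiredTiger.log["log sync operations"];
    return (after - before) / sampleSeconds;
}

// Example: run on the secondary being investigated.
// logSyncsPerSecond(10)  // ~30-50 on 3.4 vs. up to ~1000 on 3.6.x per this ticket
```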
| Comment by Adrien Jarthon [ 18/Sep/18 ] | |||
|
Hello, it's me again, after an SSD issue on my primary which killed the performance and thus my entire service, as Mongo doesn't fall back to the secondary in this case (unless with

We can see the number of disk writes multiplied by about 6, which in my setup increased global disk I/O usage from 5-8% to 17-22%. This is of course under the exact same load. This machine is more powerful than my secondary so it doesn't struggle as much and the CPU stays about the same, but we can clearly see that by upgrading to 3.6 I just lost 3/4 of my server capacity, which is totally unacceptable for a software upgrade that doesn't bring anything worth a 400% load increase in return. You should still have the diagnostic.data for my primary (web1) before the upgrade; I've just uploaded diagnostic.data for today under 3.6.7 so you can compare. Before 3pm UTC there was some exceptional task running or replication stuff, but after 3pm UTC (when the load settles down on the chart) it should be comparable to a previous day on 3.4. | |||
| Comment by Bruce Lucas (Inactive) [ 27/Jul/18 ] | |||
|
Thanks for the confirmation Adrien. We're continuing to look at these issues and will get back to you. | |||
| Comment by Adrien Jarthon [ 27/Jul/18 ] | |||
|
Indeed this chart shows pretty well what I'm noticing. About | |||
| Comment by Bruce Lucas (Inactive) [ 27/Jul/18 ] | |||
|
Here's a side-by-side comparison of 3.4.15, 3.6.5 (maximum batch size 50,000), and 3.6.6 (at a batch size of 1,000), covering about 14 hours in each case, including the daily update batch job.
This confirms that 3.6.6 with the reduced replication batch size has decreased the CPU utilization during the daily update job. However, overall CPU utilization remains 2-3x that of 3.4.15, and is erratic. It appears that 3.6.6 may be worse than 3.6.5, so I wonder if this is related to | |||
| Comment by Bruce Lucas (Inactive) [ 26/Jul/18 ] | |||
|
Thanks bigbourin@gmail.com, we can see the increased and more erratic CPU utilization and are taking a look and discussing it internally. We'll get back to you with our findings. | |||
| Comment by Adrien Jarthon [ 25/Jul/18 ] | |||
|
Yes, of course you can't compare the two, but you can compare the secondary with previous days running another version, as the load is the same each day, as said earlier. The spikes every hour on web1 are mongodumps (read-only load); that's why they don't appear on web2. | |||
| Comment by Alexander Gorrod [ 25/Jul/18 ] | |||
|
bigbourin@gmail.com it's difficult to compare the web1 and web2 nodes you uploaded diagnostic data for, since one is a primary and the other a secondary. On the other hand - when I load the data it shows the node running 3.6.6 (secondary, called web2) as using consistently less CPU. The node running 3.4.9 has a spiky CPU usage pattern - but it increases along with an increase in queries that isn't seen on the secondary. The following picture shows the statistics: | |||
| Comment by Adrien Jarthon [ 24/Jul/18 ] | |||
|
Ok, here is some feedback. As said earlier I tried 3.6.6, then tried with a batch size of 1000 and then with 200 to see; here is the schedule of the changes: The diagnostic.data has been uploaded to the portal (though it doesn't go back to 07/13, but you have that data in the previous upload). We can see 3.6.6 had little effect, as said earlier; when changing to a 1000 batch size, the impact is more visible: the load during the 11am spike is lower, though the load during the rest of the day seems less stable and a bit higher. Decreasing even further to a 200 batch has little to no visible impact IMO. We're still far from the low and stable load of 3.4.15 ☹ | |||
| Comment by Bruce Lucas (Inactive) [ 17/Jul/18 ] | |||
|
Thanks. The parameter isn't relevant to this issue because in 3.4.15 we didn't save history in cache during replication batch application like we do in 3.6. | |||
| Comment by Adrien Jarthon [ 17/Jul/18 ] | |||
|
Yep, will do! What was the value of this parameter in 3.4.15? Or maybe it's not relevant? | |||
| Comment by Bruce Lucas (Inactive) [ 17/Jul/18 ] | |||
|
Thanks Adrien. Please also remember to upload the diagnostic.data covering both that and the replBatchLimitOperations=1000 run when it has finished. | |||
| Comment by Adrien Jarthon [ 17/Jul/18 ] | |||
|
Ok, here is the result with 3.6.6 (2G cache): it's better than 3.6.5, as the secondary now survives the 11am burst. We can see that the burst lasts less time with this version, because the secondary manages to follow the primary faster and accumulates less lag. But unfortunately it's still saturating at 60%+ CPU and generating some lag, compared to 3.4.15 which tops out at 20%. I'll now try with BatchLimitOperations=1000:
| |||
| Comment by Adrien Jarthon [ 13/Jul/18 ] | |||
|
No problem! Indeed, that's a good lead. I've just installed 3.6.6 (with a 2G cache limit to have comparable metrics) and will let you know! | |||
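The exact way the 2G cache limit was applied isn't shown in the ticket; a sketch of one way to set and verify the WiredTiger cache ceiling from the shell (the same limit can instead be set in mongod.conf via `storage.wiredTiger.engineConfig.cacheSizeGB`):

```javascript
// Adjust the WiredTiger cache ceiling at runtime to roughly 2 GB
// (equivalent to storage.wiredTiger.engineConfig.cacheSizeGB: 2 at startup).
db.adminCommand({ setParameter: 1, wiredTigerEngineRuntimeConfig: "cache_size=2G" });

// Verify the configured limit, in bytes.
db.serverStatus().wiredTiger.cache["maximum bytes configured"];
```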
| Comment by Bruce Lucas (Inactive) [ 13/Jul/18 ] | |||
|
Hi Adrien, Sorry about that, I was misreading the data; your rs config, as you say, indicates you are already on pv1, so that potential issue is ruled out. The next issue we suspect is
We want to determine whether the smaller batch size addresses the problem you're seeing, and also verify that it doesn't create any other problems. Thanks, | |||
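A sketch of how the replication batch limit under discussion can be inspected and lowered, assuming the `replBatchLimitOperations` server parameter referenced later in this ticket (available starting in 3.6.6; depending on the build it may only be settable at startup via `--setParameter`):

```javascript
// Check the current replication batch application limit on the secondary.
db.adminCommand({ getParameter: 1, replBatchLimitOperations: 1 });

// Attempt to lower it at runtime; if the parameter is startup-only in this
// build, start mongod with --setParameter replBatchLimitOperations=1000 instead.
db.adminCommand({ setParameter: 1, replBatchLimitOperations: 1000 });
```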
| Comment by Adrien Jarthon [ 12/Jul/18 ] | |||
|
You're welcome. First of all, my test with more memory (3.5G) showed that it's a bit better: the secondary (3.6.5) survived my 11am burst for the first time, but it's still way slower than 3.4.15. About the protocol version, it seems I'm already on pv1:
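The protocol version can be read directly from the replica set configuration; a minimal check from the mongo shell:

```javascript
// Prints 1 when the replica set is already on protocol version 1 (pv1);
// a pv0 replica set has no protocolVersion field or reports 0.
rs.conf().protocolVersion
```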
Here is the secondary log if you want to check: mongodb.log.gz | |||
| Comment by Bruce Lucas (Inactive) [ 12/Jul/18 ] | |||
|
Thanks for the additional data, Adrien. I think you are correct that cache pressure is an issue here, and would not be surprised if increasing the cache size improves the behavior. There are a few factors that may be contributing to cache pressure, and we would like to eliminate those one at a time to see whether it improves the behavior. The first relates to the use of protocol version 0 (pv0) with 3.6. This is not an ideal combination, as it requires keeping additional update history in the cache on the secondary, and could be the cause of what you're seeing. The default protocol version for new 3.6 replica sets is pv1, but replica sets created under prior versions will continue to use pv0 until explicitly upgraded. You can confirm that your replica set is using pv0 by looking in the mongod log for the 3.6 node; there will be a startup warning. If you don't see that, can you please upload the mongod log file for 3.6 so we can take a look? If you do see the warning, can you please upgrade to pv1, try running the secondary on 3.6 again, and then upload diagnostic.data once the update job has finished? Ideally this would use the same cache size as before to get a fair comparison. Thanks, | |||
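For completeness, the pv0 to pv1 upgrade mentioned above is a single reconfiguration run against the primary; a sketch following the documented procedure:

```javascript
// Run on the primary: switch the replica set from pv0 to pv1.
var cfg = rs.conf();
cfg.protocolVersion = 1;
rs.reconfig(cfg);
```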
| Comment by Adrien Jarthon [ 11/Jul/18 ] | |||
|
Ah yes indeed, I just uploaded 2 new archives. | |||
| Comment by Bruce Lucas (Inactive) [ 11/Jul/18 ] | |||
|
Hi Adrien, The two-day comparison on different versions will be useful since as you say the load is constant. Can you please upload the most recent diagnostic.data from primary and secondary, which should include July 9th? The latest upload we have is from July 8th so it doesn't include that data. Thanks, | |||
| Comment by Adrien Jarthon [ 11/Jul/18 ] | |||
|
Thanks for having a look. I can't show you two secondaries because I only have one (the third member is an arbiter). But I can clearly show you the difference between both versions, as my workload on the primary is extremely stable; it's the same each day (due to the nature of the service: it's updown.io, a monitoring service, so the mongo traffic is 99% periodic requests). I made a new chart by overlapping two days (Jul 8 and Jul 9) showing the CPU usage difference between 3.4 and 3.6 during the night and during the 11am spike:
Before 2am it's different because it was re-syncing, but after 2am you can see the CPU difference during regular load, which as you said is about 2 times higher and much less stable. Then at 11am we can see the CPU activity during a higher load: the workload on the primary is the same, but here the CPU activity on the secondary is not only 4 times higher (~18% → 64%), it doesn't even manage to follow the primary, gets out of sync and stops. Which means that it would require even more than 4 times the CPU to be able to follow the same load. (The reason the machine saturates at ~70% CPU is that it's a VM with stolen CPU.) You can see this in the diagnostic.data files of July 8 for 3.6.5 and July 9 for 3.4.15; the load spike starts at 11am UTC+2 every day and lasts for about 1h. And you can see the workload charts in my cloud manager account I guess? (https://cloud.mongodb.com/v2/5012a0ac87d1d86fa8c22e64#host/replicaSet/5414cfabe4b0ce23e21b4b3b). Otherwise, here is a chart showing the stability of the primary workload during July 8 and 9 (the spike is the 11am burst):
| |||
| Comment by Bruce Lucas (Inactive) [ 09/Jul/18 ] | |||
|
Hi Adrien, Thanks for the data you have collected so far, and thanks for the offer to continue helping us diagnose what you are seeing. In the data that you've uploaded, the most notable correlate of higher CPU utilization is secondary lag:
This is seen at A-B (following initial sync) and C-D and E-F (following some amount of downtime). When the secondary is lagged it will be doing extra work to catch up, which we can see in the "repl apply ops" metric, and this will require more CPU. After the lag catch-up CPU utilization has been accounted for, we still see a residual CPU increase of about double, going from 4-5% to 8-10%. In absolute terms this is not a large increase, but we will look into it further. However, you have indicated the possibility of some other, larger effects, e.g. related to the daily update job, but I don't think we have clean enough data to investigate that. Would you be willing to collect data with one secondary on 3.6.5 and the other on 3.4.15 for comparison, for at least 24 hours? If you need to restart a secondary to get it on 3.6.5, please try to minimize the downtime between restarts to minimize the amount of lag that the secondary has to catch up to. Also please try to restart it during a period of steady load so it has plenty of time to stabilize before any unusual load such as the daily update job. If there are any events you want to call our attention to, please give us a timeline with times as accurate as you can, including dates and timezone. Once you have 24 hours of data on 3.6.5, please upload the content of diagnostic.data from all three nodes. Also, if you can upload the CPU charts from both secondaries that would be useful - as you can see from above we collect CPU information in diagnostic.data, but it will be good to have independent confirmation of that from your charts. Thanks, | |||
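A small sketch for quantifying the secondary lag referred to above, computed from `rs.status()` optimes (the built-in `rs.printSlaveReplicationInfo()` helper in 3.x reports essentially the same thing):

```javascript
// Report each data-bearing member's replication lag, in seconds, behind the primary.
var status = rs.status();
var primary = status.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
status.members.forEach(function (m) {
    if (m.stateStr === "ARBITER") return;  // arbiters carry no data, so no optime
    var lagSecs = (primary.optimeDate - m.optimeDate) / 1000;
    print(m.name + " (" + m.stateStr + "): " + lagSecs + "s behind primary");
});
```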
| Comment by Adrien Jarthon [ 08/Jul/18 ] | |||
|
Ok, to help you pinpoint the issue I erased the data and went back to 3.4.15 to redo the upgrade one version at a time and see the impact. I uploaded the new diagnostic.data files covering all these changes to the portal. Here are the upgrades I made since last time: Here's the CPU chart: All these upgrades showed that the regression seems to be entirely in 3.4.15 → 3.6.0, as the CPU level is basically the same across these versions. It looks a bit lower than in my first tests (Jul 04), I'm not sure why, as the Jul 04 test was already from re-synced data, but it's definitely way higher than in 3.4.15: the re-sync is slower and takes much more CPU, and the update burst I have every morning at 11am, which used to push the secondary CPU to 20%, is enough in 3.6.x to push the CPU to 100% and lose the secondary out of sync... Let me know if I can do anything else to help you diagnose this; in the meantime I'm going back to 3.4.15 as the upgrade is not possible. | |||
| Comment by Adrien Jarthon [ 04/Jul/18 ] | |||
|
Sure, that's uploaded on the portal. For the secondary I forgot to keep the diagnostic folder when I resynced today so it only starts at ~13:00 but that's enough to see 3.4.15 and 3.6.0. | |||
| Comment by Ramon Fernandez Marina [ 04/Jul/18 ] | |||
|
bigbourin@gmail.com, will you please upload the contents of the diagnostic.data directories for the primary and the affected secondary? You can attach them to the ticket if they're not too large, or upload them via this secure portal. Please comment on the ticket when you're done so we can investigate. Thanks, | |||
| Comment by Adrien Jarthon [ 04/Jul/18 ] | |||
|
Here is some more info. I let 3.6.5 run for a while, and this morning at 11am I have a cron which does a lot of mongo updates; this usually takes the secondary CPU load from 6% to ~15%. You can see on the following graph that with 3.6.5 it took it from 20% to 80%; the secondary started lagging as it couldn't keep up, at some point lost the oplog race, and I had to resync it. To help you find the regression more easily I decided to resync it with 3.4.15, first to see if I observe the same load difference, and then to do the upgrade one version at a time. You can see that at 13:30 I started the resync, which finished at 15:00, and the load was back to the more "normal" and stable levels I'm used to. I then tried upgrading to 3.6.0 at 18:00; you can see the resync took a lot more CPU, and then the load on the server is lower than on 3.6.5 but still higher and much less stable than in 3.4.15. I'll let it stabilize a bit to get better average numbers and then continue with 3.6.1. I can provide the diagnostic.data if that is of any help.
BTW, I had some trouble starting 3.6.0 due to a bug with "bindIpAll" which makes the server start but not answer any connections (the log says it couldn't bind the port and shut down, but it's still running...). This doesn't seem to be present in 3.6.5, so I'm considering it a fixed bug ^^ | |||
| Comment by Adrien Jarthon [ 04/Jul/18 ] | |||
|
Here is the image that I couldn't upload at ticket creation: And another one showing the I/O doubled too (same bandwidth but twice the IOPS): |