[SERVER-56237] Degraded performance on ARM AWS Graviton2 instances compared with x86_64 architecture Created: 21/Apr/21 Updated: 08/Aug/21 Resolved: 08/Aug/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ivan Takarlikov | Assignee: | Dmitry Agranat |
| Resolution: | Done | Votes: | 2 |
| Labels: | Bug | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||||||||||
| Description |
|
We migrated our MongoDB server from AWS r5d.16xlarge to r6gd.16xlarge (Intel Xeon Platinum 8000 -> AWS Graviton2) and saw significant performance degradation. Performance degrades when many queries run in parallel, regardless of their type (we noticed it on aggregation, distinct, and filter queries, so please don't assume the issue is tied to a particular kind of query). The screenshot below shows metrics for two identical databases in the same replica set: the green line is ARM (r6gd.16xlarge), the yellow line is x86_64 (r5d.16xlarge), and the vertical red bar marks the time when we switched primaries (from ARM to x86_64). Our clients read only from primaries. As you can see, the same kind of load produces 100% load on the ARM server but only ~10% load on the x86_64 server. Load is constant throughout the day.
|
| Comments |
| Comment by Dmitry Agranat [ 29/Jul/21 ] | ||||||||||||||||||
|
Thanks ivan.takarlikov@sensortower.com for uploading the diagnostic.data. Though it is from a different replica set than what we have originally investigated back in April, I do not see any symptoms of the issue that we've investigated so far. Overall, the performance seems normal w/o saturating any resources, specifically CPU. We are pleased to see that the new release seems to be a big improvement for you, please let us know if you need anything else. Regards, | ||||||||||||||||||
| Comment by Dmitry Agranat [ 27/Jul/21 ] | ||||||||||||||||||
|
Thanks ivan.takarlikov@sensortower.com, I can see the uploaded file diag.data.tar.gz | ||||||||||||||||||
| Comment by Ivan Takarlikov [ 27/Jul/21 ] | ||||||||||||||||||
|
Yeah, that works, file sent, could you please confirm that you get it? | ||||||||||||||||||
| Comment by Dmitry Agranat [ 27/Jul/21 ] | ||||||||||||||||||
|
Apologies ivan.takarlikov@sensortower.com, can you try this link? | ||||||||||||||||||
| Comment by Ivan Takarlikov [ 27/Jul/21 ] | ||||||||||||||||||
|
Are you sure that it is the correct link? It requires me to authenticate on corp.mongodb.org with a username and password, but I previously logged in to Jira with Google SAML, so it looks like I have no username/password. | ||||||||||||||||||
| Comment by Dmitry Agranat [ 27/Jul/21 ] | ||||||||||||||||||
|
Yes ivan.takarlikov@sensortower.com, that would be great. | ||||||||||||||||||
| Comment by Ivan Takarlikov [ 27/Jul/21 ] | ||||||||||||||||||
|
Hey! @Dmitry Agranat yeah, we upgraded most of our primaries and secondaries. Do you need diagnostic.data from one of the primaries that were upgraded? | ||||||||||||||||||
| Comment by Dmitry Agranat [ 27/Jul/21 ] | ||||||||||||||||||
|
Hi ivan.takarlikov@sensortower.com, did you have a chance to upgrade the rest of the members and let it run for a couple of days? | ||||||||||||||||||
| Comment by Ivan Takarlikov [ 20/Jul/21 ] | ||||||||||||||||||
|
@Dmitry Agranat, yeah, sure! | ||||||||||||||||||
| Comment by Dmitry Agranat [ 20/Jul/21 ] | ||||||||||||||||||
|
Thanks for the update ivan.takarlikov@sensortower.com. Once all the members of the replSet are upgraded, could you give it a couple of days and then upload the diagnostic.data into this secure portal? We could use this data to compare to the previously collected information to quantify the change. | ||||||||||||||||||
| Comment by Ivan Takarlikov [ 20/Jul/21 ] | ||||||||||||||||||
|
Hey, @Dmitry Agranat, yeah, we upgraded one of our primaries on 7/17. Load spikes decreased significantly, as you can see on the screenshot. We are aiming to upgrade the rest of our ARM primaries/secondaries this week. | ||||||||||||||||||
| Comment by Dmitry Agranat [ 20/Jul/21 ] | ||||||||||||||||||
|
Hi oliver@sensortower.com, did you have a chance to test with 4.4.7? | ||||||||||||||||||
| Comment by Daniel Pasette (Inactive) [ 16/Jul/21 ] | ||||||||||||||||||
|
4.4.7 is now GA: https://www.mongodb.com/community/forums/t/mongodb-4-4-7-is-released/115467 | ||||||||||||||||||
| Comment by Dmitry Agranat [ 16/Jul/21 ] | ||||||||||||||||||
|
oliver@sensortower.com, 4.4.7 is currently in code review, I will send an additional update at the beginning of next week. | ||||||||||||||||||
| Comment by Dmitry Agranat [ 15/Jul/21 ] | ||||||||||||||||||
|
Hi oliver@sensortower.com, I will check and get back to you with 4.4.7 release information. | ||||||||||||||||||
| Comment by Oliver Yeh [ 13/Jul/21 ] | ||||||||||||||||||
|
Hey dmitry.agranat, any idea when 4.4.7 will be released? The release candidate has been dormant for 3 weeks now.
Thanks, Oliver | ||||||||||||||||||
| Comment by Ivan Takarlikov [ 24/Jun/21 ] | ||||||||||||||||||
|
Hey, @Dmitry Agranat, thanks for letting us know! | ||||||||||||||||||
| Comment by Oliver Yeh [ 24/Jun/21 ] | ||||||||||||||||||
|
| ||||||||||||||||||
| Comment by Dmitry Agranat [ 24/Jun/21 ] | ||||||||||||||||||
|
Hi ivan.takarlikov@sensortower.com, oliver@sensortower.com, the | ||||||||||||||||||
| Comment by Dmitry Agranat [ 20/May/21 ] | ||||||||||||||||||
|
oliver@sensortower.com, my last comment was just to underline the importance of using a platform with a libc that uses LSE, that's all. | ||||||||||||||||||
| Comment by Oliver Yeh [ 18/May/21 ] | ||||||||||||||||||
|
Hi Dmitry - It is Ivan and my understanding that this option
does not require a custom 4.4.5 build and therefore does not require you to know what OS version we are on. Please let us know if we misunderstood you. Sorry about the hassle. GCC 8.5 was just released so
| ||||||||||||||||||
| Comment by Ivan Takarlikov [ 18/May/21 ] | ||||||||||||||||||
|
We are running on Amazon Linux 2 | ||||||||||||||||||
| Comment by Dmitry Agranat [ 18/May/21 ] | ||||||||||||||||||
|
ivan.takarlikov@sensortower.com, as per one of my earlier comments, could you please clarify the exact OS you intend to deploy the fix on? It is important for a platform to be with a libc that uses LSE, per https://github.com/aws/aws-graviton-getting-started/blob/main/c-c%2B%2B.md#large-system-extensions-lse. Note that on Ubuntu 18.04 you would need to install an alternative libc package, called libc6-lse. | ||||||||||||||||||
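As a rough aid to the LSE point above, the following is a minimal sketch of how one might check whether the CPU itself advertises LSE atomics on a Linux aarch64 host. It assumes the kernel reports the feature as the `atomics` flag on the `Features` line of `/proc/cpuinfo`; the function name `has_lse_atomics` is hypothetical. It only covers the hardware side and says nothing about whether the installed libc or the mongod binary actually uses LSE instructions.

```python
def has_lse_atomics(cpuinfo_text: str) -> bool:
    """Return True if any 'Features' line lists the 'atomics' flag.

    Assumption: on Linux aarch64, LSE support is reported as 'atomics'
    in the 'Features' line of /proc/cpuinfo.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only look at well-formed "Features : ..." lines.
        if sep and key.strip().lower() == "features":
            if "atomics" in value.split():
                return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print("LSE atomics reported by CPU:", has_lse_atomics(f.read()))
    except OSError as exc:
        print("could not read /proc/cpuinfo:", exc)
```

A `True` result means only that the hardware supports LSE; the libc requirement described in the comment above still has to be verified separately for the chosen OS.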
| Comment by Ivan Takarlikov [ 17/May/21 ] | ||||||||||||||||||
|
Hey, Dmitry! Looks like we got GCC 8.5 released - https://gcc.gnu.org/releases.html Can we expect a fix for our issue soon? | ||||||||||||||||||
| Comment by Ivan Takarlikov [ 30/Apr/21 ] | ||||||||||||||||||
|
Hey! We prefer to proceed with 2nd option - (Waiting for the release of Thanks for help! | ||||||||||||||||||
| Comment by Dmitry Agranat [ 28/Apr/21 ] | ||||||||||||||||||
|
Hi ivan.takarlikov@sensortower.com, I have a few follow-up options in regards to the next steps. In order to progress this investigation, these are the options that should bring us closer to the reported issue resolution:
Please let us know which option works best for you. | ||||||||||||||||||
| Comment by Ivan Takarlikov [ 27/Apr/21 ] | ||||||||||||||||||
|
Thanks, Dmitry! Should we wait for the new mongo minor release (4.4.6 for example) after the release of GCC 8.5? | ||||||||||||||||||
| Comment by Dmitry Agranat [ 27/Apr/21 ] | ||||||||||||||||||
|
Thanks ivan.takarlikov@sensortower.com for providing the requested information. It turns out the reason you have experienced such an issue on the ARM instance is because of You can start watching Regards, | ||||||||||||||||||
| Comment by Ivan Takarlikov [ 26/Apr/21 ] | ||||||||||||||||||
|
Attached are files with logs, diagnostic.data, and perf data. The perf data was collected during the period 2021-04-26T14:24:26Z - 2021-04-26T14:31:33Z (this is also reflected in the file names).
About `Slow query`: yes, I removed those entries because they contain some sensitive info. If that info is important to you, I can provide sample log entries with the sensitive data replaced by random values and the log structure preserved. BTW, thanks for the investigation!
| ||||||||||||||||||
| Comment by Dmitry Agranat [ 26/Apr/21 ] | ||||||||||||||||||
|
Thanks ivan.takarlikov@sensortower.com, after inspecting the provided data, I see what you mean. I have a couple of clarification points at this stage:
How to record perf call stack samples and generate text output:
After the perf data is collected, we will need the exact timestamp when the perf data was collected, a fresh set of diagnostic.data and mongod logs covering the time of the event. Note that it is important to run perf script on the same node where perf.data was generated so that it can be correctly symbolized using the addresses on that machine. Also, the perf utility, which is a part of linux-tools package, is not installed by default. | ||||||||||||||||||
| Comment by Ivan Takarlikov [ 23/Apr/21 ] | ||||||||||||||||||
|
There are two log files and data directories, one from each of the x86 and ARM servers. The ARM server was primary and had been experiencing problems since Fri Apr 23 09:30:00 UTC 2021, so we switched the primary back to x86 at Fri Apr 23 15:33:10 UTC 2021 (see screenshot for details). After the switch, the request rate and the nature of the queries stayed the same, but load decreased significantly on the x86 server. MongoDB versions are the same on both instances: 4.4.5 | ||||||||||||||||||
| Comment by Dmitry Agranat [ 21/Apr/21 ] | ||||||||||||||||||
|
Hi ivan.takarlikov@sensortower.com, Would you please archive (tar or zip) the mongod.log files covering the incident and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Please mention the exact timestamp (start/end) and the timezone of the event you'd like us to investigate. So that we could compare and comment on the reported degradation, please upload the requested data separately for:
One clarifying question at this time, was the MongoDB version the same on these two instances? Dima | ||||||||||||||||||
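The archiving step requested above could be scripted roughly as follows. This is a sketch only: the function name `archive_diagnostics`, the archive layout, and the example paths are assumptions for illustration, not part of MongoDB's tooling, and the upload itself is not covered.

```python
import tarfile
from pathlib import Path


def archive_diagnostics(dbpath, log_files, out_path):
    """Bundle $dbpath/diagnostic.data and the given mongod logs into a .tar.gz.

    All names here are hypothetical; adjust paths to the actual deployment.
    """
    diag_dir = Path(dbpath) / "diagnostic.data"
    if not diag_dir.is_dir():
        raise FileNotFoundError(f"no diagnostic.data directory under {dbpath}")

    with tarfile.open(out_path, "w:gz") as tar:
        # Adds the directory and its contents recursively.
        tar.add(diag_dir, arcname="diagnostic.data")
        for log in log_files:
            log = Path(log)
            tar.add(log, arcname=f"logs/{log.name}")
    return Path(out_path)
```

For example, `archive_diagnostics("/var/lib/mongo", ["/var/log/mongodb/mongod.log"], "arm-primary.tar.gz")` would produce one archive per instance (both paths are placeholders), which matches the request to upload the data separately for each server.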
| Comment by Ivan Takarlikov [ 21/Apr/21 ] | ||||||||||||||||||
|
Screenshot from description is in attachments |