[SERVER-56237] Degraded performance on ARM AWS Graviton2 instances compared with x86_64 architecture Created: 21/Apr/21  Updated: 08/Aug/21  Resolved: 08/Aug/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ivan Takarlikov Assignee: Dmitry Agranat
Resolution: Done Votes: 2
Labels: Bug
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-04-27 at 16.21.33.png     PNG File Screen Shot 2021-04-27 at 16.21.56.png     PNG File image-2021-04-30-19-10-21-812.png     PNG File image-2021-07-20-16-55-27-165.png     File perf_SERVER-56237.svg     PNG File screenshot-1.png    
Issue Links:
Related
is related to SERVER-56347 Enable LSE intrinsics in release Closed
Operating System: ALL
Steps To Reproduce:
  1. Deploy two MongoDB databases, one on an ARM AWS server and the second on an x86_64 server
  2. Make the ARM server the primary
  3. Apply a heavy concurrent load and notice the 100% CPU load
  4. Switch the primary to the x86_64 server and notice the reduced load (see the sketch below)
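A minimal sketch of the primary switch in steps 2 and 4, driven from the command line via the legacy mongo shell (hostnames are placeholders, not the actual ones used here):

# ask the current primary to step down so another electable member (ARM or x86_64) takes over
mongo --host <current-primary-host> --eval 'rs.stepDown()'
# confirm which member is primary now
mongo --host <any-member-host> --eval 'print(rs.isMaster().primary)'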
Participants:

 Description   

We migrated our mongodb server from AWS r5d.16xlarge to r6gd.16xlarge (Intel Xeon Platinum 8000 -> AWS Graviton 2) and faced significant performance degradation.

Performance degrades under many parallel queries of any nature (we noticed the degradation on aggregation/distinct/filter queries alike, so the issue does not appear to be tied to a particular kind of query).

The screenshot below shows metrics for the two identical databases in one replica set: the green one is ARM (r6gd.16xlarge), the yellow one is x86_64 (r5d.16xlarge), and the vertical red bar marks the point in time when we switched primaries (from ARM to x86_64). Our clients read only from primaries. As you can see, the same kind of load produces 100% load on the ARM server but only ~10% load on the x86_64 server. The load is constant throughout the day.



 Comments   
Comment by Dmitry Agranat [ 29/Jul/21 ]

Thanks ivan.takarlikov@sensortower.com for uploading the diagnostic.data. Though it is from a different replica set than the one we originally investigated back in April, I do not see any symptoms of the issue we have investigated so far. Overall, the performance seems normal, without saturating any resources, specifically CPU.

We are pleased to see that the new release seems to be a big improvement for you. Please let us know if you need anything else.

Regards,
Dima

Comment by Dmitry Agranat [ 27/Jul/21 ]

Thanks ivan.takarlikov@sensortower.com, I can see the uploaded file diag.data.tar.gz

Comment by Ivan Takarlikov [ 27/Jul/21 ]

Yeah, that works. File sent, could you please confirm that you received it?

Comment by Dmitry Agranat [ 27/Jul/21 ]

Apologies ivan.takarlikov@sensortower.com, can you try this link?

Comment by Ivan Takarlikov [ 27/Jul/21 ]

Are you sure that it is the correct link? It requires me to authenticate on corp.mongodb.org with a user/pass, but I previously logged in to Jira with Google SAML, so it looks like there is no user/pass for me.

Comment by Dmitry Agranat [ 27/Jul/21 ]

Yes ivan.takarlikov@sensortower.com, that would be great.

Comment by Ivan Takarlikov [ 27/Jul/21 ]

Hey! @Dmitry Agranat yeah, we upgraded most of our primaries and secondaries. Do you need diagnostic.data from one of the primaries that were upgraded?

Comment by Dmitry Agranat [ 27/Jul/21 ]

Hi ivan.takarlikov@sensortower.com, did you have a chance to upgrade the rest of the members and let it run for a couple of days?

Comment by Ivan Takarlikov [ 20/Jul/21 ]

@Dmitry Agranat, yeah, sure!

Comment by Dmitry Agranat [ 20/Jul/21 ]

Thanks for the update ivan.takarlikov@sensortower.com. Once all the members of the replSet are upgraded, could you give it a couple of days and then upload the diagnostic.data into this secure portal? We could use this data to compare to the previously collected information to quantify the change.

Comment by Ivan Takarlikov [ 20/Jul/21 ]

Hey, @Dmitry Agranat, yeah, we upgraded one of our primaries on 7/17; the load spikes decreased significantly, as you can see on the screenshot.

We are aiming to upgrade the rest of our ARM primaries/secondaries this week.

Comment by Dmitry Agranat [ 20/Jul/21 ]

Hi oliver@sensortower.com, did you have a chance to test with 4.4.7?

Comment by Daniel Pasette (Inactive) [ 16/Jul/21 ]

4.4.7 is now GA:

https://www.mongodb.com/community/forums/t/mongodb-4-4-7-is-released/115467

Comment by Dmitry Agranat [ 16/Jul/21 ]

oliver@sensortower.com, 4.4.7 is currently in code review, I will send an additional update at the beginning of next week.

Comment by Dmitry Agranat [ 15/Jul/21 ]

Hi oliver@sensortower.com, I will check and get back to you with 4.4.7 release information.

Comment by Oliver Yeh [ 13/Jul/21 ]

Hey dmitry.agranat, any idea when 4.4.7 will be released? The release candidate has been dormant for 3 weeks now.

 

Thanks

Oliver

Comment by Ivan Takarlikov [ 24/Jun/21 ]

Hey, @Dmitry Agranat, thanks for letting us know!
As I understand it, the only way to test the 4.4.7-rc0 version is to replace the binaries with the ones from the tgz archive in the MongoDB Download Center. I did that on our testing servers and it ran well, but there is no significant load there to actually verify the fix. So let's wait until 4.4.7 is in the yum repo to test it safely under actual production load, because I think that replacing binaries by hand is not the best way to upgrade the mongo version on production databases.
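For reference, a rough sketch of what such a binary swap looks like on a test node (the archive name and paths are illustrative and assume an RPM-based install with the binaries in /usr/bin):

# stop the running server, copy the binaries from the Download Center tgz over the installed ones, restart
sudo systemctl stop mongod
tar -xzf mongodb-linux-aarch64-amazon2-4.4.7-rc0.tgz    # archive name is illustrative
sudo cp mongodb-linux-aarch64-amazon2-4.4.7-rc0/bin/* /usr/bin/
sudo systemctl start mongod
mongod --version    # should now report 4.4.7-rc0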

Comment by Oliver Yeh [ 24/Jun/21 ]

 - we were actually waiting for 4.4.7 to come out for us to try it. But we will test it soon, depending on Ivan's schedule!

Comment by Dmitry Agranat [ 24/Jun/21 ]

Hi ivan.takarlikov@sensortower.com, oliver@sensortower.com, SERVER-56347 is fixed and you can try it out in the release candidate 4.4.7-rc0. We are very interested in hearing from you about the results.

Comment by Dmitry Agranat [ 20/May/21 ]

oliver@sensortower.com, my last comment was just to underline the importance of using a platform with a libc that uses LSE, that's all.

Comment by Oliver Yeh [ 18/May/21 ]

Hi Dmitry - 

It is Ivan's and my understanding that this option

  • Waiting for the release of SERVER-56347 (Currently, I am unable to provide the ETA for its release).

does not require a custom 4.4.5 build and therefore does not require you to know what OS version we are on.  Please let us know if we misunderstood you.  Sorry about the hassle. GCC 8.5 was just released so SERVER-56347 should be resolved soon.

 

Comment by Ivan Takarlikov [ 18/May/21 ]

We are running on Amazon Linux 2 

Comment by Dmitry Agranat [ 18/May/21 ]

ivan.takarlikov@sensortower.com, as per one of my earlier comments, could you please clarify the exact OS you intend to deploy the fix on? It is important that the platform has a libc that uses LSE, per https://github.com/aws/aws-graviton-getting-started/blob/main/c-c%2B%2B.md#large-system-extensions-lse. Note that on Ubuntu 18.04 you would need to install an alternative libc package, called libc6-lse.
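For reference, a quick way to sanity-check both sides, along the lines of what that Graviton guide suggests (the mongod path is a placeholder):

# Graviton2 advertises LSE via the "atomics" CPU feature flag
grep -m1 -o atomics /proc/cpuinfo
# a build with LSE intrinsics contains LSE instructions (ldadd/cas/swp etc.) instead of ldxr/stxr loops
objdump -d /usr/bin/mongod | grep -c -E '\b(ldadd|cas|swp|stadd)'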

Comment by Ivan Takarlikov [ 17/May/21 ]

Hey, Dmitry! Looks like GCC 8.5 has been released - https://gcc.gnu.org/releases.html

Can we expect a fix for our issue soon? 

Comment by Ivan Takarlikov [ 30/Apr/21 ]

Hey! We prefer to proceed with the 2nd option (waiting for the release of SERVER-56347).

Thanks for the help!

Comment by Dmitry Agranat [ 28/Apr/21 ]

Hi ivan.takarlikov@sensortower.com, I have a few follow-up options regarding the next steps. To progress this investigation, these are the options that should bring us closer to resolving the reported issue:

  • Providing you with a custom 4.4.5 build which targets armv8.1-a or armv8.2-a, where the LSE intrinsics are built-in. If this is a valid option for you, could you provide the exact OS version and MongoDB edition you would need? I saw that you were using Amazon Linux release 2 (Karoo) but I am not sure if this is an enterprise or community edition.
  • Waiting for the release of SERVER-56347 (Currently, I am unable to provide the ETA for its release).
  • Providing us with a simple reproducer that shows the reported difference between the ARM and x86_64 architecture.

Please let us know which option works best for you.
Dima

Comment by Ivan Takarlikov [ 27/Apr/21 ]

Thanks, Dmitry! 

Should we wait for the new mongo minor release (4.4.6 for example) after the release of GCC 8.5? 

Comment by Dmitry Agranat [ 27/Apr/21 ]

Thanks ivan.takarlikov@sensortower.com for providing the requested information.

It turns out the reason you experienced this issue on the ARM instance is SERVER-56347, and we are currently waiting for the GCC 8.5 release.

You can start watching SERVER-56347 for updates and please do let us know if you have any questions.

Regards,
Dima

Comment by Ivan Takarlikov [ 26/Apr/21 ]

Attached files with logs/diagnostic.data and perf data.

Perf data was collected during the period 2021-04-26T14:24:26Z - 2021-04-26T14:31:33Z (this is also reflected in the file names).

 

[ec2-user@itunes-sales-reports-arm-db-master perf_data]$ while true; do sudo perf record -a -g -F 99 -o perf.data.$(date -u +%FT%TZ) sleep 60; done
[ perf record: Woken up 265 times to write data ]
[ perf record: Captured and wrote 74.550 MB perf.data.2021-04-26T14:24:26Z (265712 samples) ]
[ perf record: Woken up 380 times to write data ]
[ perf record: Captured and wrote 106.594 MB perf.data.2021-04-26T14:25:26Z (378341 samples) ]
[ perf record: Woken up 324 times to write data ]
[ perf record: Captured and wrote 88.790 MB perf.data.2021-04-26T14:26:27Z (315432 samples) ]
[ perf record: Woken up 379 times to write data ]
[ perf record: Captured and wrote 106.679 MB perf.data.2021-04-26T14:27:28Z (378378 samples) ]
[ perf record: Woken up 373 times to write data ]
[ perf record: Captured and wrote 103.880 MB perf.data.2021-04-26T14:28:29Z (368561 samples) ]
[ perf record: Woken up 374 times to write data ]
[ perf record: Captured and wrote 106.713 MB perf.data.2021-04-26T14:29:30Z (378749 samples) ]
[ perf record: Woken up 376 times to write data ]
[ perf record: Captured and wrote 106.685 MB perf.data.2021-04-26T14:30:32Z (378772 samples) ]
[ perf record: Woken up 124 times to write data ]
[ perf record: Captured and wrote 47.446 MB perf.data.2021-04-26T14:31:33Z (167866 samples) ]

About `Slow query` - yeah, I cleaned that out because it contains some sensitive info. But if that info is important for you, I can provide an example of those logs with the sensitive data replaced by random values while keeping the log structure.

BTW, thanks for the investigation!

 

Comment by Dmitry Agranat [ 26/Apr/21 ]

Thanks ivan.takarlikov@sensortower.com, after inspecting the provided data, I see what you mean. I have a couple of clarification points at this stage:

  • Is it possible to collect perf data during a similar event when running on the Graviton 2 instance?
  • The provided logs do not contain any "Slow query" entries; is that because all the data was redacted?

How to record perf call stack samples and generate text output:

# capture in separate files of 60 seconds each
while true; do perf record -a -g -F 99 -o perf.data.$(date -u +%FT%TZ) sleep 60; done
# then run perf script on the subset of files of interest to produce text output
for fn in ...; do perf script -i $fn >$fn.txt; done

After the perf data is collected, we will need the exact timestamp when the perf data was collected, a fresh set of diagnostic.data and mongod logs covering the time of the event.

Note that it is important to run perf script on the same node where perf.data was generated so that it can be correctly symbolized using the addresses on that machine. Also, the perf utility, which is part of the linux-tools package, is not installed by default.
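For example, on Amazon Linux 2 it can typically be installed with:

sudo yum install -y perf

while on Ubuntu/Debian it ships in the linux-tools package matching the running kernel (e.g. linux-tools-$(uname -r)).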

Comment by Ivan Takarlikov [ 23/Apr/21 ]

There are two log files and data directories from both the x86 and ARM servers. The ARM server was the primary and had been experiencing problems since Fri Apr 23 09:30:00 UTC 2021, so we switched the primary back to x86 at Fri Apr 23 15:33:10 UTC 2021 (see screenshot for details). After the switch, the request rate and nature of the queries to mongo stayed the same, but the load decreased significantly on the x86 server.

MongoDB versions are the same on both instances - 4.4.5

Comment by Dmitry Agranat [ 21/Apr/21 ]

Hi ivan.takarlikov@sensortower.com,

Would you please archive (tar or zip) the mongod.log files covering the incident and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?
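For example, assuming the default RPM locations (adjust the paths to your actual dbPath and systemLog destination):

tar -czf mongod-logs-and-diag.tar.gz /var/log/mongodb/mongod.log* /var/lib/mongo/diagnostic.data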

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Please mention the exact timestamp (start/end) and the timezone of the event you'd like us to investigate.

So that we could compare and comment on the reported degradation, please upload the requested data separately for:

  • AWS r5d.16xlarge (Intel Xeon Platinum 8000)
  • AWS r6gd.16xlarge (AWS Graviton 2)

One clarifying question at this time: was the MongoDB version the same on these two instances?

Dima

Comment by Ivan Takarlikov [ 21/Apr/21 ]

The screenshot from the description is in the attachments.

Generated at Thu Feb 08 05:38:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.