[SERVER-51104] Performance degradation on mongos 4.4.x Created: 23/Sep/20 Updated: 07/Sep/21 Resolved: 01/Jul/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.0, 4.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Trivial - P5 |
| Reporter: | Žygimantas Stauga | Assignee: | Eric Sedor |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | KP44 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Community version. GCP VMs with Ubuntu 18.04. 6x mongos, 3x config servers, 45 mongod servers (15 shards). |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Participants: | |||||||||
| Description |
|
We upgraded to 4.4.0 on Aug 25th. Right after the upgrade completed, we noticed performance degradation across the entire system. Even queries by shard key that return a single document from a small collection (~30k documents) became slower. For example, we saw cases where the mongos log reported a slow query taking 6-7 sec while the mongod side logged nothing, which means the query took less than 100 ms there. We looked for bottlenecks, temporarily resized instances, added more mongos instances, and even disabled TLS, but the results were the same. Then 4.4.1 came out, but upgrading to it didn't change anything, so we decided to downgrade to the 4.2 release. This time we monitored various parts of the system while the downgrade was performed. Performance recovered as soon as we restarted the mongos instances with the 4.2 binaries, and it stayed at the same level as we downgraded the shards one by one. Interestingly, basic server metrics such as CPU, load, memory, and disk activity didn't change at all during the upgrade or the downgrade; only the mongos instances became slower on 4.4, for some reason. I'm not sure which metrics to share, as we didn't find anything that shows where the problem is. We do have metric history, so if someone has an idea of which metrics could show something interesting, there is a chance we have them even though we don't chart them in dashboards. The attachment shows the average latency for a query by shard key on a sharded collection with ~30k documents. |
| Comments |
| Comment by Valentin Abalmasov [ 07/Sep/21 ] | |||||||
|
What we currently see in the mongos log (mongodb.log) is a lot of slow DNS resolution messages.
I hid the real DNS names, but I can confirm that all of those names are in the /etc/hosts file and should be resolved locally. | |||||||
| Comment by Valentin Abalmasov [ 07/Sep/21 ] | |||||||
|
Hello Eric. We experienced the same performance degradation when we upgraded our cluster from 4.2.15-ent to 4.4.8-ent. I'm attaching diagnostic.data from one mongos and from one shard's primary mongod. We are also going to revert to 4.2.15-ent until we understand how this issue can be fixed.
Thanks in advance,
| |||||||
| Comment by Eric Sedor [ 01/Jul/21 ] | |||||||
|
I'm going to close this ticket for now. But, we are interested in re-opening it if, for an affected mongos router and for a representative shard primary spanning a time period that includes the incident, we can obtain an archive (tar or zip) of:
| |||||||
| Comment by Eric Sedor [ 21/May/21 ] | |||||||
|
zhishangyinyu@hotmail.com, ep@tribepayments.com, We understand time can be precious. If at any point you can provide logs and diagnostic data, we are interested in investigating further. Gratefully, | |||||||
| Comment by Eimantas Puskorius [ 23/Apr/21 ] | |||||||
|
Same problem on our cluster: about 15-30% of performance was lost. Sorry for the lack of logs; we currently do not have time for debugging. | |||||||
| Comment by Eric Sedor [ 20/Apr/21 ] | |||||||
|
zhishangyinyu@hotmail.com, if you have information that can help us investigate further, we'd like to examine it in a new SERVER ticket. For an affected mongos router and for a representative shard primary spanning a time period that includes the incident, please archive (tar or zip) and attach to the ticket the:
Thank you! | |||||||
| Comment by Žygimantas Stauga [ 12/Apr/21 ] | |||||||
|
Hi there, Sorry about not responding. We were about to enter the Black Friday season and had no time for this issue. We downgraded to the 4.2 version, and there is no way we will risk upgrading to 4.4 again. Log files are long gone. With more people commenting on the ticket, I believe the problem is still there, whatever it is. I'm currently moving to another department and will not get back to this issue. You can close this ticket. | |||||||
| Comment by dong dong [ 12/Apr/21 ] | |||||||
|
me too! response times increase significantly! | |||||||
| Comment by Dave Gotlieb [ 27/Jan/21 ] | |||||||
|
Hi Alex, yes, that is true, but for the Java driver it appears not. We are experiencing similar performance issues after the v4.4 upgrade with Java driver v3.8: https://docs.mongodb.com/drivers/java
<EDIT> I just got off a call with our Atlas engineering support. Our issues are now being tracked as internal to MongoDB cluster communication itself, and not as a problem with client connections and requests placed by an incompatible driver. We have updated our Java driver to 4.1, we are now on MongoDB v4.4.3, and the hunt for the root cause continues. Sorry for mixing up Node and Java driver compatibility.
Regards, | |||||||
| Comment by Alexandru Baetu [ 26/Jan/21 ] | |||||||
|
I checked the compatibility documentation for the Node driver, and it looks like version 3.6 is compatible with mongod 4.4. | |||||||
| Comment by Dave Gotlieb [ 26/Jan/21 ] | |||||||
|
I note that Alin Dumitru mentions "The client is Node.js and uses mongodb lib version 3.6.2" when he was upgrading to v4.4.1; per the compatibility documentation, v4.4.1 is only compatible with the 4.1 drivers. We have been experiencing similar results when upgrading to 4.4. We are working on upgrading the Java client to 4.1, but we may still have performance issues. We have found that some new configs are available in 4.4 and that 4.4 sends some responses back to the client in a different way, which is likely why the docs state that 4.4 requires the 4.1 client version. With all of the above stated, has anyone resolved the performance issues addressed in this ticket? We have a similar support ticket open, Case #00738730, for our ongoing issues. | |||||||
| Comment by Bruce Lucas (Inactive) [ 13/Oct/20 ] | |||||||
|
alin.silvian@gmail.com, can you please open a separate ticket with the details of your issue, as we should not assume at the outset that performance problems experienced by different users are the same underlying issue. When you open that ticket, can you please archive and attach $dbpath/diagnostic.data and a substantial span of logs covering the issue, together with a timeline of when you experienced the issue at the application level. | |||||||
| Comment by Alin Dumitru [ 13/Oct/20 ] | |||||||
|
Hello,
The client is Node.js and uses mongodb lib version 3.6.2. If you need additional info, please let me know. Thank you! | |||||||
| Comment by Daniel Pasette (Inactive) [ 12/Oct/20 ] | |||||||
|
Hi zygimantas@omnisend.com, are you still experiencing the performance issue? | |||||||
| Comment by Bruce Lucas (Inactive) [ 25/Sep/20 ] | |||||||
|
Thanks for the information. That may be a possibility; we'll investigate that on our end. Meanwhile, do you have jq available, or could you easily install it? If so, would it be possible to run a script like this on the 4.4 log files?
This will redact the log files of all sensitive information, preserving only the top-level fields which will not have any user-supplied information. You can then attach to the ticket or upload to the secure private portal that I provided above. This will allow us to correlate the slow queries and other events in the logs with the information in the diagnostic.data that you provided. | |||||||
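The jq script itself is not preserved in this export. As a rough illustration only (not the script that was posted), the following Python sketch shows the general idea of such a redaction. It assumes the 4.4 structured log format, where each line is a JSON document with the top-level keys `t`, `s`, `c`, `id`, `ctx`, `msg`, and `attr`, and it assumes that dropping `attr` entirely is an acceptable level of redaction. The function names `redact_line` and `redact_file` are hypothetical.

```python
import json
from typing import Optional

# Top-level keys of a 4.4 structured log entry that are assumed here to be
# free of user-supplied data. "attr" is deliberately left out.
SAFE_KEYS = ("t", "s", "c", "id", "ctx", "msg")


def redact_line(line: str) -> Optional[str]:
    """Return a redacted JSON log line, or None if the line is not a JSON object."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(entry, dict):
        return None
    redacted = {key: entry[key] for key in SAFE_KEYS if key in entry}
    return json.dumps(redacted)


def redact_file(src_path: str, dst_path: str) -> int:
    """Write a redacted copy of src_path to dst_path; return the number of lines kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            redacted = redact_line(line)
            if redacted is not None:
                dst.write(redacted + "\n")
                kept += 1
    return kept
```

Lines that are not valid JSON objects are dropped rather than copied, so that unparsed text cannot leak into the redacted file. Review the output before uploading it anywhere.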
| Comment by Žygimantas Stauga [ 25/Sep/20 ] | |||||||
|
Indeed, there are quite a lot of log entries like this:
All our hostnames are just a string like production-mongodb-c1-s6-n2-us-central1-b and IP addresses are populated via /etc/hosts
DNS lookups for the hostnames in /etc/hosts are usually very quick. I just tested on the server and it is ~10 ms per lookup. Maybe MongoDB 4.4 started to react differently to this hostname format, as it is less common than subdomain.domain.tld. | |||||||
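A per-lookup timing like the one described above can be reproduced with a short script. This is a minimal sketch, assuming that the system resolver (which consults /etc/hosts according to the host's resolver configuration) is a reasonable stand-in for what mongos does; that equivalence is an assumption, not something established in this ticket. The function name `time_lookup_ms` and its `resolver` parameter are hypothetical.

```python
import socket
import time
from typing import Callable, Optional


def time_lookup_ms(hostname: str,
                   resolver: Callable = socket.getaddrinfo) -> Optional[float]:
    """Return the time in milliseconds taken to resolve hostname, or None on failure.

    The resolver is injectable so the timing logic can be exercised
    without touching the network.
    """
    start = time.perf_counter()
    try:
        resolver(hostname, None)
    except socket.gaierror:
        return None
    return (time.perf_counter() - start) * 1000.0


if __name__ == "__main__":
    # Hostname taken from the example in this comment; replace with your own.
    for name in ["production-mongodb-c1-s6-n2-us-central1-b"]:
        elapsed = time_lookup_ms(name)
        if elapsed is None:
            print(f"{name}: resolution failed")
        else:
            print(f"{name}: {elapsed:.2f} ms")
```

Running it several times per hostname and looking at the spread, not only the average, makes occasional slow lookups easier to spot.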
| Comment by Bruce Lucas (Inactive) [ 24/Sep/20 ] | |||||||
|
zygimantas@omnisend.com, no problem, let's see how far we can get that way. We see a couple of pieces of evidence pointing to possible issues with slow DNS in your environment, and maybe increased sensitivity to that in MongoDB 4.4.
Can you help us investigate this theory:
| |||||||
| Comment by Žygimantas Stauga [ 24/Sep/20 ] | |||||||
|
If it's possible, we want to go without uploading the logs. Logs contain personal data from our customers and customers of our customers, so we can't just share that. | |||||||
| Comment by Bruce Lucas (Inactive) [ 24/Sep/20 ] | |||||||
|
zygimantas@omnisend.com, would you be able to provide the mongos log files covering the time you were running 4.4? If you aren't comfortable attaching them to this ticket, you can upload them to this secure upload portal. | |||||||
| Comment by Žygimantas Stauga [ 24/Sep/20 ] | |||||||
|
Ok, found it. Looks like the oldest file is from Sep 11, should be good enough. EDIT: for some reason, attachments aren't uploading here | |||||||
| Comment by Daniel Pasette (Inactive) [ 23/Sep/20 ] | |||||||
|
See these docs on diagnostic data for mongos (available starting in v3.6): https://docs.mongodb.com/manual/reference/parameters/#diagnostic-parameters
~~~~~~~~~~~~~~
For mongos, the diagnostic data files, by default, are stored in a directory under the mongos instance’s --logpath or systemLog.path directory. The diagnostic data directory is computed by truncating the logpath’s file extension(s) and concatenating diagnostic.data to the remaining name. For example, if mongos has --logpath /var/log/mongodb/mongos.log.201708015, then the diagnostic data directory is /var/log/mongodb/mongos.diagnostic.data/ directory. To specify a different diagnostic data directory for mongos, set the diagnosticDataCollectionDirectoryPath parameter. | |||||||
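The naming rule quoted above can be sketched in a few lines of Python to work out where to look on a given host. This assumes that "truncating the file extension(s)" means dropping everything after the first dot in the file name, which matches the quoted example but is not confirmed beyond it. The function name `mongos_diagnostic_dir` is hypothetical.

```python
from pathlib import PurePosixPath


def mongos_diagnostic_dir(logpath: str) -> str:
    """Derive the default mongos diagnostic data directory from a --logpath value."""
    path = PurePosixPath(logpath)
    # Drop every extension: "mongos.log.201708015" -> "mongos"
    stem = path.name.split(".", 1)[0]
    return str(path.parent / f"{stem}.diagnostic.data")


print(mongos_diagnostic_dir("/var/log/mongodb/mongos.log.201708015"))
# prints /var/log/mongodb/mongos.diagnostic.data
```

If diagnosticDataCollectionDirectoryPath is set, that value takes precedence and this derivation does not apply.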
| Comment by Žygimantas Stauga [ 23/Sep/20 ] | |||||||
|
AFAIK mongos doesn't have dbpath. | |||||||
| Comment by Eric Sedor [ 23/Sep/20 ] | |||||||
|
zygimantas@omnisend.com, as long as the downgrade was performed in place, the data Dan describes should still be there. As long as the downgrade occurred within a few days, it should contain data for both versions. Can you tar or zip the $dbpath/diagnostic.data directory for a mongos and attach it to this ticket? This is the information we'd need to investigate. Gratefully, | |||||||
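For anyone preparing the requested archive, a plain `tar` or `zip` command is enough. The following Python sketch does the same thing and lists what was packed, so the contents can be checked before attaching. It assumes the diagnostic data directory has already been located; both paths are supplied by the caller, and the function name `archive_diagnostic_data` is hypothetical.

```python
import tarfile
from pathlib import Path
from typing import List


def archive_diagnostic_data(diagnostic_dir: str, archive_path: str) -> List[str]:
    """Pack diagnostic_dir into a gzip-compressed tar file and return the member names."""
    source = Path(diagnostic_dir)
    if not source.is_dir():
        raise NotADirectoryError(f"{source} is not a directory")
    # Store entries under the directory's own name rather than its full path.
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(str(source), arcname=source.name)
    # Re-open the archive to report what it actually contains.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

Writing the archive outside the directory being packed avoids including a partial copy of the archive in itself.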
| Comment by Žygimantas Stauga [ 23/Sep/20 ] | |||||||
|
The entire cluster is already downgraded. The shards themselves weren't affected; mongos was. As soon as we restarted mongos with the old binaries, the degradation was gone. The shards at that point were still running 4.4.x. | |||||||
| Comment by Daniel Pasette (Inactive) [ 23/Sep/20 ] | |||||||
|
Can you attach the FTDC data and log files for the primary of one of the shards that was impacted? Hopefully we’ll be able to see a comparison of the stats between 4.4 and 4.2. |