[SERVER-51104] Performance degradation on mongos 4.4.x Created: 23/Sep/20  Updated: 07/Sep/21  Resolved: 01/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.0, 4.4.1
Fix Version/s: None

Type: Bug Priority: Trivial - P5
Reporter: Žygimantas Stauga Assignee: Eric Sedor
Resolution: Incomplete Votes: 1
Labels: KP44
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Community version. GCP VMs with Ubuntu 18.04. 6x mongos, 3x config servers, 45 mongod servers (15 shards).


Attachments: PNG File CleanShot 2020-09-18 at 18.29.54.png     PNG File dns.png     File mongod-diagnostic.data.tar.gz     File mongos-mongodb.diagnostic.data.tar.gz    
Issue Links:
Related
related to SERVER-52932 Increased operation times after upgra... Closed
Operating System: ALL
Participants:

 Description   

We upgraded to 4.4.0 on Aug 25th. Right after the upgrade was complete, we noticed performance degradation across the entire system. Even queries by sharding key, which return one document from a small collection (~30k documents), started to degrade in performance. For example, we saw cases where the log entry on the mongos side says the slow query took 6-7 sec, but there is nothing on the mongod side, which means the query took less than 100ms there. We tried to find bottlenecks, temporarily resized instances, added more mongos instances, and even disabled TLS, but the results were the same. Then version 4.4.1 came out, but the upgrade didn't change anything, so we decided to downgrade to the 4.2 release. We monitored various parts of the system as the downgrade was performed: performance was back as soon as we restarted the mongos instances with the 4.2 binaries, and it stayed at the same levels as we downgraded all the shards one by one.

Interestingly, basic server metrics like CPU, load, memory, and disk activity didn't change at all during the upgrade or the downgrade. The mongos instances simply became slower on version 4.4 for some reason.

I'm not sure what metrics I can share, as we didn't find anything that shows where the problem is. We do have metric history, so if someone has an idea of which metrics could show something interesting, there is a chance we already collect them, just without monitoring them in dashboards.

The attachment shows the average latency of a query by sharding key in a sharded collection with ~30k documents.



 Comments   
Comment by Valentin Abalmasov [ 07/Sep/21 ]

What we currently see in the mongodb.log of mongos is a lot of slow DNS resolution warnings:

{"t":{"$date":"2021-09-07T11:48:08.266+00:00"},"s":"W",  "c":"NETWORK",  "id":23019,   "ctx":"establishCursors cleanup","msg":"DNS resolution while connecting to peer was slow","attr":{"peer":"dummy.example.com:27018","durationMillis":1121}}
{"t":{"$date":"2021-09-07T11:57:23.607+00:00"},"s":"W",  "c":"NETWORK",  "id":23019,   "ctx":"establishCursors cleanup","msg":"DNS resolution while connecting to peer was slow","attr":{"peer":"dummy.example.com:27018","durationMillis":1134}}
{"t":{"$date":"2021-09-07T11:57:23.607+00:00"},"s":"W",  "c":"NETWORK",  "id":23019,   "ctx":"establishCursors cleanup","msg":"DNS resolution while connecting to peer was slow","attr":{"peer":"dummy.example.com:27018","durationMillis":1134}}
{"t":{"$date":"2021-09-07T11:57:23.607+00:00"},"s":"W",  "c":"NETWORK",  "id":23019,   "ctx":"establishCursors cleanup","msg":"DNS resolution while connecting to peer was slow","attr":{"peer":"dummy.example.com:27018","durationMillis":1134}}

I hid the real DNS names, but I can confirm that we have all of those names in the /etc/hosts file and they should be resolved locally.

Comment by Valentin Abalmasov [ 07/Sep/21 ]

Hello Eric. We've experienced the same performance degradation issue when upgrading our cluster from 4.2.15-ent to 4.4.8-ent.
We are using dedicated servers in a DC running Ubuntu 20.04 and 18.04, with the following setup: 3 config servers, 15 shards (PSA), 5 mongos.

I'm attaching diagnostic.data from one mongos and from one shard's primary mongod.
Can you please provide a secure channel where I can upload the mongodb.log files? I don't think it is safe to attach them to a ticket that is publicly available.

We are also going to revert to 4.2.15-ent until we understand how this issue can be fixed.

Thanks in advance,
Valentin

Comment by Eric Sedor [ 01/Jul/21 ]

I'm going to close this ticket for now, but we are interested in re-opening it if we can obtain, for an affected mongos router and for a representative shard primary, an archive (tar or zip) covering a time period that includes the incident, containing:

  • mongod logs
  • $dbpath/diagnostic.data directory (the contents are described here)
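
For reference, one way to package those from a shell (a minimal sketch; the paths are illustrative and depend on your deployment):

# shard primary: dbpath-based diagnostic.data plus mongod logs (illustrative paths)
tar -czf shard-primary-diag.tar.gz /var/lib/mongodb/diagnostic.data /var/log/mongodb/mongod.log*
# mongos: its diagnostic data lives next to the logpath (see Dan Pasette's comment below)
tar -czf mongos-diag.tar.gz /var/log/mongodb/mongos.diagnostic.data /var/log/mongodb/mongos.log*
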
Comment by Eric Sedor [ 21/May/21 ]

zhishangyinyu@hotmail.com, ep@tribepayments.com,

We understand time can be precious. If at any point you can provide logs and diagnostic data, we are interested in investigating further.

Gratefully,
Eric

Comment by Eimantas Puskorius [ 23/Apr/21 ]

Same problem on our cluster; about 15-30% of performance was lost. Sorry for no logs, we currently do not have time for debugging.

Comment by Eric Sedor [ 20/Apr/21 ]

zhishangyinyu@hotmail.com, if you have information that can help us investigate further, we'd like to examine it in a new SERVER ticket.

For an affected mongos router and for a representative shard primary, covering a time period that includes the incident, please archive (tar or zip) and attach to the ticket:

  • mongod logs
  • $dbpath/diagnostic.data directory (the contents are described here)

Thank you!

Comment by Žygimantas Stauga [ 12/Apr/21 ]

Hi there,

Sorry about not responding. We were about to enter the Black Friday season and had no time for this issue. We downgraded to the 4.2 version, and there is no way we will risk upgrading to 4.4 again. Log files are long gone. With more people commenting on the ticket, I believe the problem is still there, whatever it is. I'm currently moving to another department and will not get back to this issue. You can close this ticket.

Comment by dong dong [ 12/Apr/21 ]

Me too! Response times increased significantly!

Comment by Dave Gotlieb [ 27/Jan/21 ]

Hi Alex, yes, that is true for the Node driver, but for the Java driver it appears not to be:
https://docs.mongodb.com/drivers/java

We are experiencing similar performance issues after the v4.4 upgrade with Java driver v3.8.


<EDIT> I just got off a call with our Atlas engineering support. Our issues are now being tracked as internal to MongoDB cluster communication itself, and not as client connections and requests placed by an incompatible driver. We have updated our Java driver to 4.1, we are now on MongoDB v4.4.3, and the hunt for the root cause continues. Sorry to mix up Node vs Java driver compatibility.

Regards,
Dave Gotlieb

Comment by Alexandru Baetu [ 26/Jan/21 ]

I checked the compatibility documentation for the Node driver, and it looks like the 3.6 version is compatible with mongod 4.4.
https://docs.mongodb.com/drivers/node/compatibility

Comment by Dave Gotlieb [ 26/Jan/21 ]

I note that Alin Dumitru mentions "The client is Node.js and uses mongodb lib version 3.6.2" when he was upgrading to v4.4.1 - per the compatibility documentation, v4.4.1 is only compatible with the 4.1 drivers.

We have been experiencing similar results when upgrading to 4.4. We are working to upgrade the Java client to 4.1, but we may still be having performance issues. We have found that some new configs are available in 4.4 and that 4.4 sends some responses back to the client in a different way, which is likely why the docs state that 4.4 requires the 4.1 client version.

All the above stated, has anyone resolved the performance issues addressed in this ticket? We have a similar support ticket open (Case #00738730) for our ongoing issues.

Comment by Bruce Lucas (Inactive) [ 13/Oct/20 ]

alin.silvian@gmail.com, can you please open a separate ticket with the details of your issue, as we should not assume at the outset that performance problems experienced by different users share the same underlying cause? When you open that ticket, can you please archive and attach $dbpath/diagnostic.data and a substantial span of logs covering the issue, together with a timeline of when you experienced the issue at the application level?

Comment by Alin Dumitru [ 13/Oct/20 ]

Hello,
I experienced a similar issue.  

  • The service that uses the upgraded DB (from 4.2.8 to 4.4.1) had its P50 response times increase from 3-4ms to 23 seconds.
  • CPU, memory, and disk metrics on the DB server do not seem to differ.
  • !! After the upgrade we could not roll back to 4.2 and had to resync from 4.2 members that were not upgraded yet.
  • I can attach a bigger snippet from the redacted logs, but they look like this; the durations are very high but < 10 seconds, so the P50 at 23 seconds is an added mystery:

{"t":{"$date":"2020-10-01T09:46:25.530+00:00"},"s":"I","c":"COMMAND","id":51803,"ctx":"conn323349","msg":"Slow query","attr":{"durationMillis":2045}}
{"t":{"$date":"2020-10-01T09:46:25.531+00:00"},"s":"I","c":"COMMAND","id":51803,"ctx":"conn510525","msg":"Slow query","attr":{"durationMillis":8544}}
{"t":{"$date":"2020-10-01T09:46:25.531+00:00"},"s":"I","c":"COMMAND","id":51803,"ctx":"conn506358","msg":"Slow query","attr":{"durationMillis":4049}}
{"t":{"$date":"2020-10-01T09:46:25.531+00:00"},"s":"I","c":"COMMAND","id":51803,"ctx":"conn512184","msg":"Slow query","attr":{"durationMillis":1045}}
{"t":{"$date":"2020-10-01T09:46:25.532+00:00"},"s":"W","c":"COMMAND","id":20525,"ctx":"conn506888","msg":"Failed to gather storage statistics for slow operation","attr":{"durationMillis":null}}
{"t":{"$date":"2020-10-01T09:46:25.532+00:00"},"s":"I","c":"COMMAND","id":51803,"ctx":"conn506888","msg":"Slow query","attr":{"durationMillis":8049}}
{"t":{"$date":"2020-10-01T09:46:25.532+00:00"},"s":"I","c":"COMMAND","id":51803,"ctx":"conn3285","msg":"Slow query","attr":{"durationMillis":1549}}

The client is Node.js and uses mongodb lib version 3.6.2 

If you need additional info please let me know. 

Thank you!

Comment by Daniel Pasette (Inactive) [ 12/Oct/20 ]

Hi zygimantas@omnisend.com, are you still experiencing the performance issue?

Comment by Bruce Lucas (Inactive) [ 25/Sep/20 ]

Thanks for the information. That may be a possibility; we'll investigate that on our end.

Meanwhile, do you have jq available, or could you easily install it? If so, would it be possible to run a script like this on the 4.4 log files?

for fn in ...; do
    # keep only the top-level fields plus attr.durationMillis; drops all user-supplied data
    jq --compact-output '{t,s,c,id,ctx,msg,attr:{durationMillis:.attr.durationMillis}}' "$fn" > "$fn.redacted"
done

This will redact the log files of all sensitive information, preserving only the top-level fields, which do not contain any user-supplied information. You can then attach them to the ticket or upload them to the secure private portal that I provided above. This will allow us to correlate the slow queries and other events in the logs with the information in the diagnostic.data that you provided.
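
As a quick sanity check before uploading, inspecting one redacted line should show only the expected top-level keys (the filename here is illustrative):

head -n 1 mongos.log.redacted | jq 'keys'
# expected: ["attr","c","ctx","id","msg","s","t"]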

Comment by Žygimantas Stauga [ 25/Sep/20 ]

Indeed, there are quite a lot of log entries like this:

{"t":{"$date":"2020-09-07T08:31:21.585+03:00"},"s":"W",  "c":"NETWORK",  "id":23019,   "ctx":"establishCursors cleanup","msg":"DNS resolution while connecting to peer was slow","attr":{"peer":"production-mongodb-c1-s6-n2-us-central1-b:27018","durationMillis":1402}}

All our hostnames are just strings like production-mongodb-c1-s6-n2-us-central1-b, and the IP addresses are populated via /etc/hosts:

172.21.1.61 production-mongodb-c1-s6-n1-us-central1-a
172.21.1.62 production-mongodb-c1-s6-n2-us-central1-b
172.21.1.63 production-mongodb-c1-s6-n3-us-central1-c

DNS lookups for the hostnames in /etc/hosts are usually very quick. I just tested on the server and it is ~10ms per lookup. Maybe MongoDB 4.4 started to react differently to such a hostname format, as it is not as usual as subdomain.domain.tld.
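
For reference, the per-lookup timing mentioned above can be reproduced with getent, which resolves through NSS and therefore consults /etc/hosts (hostname taken from the sample above):

time getent hosts production-mongodb-c1-s6-n2-us-central1-b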

Comment by Bruce Lucas (Inactive) [ 24/Sep/20 ]

zygimantas@omnisend.com, no problem, let's see how far we can get that way.

We see a couple of pieces of evidence pointing to possible issues with slow DNS in your environment, and maybe increased sensitivity to that in MongoDB 4.4.

  • The number you mention, slow queries taking 6-7 seconds, is suspicious in that we've seen DNS-related issues in the past with that signature due to a 5-second timeout in DNS operations.
  • The metrics that you uploaded showed a substantial number of slow DNS operations, dozens at a time, in semi-regular clumps spaced 1-2 minutes apart, where slow in this context means a DNS request taking more than 1 sec. Unfortunately this metric is new in 4.4, so we can't compare it with 4.2.

Can you help us investigate this theory:

  • Do you see "DNS resolution while connecting to peer was slow" messages in the mongos log while running under 4.4, and if so, are they correlated with your slow queries? Some (but not all) of the slow DNS operations reflected in the count above are also logged.
  • Can you please look for possible DNS delays using dig or another tool? Since the slowness only occurs episodically, you would need to run it repeatedly in a loop for many minutes (see the sketch below), looking for occurrences of slow lookups and/or timeouts. As the charts above show, this seems to affect all 3 mongos processes.
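
For example, a minimal polling loop along these lines (the hostname is illustrative; getent resolves through NSS, so /etc/hosts is consulted, whereas dig queries the DNS server directly):

host=production-mongodb-c1-s6-n2-us-central1-b
while true; do
    t0=$(date +%s%3N)                       # milliseconds, GNU date
    getent hosts "$host" > /dev/null
    t1=$(date +%s%3N)
    [ $((t1 - t0)) -gt 1000 ] && echo "$(date -Is) slow lookup: $((t1 - t0)) ms"
    sleep 1
done
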
Comment by Žygimantas Stauga [ 24/Sep/20 ]

If it's possible, we want to go without uploading the logs. Logs contain personal data from our customers and customers of our customers, so we can't just share that.

Comment by Bruce Lucas (Inactive) [ 24/Sep/20 ]

zygimantas@omnisend.com, would you be able to provide the mongos log files covering the time you were running 4.4? If you aren't comfortable attaching them to this ticket, you can upload them to this secure upload portal.

Comment by Žygimantas Stauga [ 24/Sep/20 ]

Ok, found it. It looks like the oldest file is from Sep 11, which should be good enough.

EDIT: for some reason, attachments aren't uploading here
https://www.dropbox.com/s/4m67hdon7ehxtak/mongos1.tar.gz?dl=0
https://www.dropbox.com/s/py45gq0wvd34g84/mongos2.tar.gz?dl=0
https://www.dropbox.com/s/46fd6ftb1exo18s/mongos3.tar.gz?dl=0

Comment by Daniel Pasette (Inactive) [ 23/Sep/20 ]

See these docs on diagnostic data for mongos (available starting in v3.6):

https://docs.mongodb.com/manual/reference/parameters/#diagnostic-parameters

~~~~~~~~~~~~~~

For mongos, the diagnostic data files, by default, are stored in a directory under the mongos instance’s --logpath or systemLog.path directory. The diagnostic data directory is computed by truncating the logpath’s file extension(s) and concatenating diagnostic.data to the remaining name.

For example, if mongos has --logpath /var/log/mongodb/mongos.log.201708015, then the diagnostic data directory is /var/log/mongodb/mongos.diagnostic.data/ directory. To specify a different diagnostic data directory for mongos, set the diagnosticDataCollectionDirectoryPath parameter.
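
For example, applying that rule from a shell (paths illustrative):

# a mongos started with --logpath /var/log/mongodb/mongos.log.201708015
# writes its FTDC to the sibling directory with the extensions stripped:
ls /var/log/mongodb/mongos.diagnostic.data/
# or relocate it explicitly at startup (plus your usual options):
mongos --logpath /var/log/mongodb/mongos.log --setParameter diagnosticDataCollectionDirectoryPath=/var/lib/mongos/diagnostic.data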

Comment by Žygimantas Stauga [ 23/Sep/20 ]

AFAIK mongos doesn't have a dbpath.
https://docs.mongodb.com/manual/reference/configuration-options/#storage.dbPath ("The storage.dbPath setting is available only for mongod.")
A file search on disk didn't find anything named "diagnostic.data". Either way, I looked into a random shard's diagnostic.data, and the oldest file was from Sep 18. The downgrade was performed on Sep 16, so we are outside the window.
In theory, I can upgrade one or two mongos instances back to 4.4 and gather the required data, but I still need to know how to configure mongos to store diagnostic.data.

Comment by Eric Sedor [ 23/Sep/20 ]

zygimantas@omnisend.com, as long as the downgrade was performed in place and occurred within the last few days, the data Dan describes should still be there and should contain data for both versions.

Can you tar or zip the $dbpath/diagnostic.data directory for a mongos and attach it to this ticket? This is the information we'd need to investigate.

Gratefully,
Eric

Comment by Žygimantas Stauga [ 23/Sep/20 ]

The entire cluster is already downgraded. The shards themselves weren't affected; mongos was. As soon as we restarted mongos with the old binaries, the degradation was gone, while the shards at that point were still running 4.4.x.

Comment by Daniel Pasette (Inactive) [ 23/Sep/20 ]

Can you attach the FTDC and log files for the primary of one of the shards that was impacted? Hopefully we'll be able to see a comparison of the stats between 4.4 and 4.2.
