[SERVER-52932] Increased operation times after upgrade to 4.4.1 Created: 18/Nov/20  Updated: 02/Jun/22  Resolved: 17/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.1
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Alin Dumitru Assignee: Bruce Lucas (Inactive)
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Docker containers on Ubuntu 16.04 AWS VMs


Attachments: PNG File Screenshot 2020-11-19 at 17.56.04.png     File diag.data.tgz     File diagnostic.tgz     PNG File lock.png     File mongod-diagnostic.data.tar.gz     File mongos-mongodb.diagnostic.data.tar.gz    
Issue Links:
Related
is related to SERVER-51104 Performance degradation on mongos 4.4.x Closed
Operating System: ALL
Participants:

 Description   

We upgraded 2 replica sets from 4.2.8 to 4.4.1 and saw immediate performance degradation: query times went from milliseconds to seconds.

  • The service that uses the upgraded DB (from 4.2.8 to 4.4.1) saw its P50 response time increase from 3-4 ms to 23 seconds.
  • CPU, memory, and disk metrics on the DB server do not seem to differ.
  • After the upgrade we could not roll back to 4.2 and had to resync from 4.2 members that had not been upgraded yet (see the note below).
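
A side note on the rollback point above: one common reason a 4.4-to-4.2 binary downgrade is blocked is that featureCompatibilityVersion (FCV) was already raised to "4.4". That is only an assumption; the ticket does not say why the rollback failed. A minimal mongo shell sketch for checking FCV before attempting a downgrade:

// Assumption: the failed rollback may have been caused by FCV already being "4.4";
// the ticket does not state the actual reason.
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })
// Downgrading binaries to 4.2 generally requires lowering FCV first (left commented out on purpose):
// db.adminCommand({ setFeatureCompatibilityVersion: "4.2" })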

 

{"t":\{"$date":"2020-10-01T09:46:25.530+00:00"}

,"s":"I","c":"COMMAND","id":51803,"ctx":"conn323349","msg":"Slow query","attr":{"durationMillis":2045}}

{"t":\{"$date":"2020-10-01T09:46:25.531+00:00"}

,"s":"I","c":"COMMAND","id":51803,"ctx":"conn510525","msg":"Slow query","attr":{"durationMillis":8544}}

{"t":\{"$date":"2020-10-01T09:46:25.531+00:00"}

,"s":"I","c":"COMMAND","id":51803,"ctx":"conn506358","msg":"Slow query","attr":{"durationMillis":4049}}

{"t":\{"$date":"2020-10-01T09:46:25.531+00:00"}

,"s":"I","c":"COMMAND","id":51803,"ctx":"conn512184","msg":"Slow query","attr":{"durationMillis":1045}}

{"t":\{"$date":"2020-10-01T09:46:25.532+00:00"}

,"s":"W","c":"COMMAND","id":20525,"ctx":"conn506888","msg":"Failed to gather storage statistics for slow operation","attr":{"durationMillis":null}}

{"t":\{"$date":"2020-10-01T09:46:25.532+00:00"}

,"s":"I","c":"COMMAND","id":51803,"ctx":"conn506888","msg":"Slow query","attr":{"durationMillis":8049}}

{"t":\{"$date":"2020-10-01T09:46:25.532+00:00"}

,"s":"I","c":"COMMAND","id":51803,"ctx":"conn3285","msg":"Slow query","attr":{"durationMillis":1549}}

I'll attach diagnostic.data from the runs with version 4.4.1.

The client is Node.js and uses the mongodb library, version 3.6.2.

If you need additional info please let me know. 
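
For reference, the "Slow query" lines above are log id 51803, which is emitted for operations exceeding the slow-operation threshold (100 ms by default). A minimal mongo shell sketch, assuming default profiler settings, for checking and optionally adjusting that threshold:

// Hedged sketch: inspect the slow-operation threshold that drives the
// "Slow query" (id 51803) log lines, and optionally change it.
db.getProfilingStatus()        // returns { was: <level>, slowms: <ms>, ... }
db.setProfilingLevel(0, 200)   // keep profiling off, raise the logging threshold to 200 ms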



 Comments   
Comment by Bruce Lucas (Inactive) [ 17/May/22 ]

aljursav@gmail.com, valentin@plarium.com - sorry for the delay in responding. Can you please open separate tickets with supporting data, as it is possible you are encountering a different issue, and it helps us to keep each ticket specific to one issue.

alin.silvian@gmail.com it has been some time since we've heard from you, so I'll close this ticket as "incomplete". Feel free to re-open it or open a new ticket with the requested data if you are still interested in pursuing this issue.

Comment by Alex Savelev [ 31/Jan/22 ]

Good day!
We are facing the same problem on version 4.4.8.
Logs:

{"t":

{"$date":"2022-01-31T14:08:13.591+02:00"}

,"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn11144","msg":"Slow query","attr":{"type":"command","ns":"Viss_DAIRM2.auditLogEntries","command":{"insert":"auditLogEntries","ordered":true,"bypassDocumentValidation":false,"writeConcern":

{"w":"majority","j":true}

,"txnNumber":27,"$db":"Viss_DAIRM2","lsid":{"id":{"$uuid":"3b5d4fa2-fa00-4c5b-afd3-0b0a0afb0191"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1643630891,"i":6}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}}},"ninserted":1,"keysInserted":3,"numYields":0,"reslen":230,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":2}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"w":2}},"Database":{"acquireCount":{"w":2}},"Collection":{"acquireCount":{"w":2}},"Mutex":{"acquireCount":

{"r":2}

}},"flowControl":{"acquireCount":1,"timeAcquiringMicros":1},"writeConcern":{"w":"majority","j":true,"wtimeout":0,"provenance":"clientSupplied"},"storage":{"data":{"bytesRead":8461,"timeReadingMicros":46369}},"protocol":"op_msg","durationMillis":100}}

Is there a known solution to this problem?
Thanks!

Comment by Valentin Abalmasov [ 07/Sep/21 ]

What we currently see in the mongos mongodb.log is a lot of slow DNS resolution warnings:

{"t":{"$date":"2021-09-07T11:48:08.266+00:00"},"s":"W",  "c":"NETWORK",  "id":23019,   "ctx":"establishCursors cleanup","msg":"DNS resolution while connecting to peer was slow","attr":{"peer":"dummy.example.com:27018","durationMillis":1121}}
{"t":{"$date":"2021-09-07T11:57:23.607+00:00"},"s":"W",  "c":"NETWORK",  "id":23019,   "ctx":"establishCursors cleanup","msg":"DNS resolution while connecting to peer was slow","attr":{"peer":"dummy.example.com:27018","durationMillis":1134}}
{"t":{"$date":"2021-09-07T11:57:23.607+00:00"},"s":"W",  "c":"NETWORK",  "id":23019,   "ctx":"establishCursors cleanup","msg":"DNS resolution while connecting to peer was slow","attr":{"peer":"dummy.example.com:27018","durationMillis":1134}}
{"t":{"$date":"2021-09-07T11:57:23.607+00:00"},"s":"W",  "c":"NETWORK",  "id":23019,   "ctx":"establishCursors cleanup","msg":"DNS resolution while connecting to peer was slow","attr":{"peer":"dummy.example.com:27018","durationMillis":1134}}

I hid the real DNS names, but I can confirm that all of those names are in the /etc/hosts file and should be resolved locally.
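
If it helps narrow this down, a hedged Node.js sketch that times a lookup through dns.lookup (getaddrinfo, which also consults /etc/hosts); "dummy.example.com" below is just the redacted placeholder from the log above, so substitute a real peer hostname:

// Hedged sketch (Node.js): time hostname resolution via dns.lookup (getaddrinfo),
// which consults /etc/hosts, to check whether the slowness is in the resolver itself.
// "dummy.example.com" is the redacted placeholder name; replace with a real peer.
const dns = require('dns');
const start = process.hrtime.bigint();
dns.lookup('dummy.example.com', function (err, address) {
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(err ? 'lookup failed: ' + err.message
                  : 'resolved to ' + address + ' in ' + ms.toFixed(1) + ' ms');
});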

Comment by Valentin Abalmasov [ 07/Sep/21 ]

Hi guys. We are experiencing the same performance degradation after upgrading from 4.2.15-ent to 4.4.8-ent.

I have uploaded the required mongodb.log and diagnostic.data files to the related ticket https://jira.mongodb.org/browse/SERVER-51104. Please tell me if I can help diagnose the problem further.

I'm attaching diagnostic data from the mongos and the primary mongod.
Please provide a channel where I can upload the mongodb.log files, because I don't want them to be publicly available.

 

Thanks in advance,
Valentin

Comment by Alin Dumitru [ 22/Apr/21 ]

Hi. The diagnostic data was from a node showing the issue. Unfortunately I have not performed a new test because this can only be reproduced on heavily loaded production DB servers, and we had issues with the rollback the first time, which resulted in extended downtime. I do plan to repeat the test because, as far as I can tell, this affects versions at least as recent as 4.4.4. I'll get back to you as soon as I have a safe setup for this scenario.

Comment by Eric Sedor [ 20/Apr/21 ]

Hi alin.silvian@gmail.com, I wanted to follow up on bruce.lucas's last request. Are you able to provide any of the requested information?

Comment by Bruce Lucas (Inactive) [ 19/Nov/20 ]

Thanks. I do see some subtle changes in metrics at the time when you saw the increased latency, but nothing to suggest to me why you might be seeing a latency tail of tens of seconds (assuming I'm reading your chart correctly).

Some more questions:

  • Can you confirm that the diagnostic data that you attached is from a node showing the issue?
  • Are you able to attach the log file for the problematic node so we can see specifically which queries are slow? (If you don't want to share it on this public ticket you can upload it to this secure private portal.) See the sketch after this list for locating the relevant files.
  • Can you provide comparison data for node(s) that had comparable load but did not experience the issue, both for 4.2 and other 4.4.2 nodes if applicable?
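
A hedged mongo shell sketch for locating the files requested above on the problematic node (the fields are only present when they were set via the config file or command line):

// Hedged sketch: locate the log file and the dbPath (which contains diagnostic.data)
// on the node being reported on; fields appear only if explicitly configured.
var parsed = db.adminCommand({ getCmdLineOpts: 1 }).parsed || {};
print("log file:        " + ((parsed.systemLog || {}).path || "(logging to stdout / not configured)"));
var dbPath = (parsed.storage || {}).dbPath || "(default dbPath)";
print("diagnostic.data: " + dbPath + "/diagnostic.data");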
Comment by Alin Dumitru [ 19/Nov/20 ]

Sure. It’s GMT + 2

Comment by Bruce Lucas (Inactive) [ 19/Nov/20 ]

Thanks, I will take a look and let you know. Can you tell me what the timezone for that chart is?

Comment by Alin Dumitru [ 19/Nov/20 ]

Hello, 
First of all thank you for the quick response yesterday!
I have upgraded some servers from a replica set to 4.4.2. It seems to work for some of our replicas, but after hitting a certain number of ops/sec I started seeing increased p90 and p99 times on 2 of them. I've attached new diagnostic.data and a screenshot from our monitoring that shows the increasing times.

diag.data.tgz

They reverted to normal after switching back to 4.2.8.
Is this the same issue or should I create a new ticket?
Thank you!

Comment by Bruce Lucas (Inactive) [ 18/Nov/20 ]

alin.silvian@gmail.com the metrics show symptoms of a problematic lock in the storage layer:

Starting from A we see a large number of threads waiting on the "read timestamp queue lock", accompanied by a high rate of context switches and system cpu utilization, which is characteristic of an inefficient spin lock.

We addressed this problem in 4.4.2 with SERVER-51041; would you be able to try again with 4.4.2 and let us know whether it fixes your problem? If not please attach another archive of diagnostic.data.
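
For anyone checking their own nodes for this symptom, a hedged mongo shell sketch that filters serverStatus() WiredTiger lock statistics for read-timestamp-queue waits; exact statistic names vary between versions, so it only prints whatever matching entries are present:

// Hedged sketch: look for read-timestamp-queue lock waits in WiredTiger's lock
// statistics; names vary by MongoDB/WiredTiger version, so just filter for any
// "read timestamp" entries that exist.
var wt = db.serverStatus().wiredTiger || {};
var lockStats = wt.lock || {};
Object.keys(lockStats)
      .filter(function (k) { return /read timestamp/i.test(k); })
      .forEach(function (k) { print(k + ": " + tojson(lockStats[k])); });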
