[SERVER-72978] Random high query response and very high CPU when running a replica set with a mix of 4.2.20 and 4.4.18 Created: 18/Jan/23  Updated: 17/Mar/23

Status: Investigating
Project: Core Server
Component/s: Performance, Querying, WiredTiger
Affects Version/s: 4.2.20, 4.4.18
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: KAPIL GUPTA Assignee: Chris Kelly
Resolution: Unresolved Votes: 25
Labels: Bug
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Java client driver - 3.12.9
mongo version - 4.2.20 (for 2 data members)
mongo version - 4.4.18 (for 2 data members)
Storage Engine - WT
Storage Type - tmpfs
Replica set: 7 members (4 non-arbiter and 3 arbiter, all voting members)
Example mongod command for one of the members:
mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27021 --dbpath=/data/db/wt-27021 --replSet rs-shard-2 --quiet --slowms 500 --logpath /data/db/mongo-27021.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40


Issue Links:
Related
is related to SERVER-67217 High query response on primary (3-4 s... Closed
Operating System: ALL
Steps To Reproduce:

1- Run 2 data members with 4.2.20 and 2 data members with 4.4.18.

2- Put medium-to-high load (reads/writes) on the replica set and observe (a load-generator sketch follows below).
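The environment above lists the Java client driver (3.12.9), so a minimal load-generator sketch in that driver is given below. It is only an illustration of the kind of mixed read/write workload described in step 2: the host names, replica set URI, database name ("testdb"), collection name ("load"), document shape, and thread count are assumptions, not details taken from this ticket.

    import com.mongodb.MongoClient;
    import com.mongodb.MongoClientURI;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.Random;

    public class LoadGen {
        public static void main(String[] args) {
            // Placeholder connection string: point it at the mixed-version replica set
            // (hosts, port, and replica set name here are illustrative).
            MongoClient client = new MongoClient(new MongoClientURI(
                    "mongodb://host1:27021,host2:27021,host3:27021,host4:27021/?replicaSet=rs-shard-2"));
            MongoCollection<Document> coll =
                    client.getDatabase("testdb").getCollection("load");

            // Pre-built payload so each insert carries a modest amount of data.
            final String payload = new String(new char[512]).replace('\0', 'x');

            Runnable worker = () -> {
                Random rnd = new Random();
                while (true) {
                    int key = rnd.nextInt(100000);
                    // Alternate a write and a read to keep a steady mixed workload.
                    coll.insertOne(new Document("key", key).append("payload", payload));
                    coll.find(new Document("key", key)).first();
                }
            };

            // 16 threads is an arbitrary starting point for "medium to high" load;
            // scale the count up or down while watching CPU on the primary.
            for (int i = 0; i < 16; i++) {
                new Thread(worker).start();
            }
        }
    }

With the default primary read preference, both the reads and the writes land on the primary, which is where the high CPU was reported.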

Participants:

 Description   

Scenario - We ran tests with the same load on two mongo versions (4.2.20 and 4.4.18).
We found high query response times (in the range of 3-4 seconds) and very high CPU (100%) on the primary for the 4.2.20 version, but no issue was found on 4.4.18.

  • As suggested in SERVER-67217, we also collected disk usage, but we did not find any high disk reads/writes at the time of the issue.
  • In SERVER-67217, we ran the same test between 4.2.20 and 4.0.27, and the issue was still found on the 4.2.20 primary members only.
  • Kindly refer to SERVER-67217 for more details about the issue.
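For reference, a client-side probe along the following lines can confirm the 3-4 second query responses independently of the server logs. This is only a sketch: the URI, database ("testdb"), and collection ("load") names are assumptions rather than details from the ticket, and the 500 ms threshold simply mirrors the --slowms 500 setting shown in the environment.

    import com.mongodb.MongoClient;
    import com.mongodb.MongoClientURI;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class LatencyProbe {
        public static void main(String[] args) {
            // Placeholder URI; replace with the replica set that shows the problem.
            MongoClient client = new MongoClient(new MongoClientURI(
                    "mongodb://host1:27021/?replicaSet=rs-shard-2"));
            MongoCollection<Document> coll =
                    client.getDatabase("testdb").getCollection("load");

            for (int i = 0; i < 1000; i++) {
                long start = System.nanoTime();
                coll.find(new Document("key", i)).first();
                long elapsedMs = (System.nanoTime() - start) / 1000000;
                // Queries in the 3-4 second range match the behaviour reported above.
                if (elapsedMs > 500) {
                    System.out.println("slow query #" + i + ": " + elapsedMs + " ms");
                }
            }
            client.close();
        }
    }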

Note: Please provide a cloud upload link so we can upload the respective logs.



 Comments   
Comment by KAPIL GUPTA [ 17/Mar/23 ]

Gentle Reminder 2!

 

Regards,

Kapil

Comment by KAPIL GUPTA [ 03/Mar/23 ]

Hi Chris,

Gentle Reminder!

Did you get anything on this?

 

Regards,

Kapil

Comment by KAPIL GUPTA [ 23/Feb/23 ]

Hi Chris,

I have uploaded the logs. Please find the logs file detail given below.

Parent file name: Logs.zip

child files:

cps@vpas-B-persistence-db-9:~$ tar -tf PrimaryLogs.tar.gz
diagnostic.data/
diagnostic.data/metrics.interim
diagnostic.data/metrics.2023-02-22T11-44-24Z-00000
journalctlPrimary.logs
kernalDmesgLogs
mongo-27040.log 
mongo-27040.log.1
mongo-27040.log_verbose

NOTE: High CPU is seen multiple times (almost every 2 minutes) from 22.02.23 12:45:00 to 22.02.23 13:45:00 (covered in mongo-27040.log and mongo-27040.log.1). I also enabled higher log verbosity from 2023-02-22T13:09:21.295+0000 onwards (captured in mongo-27040.log_verbose) to get more detailed logs.

cps@vpas-B-persistence-db-10:~$ tar -tf Secondary4.2Logs.tar.gz
diagnostic.data/
diagnostic.data/metrics.interim
diagnostic.data/metrics.2023-02-22T13-09-06Z-00000
diagnostic.data/metrics.2023-02-22T04-56-59Z-00000
mongo-27040.log

cps@vpas-A-persistence-db-10:~$ sudo tar -tf Secondary4.4Logs.tar.gz
diagnostic.data/
diagnostic.data/metrics.interim
diagnostic.data/metrics.2023-02-22T09-59-46Z-00000
diagnostic.data/metrics.2023-02-22T00-13-58Z-00000
mongo-27040.log

cps@vpas-A-persistence-db-9:~$ sudo tar -tf Secondary_2_4.4Logs.tar.gz
diagnostic.data/
diagnostic.data/metrics.interim
diagnostic.data/metrics.2023-02-22T00-14-27Z-00000
diagnostic.data/metrics.2023-02-22T09-59-46Z-00000
mongo-27040.log

Thanks,

Kapil

Comment by Chris Kelly [ 22/Feb/23 ]

Hi Kapil,

I've created a new secure upload portal link for you. 

Christopher

Comment by KAPIL GUPTA [ 22/Feb/23 ]

Hi Chris,

Sorry for the delay; we lost our previous logs. We have now recreated the issue, but unfortunately the upload link has expired.

As per your suggestion, we also checked the feasibility of having lower-version secondaries, but here the scenario is: 1 primary and 1 secondary on 4.2, and 2 secondaries on 4.4. We also tested the same with SERVER-67217: 1 primary and 1 secondary on 4.0, and 2 secondaries on 4.2.

Each time, we saw the high CPU issue on the 4.2 primary member only.

Kindly provide a new upload link so we can upload the logs.

 

Thanks,
Kapil.

 

Comment by Chris Kelly [ 08/Feb/23 ]

We still need additional information to diagnose the problem. If this is still an issue for you, would you please supply the requested information if possible?

Also, just as an FYI, it seems you are pointing out that you are observing latency on a higher-version primary node when you have lower-version secondaries. Per the upgrade steps, you should upgrade the secondaries to the higher version first, and then finish your upgrade by stepping down the primary (which should be the last node to be upgraded).

Effectively, this sounds like it should mitigate your problem. If it does not, please provide further detail regarding this process (and the requested information above). Thanks!

Comment by Chris Kelly [ 19/Jan/23 ]

Hi Kapil,

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Additionally, if you have a test driver that reproduces the workload you're having issues with, that would be significantly helpful in pinning down what may be occurring here.

Christopher
