[SERVER-75881] mongod 4.4 CPU hogs/stalls leading to client queries timedout Created: 08/Apr/23  Updated: 31/May/23  Resolved: 31/May/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: venkataramans rama Assignee: Yuan Fang
Resolution: Done Votes: 24
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File 4.0_Primary_Startup_mongodb-27717.log     File 4.0_parameters     File 4.2_parameters     PNG File Screenshot 2023-04-10 at 5.52.06 PM.png     PNG File Screenshot 2023-04-11 at 4.31.44 PM.png     Text File clientlog.log     JPEG File clientsettings.jpg     File mongo_4.0_diag.tar.gz     Text File mongodb-27717.log    
Operating System: ALL
Participants:
Case:

 Description   

Hi Team,

Since migrating to MongoDB 4.4, we have been encountering a recurring issue of client queries timing out frequently. Upon monitoring the CPU usage during these occurrences, we have noticed that mongod's CPU usage is unusually high, despite there not being a significant increase in operations.

 

This erratic CPU usage has been observed intermittently, and our application has a monitoring thread that checks the replicStatus and reports NO_PRIMARY during such times.

 

Mongo- Server Version: 4.4.18

Mongo Client Java Driver version: 3.12.9

Standalone or Replica : 4 data bearing + 1 arbiter

Server Option: 

root       52039  161  2.5 12649240 7229872 ?    S<Ll Apr06 4415:45 /usr/bin/mongod --ipv6 --slowms 500 --storageEngine wiredTiger --wiredTigerCacheSizeGB 22 --enableMajorityReadConcern false --bind_ip_all --keyFile /root/.dbkey --port 27717 --dbpath=/var/data/sessions.1/h --replSet set01h --fork --pidfilepath /var/run/sessionmgr-27717.pid --oplogSize 5120 --logpath /var/log/mongodb-27717.log --logappend --quiet

Storage Engine : WT

 

Application Monitors the replicaStatus periodically reporting, TYPE=UNKNOWN for PRIMARY intermittently. 

2023-04-07 23:02:30,944 [pool-2-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=795436, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414267859844}, ServerDescription{address=nd2bwa3psm15va:27717, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1152849, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414649210810}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=943462, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414619113856}]}}

Immediately in the next sec, the replicaSet detects PRIMARY

2023-04-07 23:02:31,260 [pool-2-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=772997, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116414767866623}, ServerDescription{address=nd2bwa3psm15va:27717, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=16803136, setName='set01h', canonicalAddress=nd2bwa3psm15va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=7fffffff0000000000000080, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116414870305431}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1152849, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414649210810}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=943462, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414619113856}]}}

 

But again moves to UKNOWN and later to a proper state

2023-04-07 23:02:31,544 [pool-2-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=729934, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415267757469}, ServerDescription{address=nd2bwa3psm15va:27717, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1059673, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415149889939}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=941532, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415120136591}]}}

2023-04-07 23:02:31,764 [pool-3-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=729934, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415267757469}, ServerDescription{address=nd2bwa3psm15va:27717, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=13549634, setName='set01h', canonicalAddress=nd2bwa3psm15va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=7fffffff0000000000000080, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415370985974}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1059673, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415149889939}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=941532, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415120136591}]}}

top - 23:02:27 up 1 day,  7:17,  0 users,  load average: 21.07, 23.62, 26.89
Tasks: 303 total,   1 running, 301 sleeping,   0 stopped,   1 zombie
%Cpu(s): 56.5 us, 15.2 sy,  0.0 ni, 22.7 id,  0.0 wa,  2.1 hi,  3.5 si,  0.0 st
MiB Mem : 273809.9 total, 204977.1 free,  30793.4 used,  38039.4 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used. 205453.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  52039 root       5 -15   12.0g   7.2g  36920 S 237.7   2.7   2977:27 mongod


top - 23:02:30 up 1 day,  7:17,  0 users,  load average: 21.07, 23.62, 26.89
Tasks: 301 total,   2 running, 299 sleeping,   0 stopped,   0 zombie
%Cpu(s): 87.8 us,  4.8 sy,  0.0 ni,  5.3 id,  0.0 wa,  0.9 hi,  1.2 si,  0.0 st
MiB Mem : 273809.9 total, 204977.1 free,  30793.4 used,  38039.4 buff/cache
MiB Swap:   4096.0 total,   4096.0 free,      0.0 used. 205453.0 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  52039 root       5 -15   12.1g   7.3g  36920 S 999.9   2.7   2977:48 mongod

PID : 52039 mongostats
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
    65  1449    969     68     701   821|0  1.9% 30.2%       0 12.0G 7.19G 0|1 1|1  5.79m   14.8m  741 set01h  PRI Apr  7 23:02:21.449
    71  1572   1063     76     870   959|0  1.7% 30.2%       0 12.0G 7.20G 0|0 2|0  6.45m   16.2m  744 set01h  PRI Apr  7 23:02:22.426
    69  1581   1064     74     841   988|0  1.8% 30.2%       0 12.0G 7.20G 0|0 1|0  6.53m   16.2m  744 set01h  PRI Apr  7 23:02:23.423
    88  1602   1090     77     793   870|0  1.8% 30.3%       0 12.0G 7.20G 0|0 1|1  6.55m   16.5m  741 set01h  PRI Apr  7 23:02:24.431
    67  1507   1026     78     954  1013|0  1.9% 30.1%       0 12.0G 7.20G 0|0 1|1  6.44m   15.8m  741 set01h  PRI Apr  7 23:02:25.427
    59  1601   1122     86     778   817|0  1.9% 30.1%       0 12.0G 7.19G 0|1 1|0  6.48m   16.6m  741 set01h  PRI Apr  7 23:02:26.419
    81  1609   1068     73     779   908|0  2.0% 30.2%       0 12.0G 7.19G 0|1 1|0  6.46m   16.3m  741 set01h  PRI Apr  7 23:02:27.435
    15   417    189     32     225   364|0  1.9% 30.1%       0 11.9G 7.31G 0|0 2|8  1.44m   2.93m  612 set01h  PRI Apr  7 23:02:30.352
    *0   184     *0      9      18  2643|0  1.9% 30.2%       0 11.8G 7.32G 0|0 1|1   318k   2.89m  413 set01h  PRI Apr  7 23:02:30.564
   140  1499    809     94     364  6105|0  2.0% 30.2%       0 11.6G 7.31G 0|0 1|5  5.90m   17.7m  292 set01h  PRI Apr  7 23:02:31.422
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time

 

The response time goes to minutes

                     ns    total     read    write    2023-04-07T23:02:27Z
session_cache_2.session    321ms     86ms    235ms
  session_cache.session    289ms     63ms    225ms
         local.oplog.rs    274ms    274ms      0ms
session_cache_4.session    251ms     67ms    184ms
session_cache_3.session    223ms     58ms    165ms
      admin.system.keys      0ms      0ms      0ms
     admin.system.roles      0ms      0ms      0ms
     admin.system.users      0ms      0ms      0ms
   admin.system.version      0ms      0ms      0ms
config.image_collection      0ms      0ms      0ms

                     ns    total     read    write    2023-04-07T23:02:29Z
         local.oplog.rs    147ms    147ms      0ms
session_cache_2.session     96ms     27ms     68ms
session_cache_3.session     73ms     20ms     53ms
  session_cache.session     71ms     17ms     53ms
session_cache_4.session     63ms     14ms     48ms
      admin.system.keys      0ms      0ms      0ms
     admin.system.roles      0ms      0ms      0ms
     admin.system.users      0ms      0ms      0ms
   admin.system.version      0ms      0ms      0ms
config.image_collection      0ms      0ms      0ms

                     ns       total        read      write    2023-04-07T23:02:30Z
session_cache_4.session    187974ms    126810ms    61164ms
  session_cache.session    159628ms    116939ms    42689ms
session_cache_2.session    157818ms    114013ms    43804ms
session_cache_3.session    132527ms     88974ms    43552ms
         local.oplog.rs      4639ms      4639ms        0ms
      admin.system.keys         0ms         0ms        0ms
     admin.system.roles         0ms         0ms        0ms
     admin.system.users         0ms         0ms        0ms
   admin.system.version         0ms         0ms        0ms
config.image_collection         0ms         0ms        0ms

                     ns     total     read     write    2023-04-07T23:02:31Z
  session_cache.session    1541ms    374ms    1166ms
session_cache_3.session    1306ms    294ms    1011ms
session_cache_2.session    1266ms    247ms    1018ms
session_cache_4.session    1064ms    232ms     831ms
         local.oplog.rs     338ms    338ms       0ms
      admin.system.keys       0ms      0ms       0ms
     admin.system.roles       0ms      0ms       0ms
     admin.system.users       0ms      0ms       0ms
   admin.system.version       0ms      0ms       0ms
config.image_collection       0ms      0ms       0ms

                     ns     total     read     write    2023-04-07T23:02:32Z
session_cache_4.session    1453ms    178ms    1274ms
  session_cache.session    1367ms    188ms    1178ms
session_cache_3.session    1303ms     95ms    1207ms
session_cache_2.session    1210ms    175ms    1034ms
         local.oplog.rs     273ms    273ms       0ms
      admin.system.keys       0ms      0ms       0ms
     admin.system.roles       0ms      0ms       0ms
     admin.system.users       0ms      0ms       0ms
   admin.system.version       0ms      0ms       0ms
config.image_collection       0ms      0ms       0ms

                     ns     total     read     write    2023-04-07T23:02:33Z
session_cache_2.session    1407ms    116ms    1291ms
session_cache_4.session    1188ms    302ms     886ms
session_cache_3.session    1147ms     79ms    1068ms
  session_cache.session     839ms     92ms     746ms
         local.oplog.rs     280ms    280ms       0ms
      admin.system.keys       0ms      0ms       0ms
     admin.system.roles       0ms      0ms       0ms
     admin.system.users       0ms      0ms       0ms
   admin.system.version       0ms      0ms       0ms
config.image_collection       0ms      0ms       0ms

                     ns     total     read     write    2023-04-07T23:02:34Z
session_cache_2.session    1230ms     87ms    1142ms
session_cache_3.session    1208ms    136ms    1071ms
  session_cache.session    1133ms     56ms    1076ms
session_cache_4.session    1118ms     65ms    1053ms
         local.oplog.rs     347ms    347ms       0ms
      admin.system.keys       0ms      0ms       0ms
     admin.system.roles       0ms      0ms       0ms
     admin.system.users       0ms      0ms       0ms
   admin.system.version       0ms      0ms       0ms
config.image_collection       0ms      0ms       0ms

 

CPU idle% goes to 0.0
Linux 4.18.0-372.26.1.el8_6.x86_64 (nd2bwa3psm15va)     04/07/2023      x86_64        (10 CPU)

11:02:09 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:02:10 PM     all     69.90      0.00     21.21      0.00      0.00      8.89
11:02:11 PM     all     65.22      0.00     22.38      0.00      0.00     12.40
11:02:12 PM     all     54.65      0.00     20.63      0.00      0.00     24.72
11:02:13 PM     all     53.48      0.00     21.00      0.00      0.00     25.51
11:02:14 PM     all     50.76      0.00     23.45      0.00      0.00     25.79
11:02:15 PM     all     56.42      0.00     21.44      0.00      0.00     22.14
11:02:16 PM     all     56.44      0.00     22.49      0.00      0.00     21.06
11:02:17 PM     all     55.44      0.00     21.15      0.00      0.00     23.41
11:02:18 PM     all     52.41      0.00     20.16      0.00      0.00     27.43
11:02:19 PM     all     54.22      0.00     21.67      0.00      0.00     24.11
11:02:20 PM     all     52.24      0.00     20.88      0.00      0.00     26.88
11:02:21 PM     all     61.88      0.00     21.44      0.00      0.00     16.68
11:02:22 PM     all     62.61      0.00     21.78      0.00      0.00     15.60
11:02:23 PM     all     50.36      0.00     19.16      0.00      0.00     30.48
11:02:24 PM     all     54.06      0.00     22.72      0.00      0.00     23.23
11:02:25 PM     all     55.25      0.00     20.39      0.00      0.00     24.36
11:02:26 PM     all     50.36      0.00     18.04      0.00      0.00     31.60
11:02:27 PM     all     52.86      0.00     19.29      0.00      0.00     27.86
11:02:28 PM     all     67.58      0.00     15.15      0.00      0.00     17.27
11:02:29 PM     all     97.20      0.00      2.80      0.00      0.00      0.00
11:02:30 PM     all     88.59      0.00      9.11      0.00      0.00      2.30
11:02:31 PM     all     59.90      0.00     26.40      0.00      0.00     13.71
11:02:32 PM     all     63.96      0.00     23.49      0.00      0.00     12.55
11:02:33 PM     all     70.56      0.00     21.88      0.00      0.00      7.56
11:02:34 PM     all     65.70      0.00     23.00      0.00      0.00     11.30

Linux 4.18.0-372.26.1.el8_6.x86_64 (nd2bwa3psm15va)     04/07/2023      x86_64        (10 CPU)

11:02:09 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
11:02:10 PM        28      2985     22.28     23.99     27.08         0
11:02:11 PM        17      2994     22.28     23.99     27.08         0
11:02:12 PM        24      3025     22.28     23.99     27.08         0
11:02:13 PM        10      3024     22.58     24.02     27.07         0
11:02:14 PM         6      2956     22.58     24.02     27.07         0
11:02:15 PM        25      2959     22.58     24.02     27.07         0
11:02:16 PM        37      2960     22.58     24.02     27.07         0
11:02:17 PM         7      2964     22.58     24.02     27.07         0
11:02:18 PM        10      2968     21.89     23.86     27.00         0
11:02:19 PM        21      2968     21.89     23.86     27.00         0
11:02:20 PM        25      3001     21.89     23.86     27.00         0
11:02:21 PM        35      2999     21.89     23.86     27.00         0
11:02:22 PM        20      3037     21.89     23.86     27.00         0
11:02:23 PM        23      3037     21.42     23.73     26.94         0
11:02:24 PM        29      2983     21.42     23.73     26.94         0
11:02:25 PM        11      2981     21.42     23.73     26.94         0
11:02:26 PM        18      2971     21.42     23.73     26.94         0
11:02:27 PM        13      2972     21.42     23.73     26.94         0
11:02:28 PM       300      2985     21.07     23.62     26.89         0
11:02:29 PM       741      3160     21.07     23.62     26.89         0
11:02:30 PM       115      2612     21.07     23.62     26.89         0

 

SM16va: SECONDARY member mongostat

insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
   *69    18  *1018    *83       0   120|0  0.9% 44.1%       0 13.0G 10.7G 0|0 1|0  25.2k    170k  153 set01h  SEC Apr  7 23:02:15.245
   *67    23   *935    *79       0   134|0  1.0% 44.1%       0 13.0G 10.7G 0|0 1|0  29.4k    167k  153 set01h  SEC Apr  7 23:02:16.231
   *71    24   *984    *67       0   132|0  1.0% 44.1%       0 13.0G 10.7G 0|0 1|0  29.5k    229k  156 set01h  SEC Apr  7 23:02:17.227
   *67    27  *1061    *72       0   136|0  1.0% 44.0%       0 13.0G 10.7G 0|0 1|0  31.2k    263k  153 set01h  SEC Apr  7 23:02:18.229
   *76    24  *1051    *65       0   136|0  1.0% 44.0%       0 13.0G 10.7G 0|0 1|0  31.0k    170k  156 set01h  SEC Apr  7 23:02:19.226
   *59    28  *1028    *65       0   128|0  1.0% 44.0%       0 13.0G 10.7G 0|0 1|0  29.7k    166k  156 set01h  SEC Apr  7 23:02:20.225
   *67    27   *934    *65       0   123|0  1.1% 43.9%       0 13.0G 10.7G 0|0 1|0  28.7k    157k  153 set01h  SEC Apr  7 23:02:21.232
   *64    28  *1079    *72       0   125|0  1.1% 43.9%       0 13.0G 10.7G 0|0 2|0  29.2k    166k  153 set01h  SEC Apr  7 23:02:22.226
   *70    22   *900    *60       0   118|0  1.1% 43.9%       0 13.0G 10.7G 0|0 1|0  26.1k    158k  153 set01h  SEC Apr  7 23:02:23.249
   *83    29  *1140    *69       0   140|0  1.1% 43.9%       0 13.0G 10.7G 0|1 1|0  32.8k    180k  153 set01h  SEC Apr  7 23:02:24.235
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
   *80    35  *1039    *74       0   121|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0  30.9k    169k  153 set01h  SEC Apr  7 23:02:25.240
   *50    28   *993    *78       0   127|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0  29.5k    163k  153 set01h  SEC Apr  7 23:02:26.227
   *85    33  *1113    *81       0   129|0  1.1% 43.8%       0 13.0G 10.7G 0|1 1|0  31.6k    162k  156 set01h  SEC Apr  7 23:02:27.251
   *66    16   *653    *74       0   129|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0  26.3k    212k  156 set01h  SEC Apr  7 23:02:28.223
    *0     5     *0     *0       0   149|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0  27.9k    310k  156 set01h  SEC Apr  7 23:02:29.229
    *0   261     *0     *0       0   126|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0   123k   1.23m  159 set01h  SEC Apr  7 23:02:30.227
   *64   947   *564    *96       0   175|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0   400k   3.81m  171 set01h  SEC Apr  7 23:02:31.238
   *87   251   *933    *82       0   115|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0   116k   1.13m  170 set01h  SEC Apr  7 23:02:32.231
  *124    41  *1562   *155       0   128|0  1.2% 43.6%       0 13.0G 10.7G 0|0 1|0  34.0k    198k  170 set01h  SEC Apr  7 23:02:33.227
  *119    33  *1553    *71       0   124|0  1.2% 43.6%       0 13.0G 10.7G 0|0 1|0  31.2k    179k  170 set01h  SEC Apr  7 23:02:34.236
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time

 

15vb:SECONDARY member mongostat

insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
   *61    29   *952    *77     434   532|0  0.9% 42.7%       0 13.1G 9.96G 0|0 1|0   733k   4.49m  158 set01h  SEC Apr  7 23:02:15.917
   *74    20   *978    *63     369   413|0  0.9% 42.7%       0 13.1G 9.96G 0|0 1|0   540k   7.30m  161 set01h  SEC Apr  7 23:02:16.909
   *64    24  *1010    *77     342   465|0  0.9% 42.8%       0 13.1G 9.96G 0|0 1|0   582k   4.76m  161 set01h  SEC Apr  7 23:02:17.925
   *76    18  *1094    *69     389   483|0  1.0% 42.7%       0 13.1G 9.96G 0|1 1|0   640k   5.01m  161 set01h  SEC Apr  7 23:02:18.916
   *62    24   *962    *60     247   351|0  1.0% 42.5%       0 13.1G 9.96G 0|0 1|0   426k   4.30m  161 set01h  SEC Apr  7 23:02:19.939
   *67    26  *1046    *61     260   356|0  1.0% 42.5%       0 13.1G 9.96G 0|0 1|0   429k   4.66m  158 set01h  SEC Apr  7 23:02:20.925
   *63    22  *1004    *72     317   423|0  1.0% 42.5%       0 13.1G 9.96G 0|0 1|0   543k   4.90m  158 set01h  SEC Apr  7 23:02:21.912
   *73    22   *963    *68     286   374|0  0.9% 42.5%       0 13.1G 9.96G 0|0 1|0   470k   4.37m  158 set01h  SEC Apr  7 23:02:22.924
   *75    33  *1097    *67     221   308|0  1.0% 42.5%       0 13.1G 9.96G 0|0 1|0   358k   4.81m  158 set01h  SEC Apr  7 23:02:23.926
   *86    31  *1079    *63     361   487|0  1.0% 42.6%       0 13.1G 9.96G 0|2 1|0   640k   5.02m  158 set01h  SEC Apr  7 23:02:24.912
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
   *48    19   *948    *90     245   328|0  1.1% 42.6%       0 13.1G 9.96G 0|0 1|0   394k   4.18m  158 set01h  SEC Apr  7 23:02:25.934
   *78    41  *1095    *72     255   328|0  1.1% 42.7%       0 13.1G 9.96G 0|0 1|0   397k   7.23m  158 set01h  SEC Apr  7 23:02:26.913
   *80    32  *1015    *82     442   521|0  1.2% 42.6%       0 13.1G 9.96G 0|0 1|0   712k   4.91m  161 set01h  SEC Apr  7 23:02:27.924
   *10     4   *172    *13      91   217|0  1.1% 42.6%       0 13.1G 9.96G 0|0 1|0   162k   1.06m  158 set01h  SEC Apr  7 23:02:28.907
    *0    48     *0     *0       0   201|0  1.1% 42.6%       0 13.1G 9.96G 0|0 1|0  53.3k    408k  150 set01h  SEC Apr  7 23:02:29.910
   *37   895   *333    *86      49   438|0  1.1% 42.6%       0 13.1G 9.96G 0|0 1|0   507k   5.37m  166 set01h  SEC Apr  7 23:02:30.914
   *98   464   *750    *76     253   359|0  1.2% 42.6%       0 13.1G 9.96G 0|0 1|0   609k   5.84m  167 set01h  SEC Apr  7 23:02:31.908
  *122    36  *1411   *149     346   453|0  1.2% 42.6%       0 13.1G 9.96G 0|0 1|0   604k   6.59m  167 set01h  SEC Apr  7 23:02:32.912
  *123    44  *1739    *80     311   411|0  1.2% 42.6%       0 13.1G 9.96G 0|1 1|0   534k   7.54m  167 set01h  SEC Apr  7 23:02:33.917
   *84    45  *1324    *72     325   444|0  1.2% 42.4%       0 13.1G 9.96G 1|0 1|0   573k   5.95m  168 set01h  SEC Apr  7 23:02:34.911
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time

 

16vb:SECONDARY member mongostat

nsert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
   *75    18  *1046    *66       0   135|0  0.9% 43.4%       0 13.0G 10.7G 0|0 1|0  28.8k    160k  166 set01h  SEC Apr  7 23:02:18.807
   *65    31  *1004    *66       0   131|0  0.9% 43.4%       0 13.0G 10.7G 0|0 1|0  31.6k    177k  166 set01h  SEC Apr  7 23:02:19.784
   *67    37  *1038    *56       0   128|0  0.7% 42.8%       0 13.0G 10.7G 0|1 1|0  33.1k    172k  163 set01h  SEC Apr  7 23:02:20.785
   *59    28   *973    *69       0   122|0  0.8% 42.8%       0 13.0G 10.7G 0|1 1|0  29.1k    157k  163 set01h  SEC Apr  7 23:02:21.794
   *76    30   *991    *73       0   127|0  0.9% 42.8%       0 13.0G 10.7G 0|0 1|0  30.0k    166k  163 set01h  SEC Apr  7 23:02:22.792
   *70    26  *1110    *70       0   124|0  0.9% 42.9%       0 13.0G 10.7G 0|0 1|0  28.5k    156k  163 set01h  SEC Apr  7 23:02:23.783
   *85    22  *1031    *63       0   122|0  1.0% 42.9%       0 13.0G 10.7G 0|0 2|0  26.8k    175k  163 set01h  SEC Apr  7 23:02:24.794
   *58    21   *958    *87       0   132|0  1.1% 43.0%       0 13.0G 10.7G 0|1 1|0  29.3k    209k  166 set01h  SEC Apr  7 23:02:25.787
   *66    31  *1079    *68       0   134|0  1.1% 43.0%       0 13.0G 10.7G 0|0 1|0  32.4k    569k  164 set01h  SEC Apr  7 23:02:26.796
   *85    44  *1058    *81       0   152|0  1.2% 43.1%       0 13.0G 10.7G 0|0 1|0  40.0k    235k  166 set01h  SEC Apr  7 23:02:27.785
insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
   *19     5   *215    *26       0   126|0  1.2% 43.1%       0 13.0G 10.7G 0|0 1|0  21.6k    162k  166 set01h  SEC Apr  7 23:02:28.784
    *0    31     *0     *0       0   125|0  1.2% 43.1%       0 13.0G 10.7G 0|0 2|0  32.7k    274k  161 set01h  SEC Apr  7 23:02:29.788
   *13   808   *214    *76       0   148|0  1.2% 43.1%       0 13.0G 10.7G 0|0 1|0   341k   3.32m  170 set01h  SEC Apr  7 23:02:30.788
  *108   592   *792    *72       0   115|0  1.3% 43.1%       0 13.0G 10.7G 0|1 1|0   251k   2.52m  172 set01h  SEC Apr  7 23:02:31.787
  *124    42  *1368   *154       0   128|0  1.4% 43.2%       0 13.0G 10.7G 0|1 1|0  34.7k    430k  172 set01h  SEC Apr  7 23:02:32.784
  *130    47  *1759    *80       0   119|0  1.5% 43.3%       0 13.0G 10.7G 0|1 1|0  34.7k    169k  172 set01h  SEC Apr  7 23:02:33.789
   *74    48  *1345    *74       0   127|0  1.6% 43.3%       0 13.0G 10.7G 0|1 1|0  35.9k    176k  172 set01h  SEC Apr  7 23:02:34.787
   *80    43  *1274    *80       0   130|0  1.6% 43.4%       0 13.0G 10.7G 0|0 1|0  35.9k    210k  175 set01h  SEC Apr  7 23:02:35.786
   *92    29  *1167    *85       0   129|0  1.7% 43.4%       0 13.0G 10.7G 0|0 1|0  30.5k    209k  172 set01h  SEC Apr  7 23:02:36.786
   *75    26   *930    *58       0   150|0  1.7% 43.4%       0 13.0G 10.7G 0|0 1|0  33.9k    227k  175 set01h  SEC Apr  7 23:02:37.783

 

Since the PRIMARY seems unavailable, infact NO primaries are seen for 1-2 sec, the application(client) queries are getting timedout resulting in bad customer experience.

We tried to reduce the load(query) to the mongo but it looks the same. We use /etc/hosts for name resolution and do not use DNS at all. 

We need to know why time to time, mongod hogs.  

 

Please provide a secure portal, I can share the diagnostics.data and complete mongodb log during this period. 



 Comments   
Comment by Yuan Fang [ 31/May/23 ]

Closing this ticket as further investigation has been moved to another thread.

Comment by venkataramans rama [ 11/Apr/23 ]

Thanks a lot Yuan for the analysis. Yes the date was April 7. Sorry for the typo.

Yes, we also see primary insert/update reduced during this time but  we believe its due to Mongo high CPU and not reachable to clients.  Also with the same setup if we run with mongo 4.2/4.0 we are not observing this issue. So we believe its something unique to mongo 4.4 and hence we feel a regression issue. 

Also this is not just happening in one setup but multiple setups. So we don't feel its related to network issue.  Also mongod primary reports IDHACK queries (optimized queries) also as Slow query because, its doing some heavy operations (which we don't know what) which is causing the HIGH CPU and Slow query response for even IDHACK queries. 

Please let me know if you need any additional details to find the cause of this problem. 

 

Comment by Yuan Fang [ 10/Apr/23 ]

Hi veramasu@hcl.com,

Thank you for providing the diagnostic data,

Please refer to the timeline April 9th from 23:02:10 to 23:02:38

I noticed that in your comment you referred to the timeline April 9th, but I believe you meant to refer to April 7th instead, since the FTDC data does not cover April 9th.

Based on my analysis of the FTDC data from the primary node, I observed some slow queries and increased latencies around point B (2023-04-07T23:02:29.747Z), which appear to be caused by network issues. This is supported by the decrease in ss network bytesIn and ss network bytesOut between points A and B. The increase in CPU usage is likely due to the server attempting to handle these slow queries. Additionally, while the status of the primary node's self state remained healthy, it was seen as down from secondary node 4 during a period after point B, which I suspect may also be related to network issues.

Since migrating to MongoDB 4.4, we have been encountering a recurring issue of client queries timing out frequently.

I'm unsure if you meant to suggest a regression after upgrading. If so, could you please clarify and provide more information?

Regards,
Yuan

Comment by venkataramans rama [ 10/Apr/23 ]

Thanks Team for the portal link. I have uploaded all the 4 members mongod.log and diagnostics.data

Please refer to the timeline April 9th from 23:02:10 to 23:02:38.

Comment by Yuan Fang [ 10/Apr/23 ]

Hi veramasu@hcl.com,

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Regards,
Yuan
 

Generated at Thu Feb 08 06:31:18 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.