Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-75881

mongod 4.4 CPU hogs/stalls leading to client queries timedout

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • ALL

      Hi Team,

      Since migrating to MongoDB 4.4, we have been encountering a recurring issue of client queries timing out frequently. Upon monitoring the CPU usage during these occurrences, we have noticed that mongod's CPU usage is unusually high, despite there not being a significant increase in operations.

       

      This erratic CPU usage has been observed intermittently, and our application has a monitoring thread that checks the replicStatus and reports NO_PRIMARY during such times.

       

      Mongo- Server Version: 4.4.18

      Mongo Client Java Driver version: 3.12.9

      Standalone or Replica : 4 data bearing + 1 arbiter

      Server Option: 

      root       52039  161  2.5 12649240 7229872 ?    S<Ll Apr06 4415:45 /usr/bin/mongod --ipv6 --slowms 500 --storageEngine wiredTiger --wiredTigerCacheSizeGB 22 --enableMajorityReadConcern false --bind_ip_all --keyFile /root/.dbkey --port 27717 --dbpath=/var/data/sessions.1/h --replSet set01h --fork --pidfilepath /var/run/sessionmgr-27717.pid --oplogSize 5120 --logpath /var/log/mongodb-27717.log --logappend --quiet

      Storage Engine : WT

       

      Application Monitors the replicaStatus periodically reporting, TYPE=UNKNOWN for PRIMARY intermittently. 

      2023-04-07 23:02:30,944 [pool-2-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

      {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=795436, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414267859844}, ServerDescription{address=nd2bwa3psm15va:27717, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

      , minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1152849, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414649210810}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

      {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=943462, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414619113856}]}}

      Immediately in the next sec, the replicaSet detects PRIMARY

      2023-04-07 23:02:31,260 [pool-2-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

      , minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=772997, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116414767866623}, ServerDescription{address=nd2bwa3psm15va:27717, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=16803136, setName='set01h', canonicalAddress=nd2bwa3psm15va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=7fffffff0000000000000080, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116414870305431}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

      {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1152849, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414649210810}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

      , minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=943462, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:30 UTC 2023, lastUpdateTimeNanos=116414619113856}]}}

       

      But again moves to UKNOWN and later to a proper state

      2023-04-07 23:02:31,544 [pool-2-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

      {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=729934, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415267757469}, ServerDescription{address=nd2bwa3psm15va:27717, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

      , minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1059673, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415149889939}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

      {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=941532, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415120136591}]}}

      2023-04-07 23:02:31,764 [pool-3-thread-1] replicaset status: ReplicaSetStatus{name=set01h, cluster=ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=nd2bwa3psm16vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

      , minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=729934, setName='set01h', canonicalAddress=nd2bwa3psm16vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415267757469}, ServerDescription{address=nd2bwa3psm15va:27717, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=13549634, setName='set01h', canonicalAddress=nd2bwa3psm15va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=7fffffff0000000000000080, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415370985974}, ServerDescription{address=nd2bwa3psm16va:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion

      {versionList=[4, 4, 18]}, minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1059673, setName='set01h', canonicalAddress=nd2bwa3psm16va:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415149889939}, ServerDescription{address=nd2bwa3psm15vb:27717, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 4, 18]}

      , minWireVersion=0, maxWireVersion=9, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=941532, setName='set01h', canonicalAddress=nd2bwa3psm15vb:27717, hosts=[nd2bwa3psm16va:27717, nd2bwa3psm15va:27717, nd2bwa3psm16vb:27717, nd2bwa3psm15vb:27717], passives=[], arbiters=[], primary='nd2bwa3psm15va:27717', tagSet=TagSet{[]}, electionId=null, setVersion=27149688, lastWriteDate=Fri Apr 07 23:02:31 UTC 2023, lastUpdateTimeNanos=116415120136591}]}}

      top - 23:02:27 up 1 day,  7:17,  0 users,  load average: 21.07, 23.62, 26.89
      Tasks: 303 total,   1 running, 301 sleeping,   0 stopped,   1 zombie
      %Cpu(s): 56.5 us, 15.2 sy,  0.0 ni, 22.7 id,  0.0 wa,  2.1 hi,  3.5 si,  0.0 st
      MiB Mem : 273809.9 total, 204977.1 free,  30793.4 used,  38039.4 buff/cache
      MiB Swap:   4096.0 total,   4096.0 free,      0.0 used. 205453.0 avail Mem

          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
        52039 root       5 -15   12.0g   7.2g  36920 S 237.7   2.7   2977:27 mongod


      top - 23:02:30 up 1 day,  7:17,  0 users,  load average: 21.07, 23.62, 26.89
      Tasks: 301 total,   2 running, 299 sleeping,   0 stopped,   0 zombie
      %Cpu(s): 87.8 us,  4.8 sy,  0.0 ni,  5.3 id,  0.0 wa,  0.9 hi,  1.2 si,  0.0 st
      MiB Mem : 273809.9 total, 204977.1 free,  30793.4 used,  38039.4 buff/cache
      MiB Swap:   4096.0 total,   4096.0 free,      0.0 used. 205453.0 avail Mem

          PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
        52039 root       5 -15   12.1g   7.3g  36920 S 999.9   2.7   2977:48 mongod

      PID : 52039 mongostats
      insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
          65  1449    969     68     701   821|0  1.9% 30.2%       0 12.0G 7.19G 0|1 1|1  5.79m   14.8m  741 set01h  PRI Apr  7 23:02:21.449
          71  1572   1063     76     870   959|0  1.7% 30.2%       0 12.0G 7.20G 0|0 2|0  6.45m   16.2m  744 set01h  PRI Apr  7 23:02:22.426
          69  1581   1064     74     841   988|0  1.8% 30.2%       0 12.0G 7.20G 0|0 1|0  6.53m   16.2m  744 set01h  PRI Apr  7 23:02:23.423
          88  1602   1090     77     793   870|0  1.8% 30.3%       0 12.0G 7.20G 0|0 1|1  6.55m   16.5m  741 set01h  PRI Apr  7 23:02:24.431
          67  1507   1026     78     954  1013|0  1.9% 30.1%       0 12.0G 7.20G 0|0 1|1  6.44m   15.8m  741 set01h  PRI Apr  7 23:02:25.427
          59  1601   1122     86     778   817|0  1.9% 30.1%       0 12.0G 7.19G 0|1 1|0  6.48m   16.6m  741 set01h  PRI Apr  7 23:02:26.419
          81  1609   1068     73     779   908|0  2.0% 30.2%       0 12.0G 7.19G 0|1 1|0  6.46m   16.3m  741 set01h  PRI Apr  7 23:02:27.435
          15   417    189     32     225   364|0  1.9% 30.1%       0 11.9G 7.31G 0|0 2|8  1.44m   2.93m  612 set01h  PRI Apr  7 23:02:30.352
          *0   184     *0      9      18  2643|0  1.9% 30.2%       0 11.8G 7.32G 0|0 1|1   318k   2.89m  413 set01h  PRI Apr  7 23:02:30.564
         140  1499    809     94     364  6105|0  2.0% 30.2%       0 11.6G 7.31G 0|0 1|5  5.90m   17.7m  292 set01h  PRI Apr  7 23:02:31.422
      insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time

       

      The response time goes to minutes

                           ns    total     read    write    2023-04-07T23:02:27Z
      session_cache_2.session    321ms     86ms    235ms
        session_cache.session    289ms     63ms    225ms
               local.oplog.rs    274ms    274ms      0ms
      session_cache_4.session    251ms     67ms    184ms
      session_cache_3.session    223ms     58ms    165ms
            admin.system.keys      0ms      0ms      0ms
           admin.system.roles      0ms      0ms      0ms
           admin.system.users      0ms      0ms      0ms
         admin.system.version      0ms      0ms      0ms
      config.image_collection      0ms      0ms      0ms

                           ns    total     read    write    2023-04-07T23:02:29Z
               local.oplog.rs    147ms    147ms      0ms
      session_cache_2.session     96ms     27ms     68ms
      session_cache_3.session     73ms     20ms     53ms
        session_cache.session     71ms     17ms     53ms
      session_cache_4.session     63ms     14ms     48ms
            admin.system.keys      0ms      0ms      0ms
           admin.system.roles      0ms      0ms      0ms
           admin.system.users      0ms      0ms      0ms
         admin.system.version      0ms      0ms      0ms
      config.image_collection      0ms      0ms      0ms

                           ns       total        read      write    2023-04-07T23:02:30Z
      session_cache_4.session    187974ms    126810ms    61164ms
        session_cache.session    159628ms    116939ms    42689ms
      session_cache_2.session    157818ms    114013ms    43804ms
      session_cache_3.session    132527ms     88974ms    43552ms
               local.oplog.rs      4639ms      4639ms        0ms
            admin.system.keys         0ms         0ms        0ms
           admin.system.roles         0ms         0ms        0ms
           admin.system.users         0ms         0ms        0ms
         admin.system.version         0ms         0ms        0ms
      config.image_collection         0ms         0ms        0ms

                           ns     total     read     write    2023-04-07T23:02:31Z
        session_cache.session    1541ms    374ms    1166ms
      session_cache_3.session    1306ms    294ms    1011ms
      session_cache_2.session    1266ms    247ms    1018ms
      session_cache_4.session    1064ms    232ms     831ms
               local.oplog.rs     338ms    338ms       0ms
            admin.system.keys       0ms      0ms       0ms
           admin.system.roles       0ms      0ms       0ms
           admin.system.users       0ms      0ms       0ms
         admin.system.version       0ms      0ms       0ms
      config.image_collection       0ms      0ms       0ms

                           ns     total     read     write    2023-04-07T23:02:32Z
      session_cache_4.session    1453ms    178ms    1274ms
        session_cache.session    1367ms    188ms    1178ms
      session_cache_3.session    1303ms     95ms    1207ms
      session_cache_2.session    1210ms    175ms    1034ms
               local.oplog.rs     273ms    273ms       0ms
            admin.system.keys       0ms      0ms       0ms
           admin.system.roles       0ms      0ms       0ms
           admin.system.users       0ms      0ms       0ms
         admin.system.version       0ms      0ms       0ms
      config.image_collection       0ms      0ms       0ms

                           ns     total     read     write    2023-04-07T23:02:33Z
      session_cache_2.session    1407ms    116ms    1291ms
      session_cache_4.session    1188ms    302ms     886ms
      session_cache_3.session    1147ms     79ms    1068ms
        session_cache.session     839ms     92ms     746ms
               local.oplog.rs     280ms    280ms       0ms
            admin.system.keys       0ms      0ms       0ms
           admin.system.roles       0ms      0ms       0ms
           admin.system.users       0ms      0ms       0ms
         admin.system.version       0ms      0ms       0ms
      config.image_collection       0ms      0ms       0ms

                           ns     total     read     write    2023-04-07T23:02:34Z
      session_cache_2.session    1230ms     87ms    1142ms
      session_cache_3.session    1208ms    136ms    1071ms
        session_cache.session    1133ms     56ms    1076ms
      session_cache_4.session    1118ms     65ms    1053ms
               local.oplog.rs     347ms    347ms       0ms
            admin.system.keys       0ms      0ms       0ms
           admin.system.roles       0ms      0ms       0ms
           admin.system.users       0ms      0ms       0ms
         admin.system.version       0ms      0ms       0ms
      config.image_collection       0ms      0ms       0ms

       

      CPU idle% goes to 0.0
      Linux 4.18.0-372.26.1.el8_6.x86_64 (nd2bwa3psm15va)     04/07/2023      x86_64        (10 CPU)

      11:02:09 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
      11:02:10 PM     all     69.90      0.00     21.21      0.00      0.00      8.89
      11:02:11 PM     all     65.22      0.00     22.38      0.00      0.00     12.40
      11:02:12 PM     all     54.65      0.00     20.63      0.00      0.00     24.72
      11:02:13 PM     all     53.48      0.00     21.00      0.00      0.00     25.51
      11:02:14 PM     all     50.76      0.00     23.45      0.00      0.00     25.79
      11:02:15 PM     all     56.42      0.00     21.44      0.00      0.00     22.14
      11:02:16 PM     all     56.44      0.00     22.49      0.00      0.00     21.06
      11:02:17 PM     all     55.44      0.00     21.15      0.00      0.00     23.41
      11:02:18 PM     all     52.41      0.00     20.16      0.00      0.00     27.43
      11:02:19 PM     all     54.22      0.00     21.67      0.00      0.00     24.11
      11:02:20 PM     all     52.24      0.00     20.88      0.00      0.00     26.88
      11:02:21 PM     all     61.88      0.00     21.44      0.00      0.00     16.68
      11:02:22 PM     all     62.61      0.00     21.78      0.00      0.00     15.60
      11:02:23 PM     all     50.36      0.00     19.16      0.00      0.00     30.48
      11:02:24 PM     all     54.06      0.00     22.72      0.00      0.00     23.23
      11:02:25 PM     all     55.25      0.00     20.39      0.00      0.00     24.36
      11:02:26 PM     all     50.36      0.00     18.04      0.00      0.00     31.60
      11:02:27 PM     all     52.86      0.00     19.29      0.00      0.00     27.86
      11:02:28 PM     all     67.58      0.00     15.15      0.00      0.00     17.27
      11:02:29 PM     all     97.20      0.00      2.80      0.00      0.00      0.00
      11:02:30 PM     all     88.59      0.00      9.11      0.00      0.00      2.30
      11:02:31 PM     all     59.90      0.00     26.40      0.00      0.00     13.71
      11:02:32 PM     all     63.96      0.00     23.49      0.00      0.00     12.55
      11:02:33 PM     all     70.56      0.00     21.88      0.00      0.00      7.56
      11:02:34 PM     all     65.70      0.00     23.00      0.00      0.00     11.30

      Linux 4.18.0-372.26.1.el8_6.x86_64 (nd2bwa3psm15va)     04/07/2023      x86_64        (10 CPU)

      11:02:09 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
      11:02:10 PM        28      2985     22.28     23.99     27.08         0
      11:02:11 PM        17      2994     22.28     23.99     27.08         0
      11:02:12 PM        24      3025     22.28     23.99     27.08         0
      11:02:13 PM        10      3024     22.58     24.02     27.07         0
      11:02:14 PM         6      2956     22.58     24.02     27.07         0
      11:02:15 PM        25      2959     22.58     24.02     27.07         0
      11:02:16 PM        37      2960     22.58     24.02     27.07         0
      11:02:17 PM         7      2964     22.58     24.02     27.07         0
      11:02:18 PM        10      2968     21.89     23.86     27.00         0
      11:02:19 PM        21      2968     21.89     23.86     27.00         0
      11:02:20 PM        25      3001     21.89     23.86     27.00         0
      11:02:21 PM        35      2999     21.89     23.86     27.00         0
      11:02:22 PM        20      3037     21.89     23.86     27.00         0
      11:02:23 PM        23      3037     21.42     23.73     26.94         0
      11:02:24 PM        29      2983     21.42     23.73     26.94         0
      11:02:25 PM        11      2981     21.42     23.73     26.94         0
      11:02:26 PM        18      2971     21.42     23.73     26.94         0
      11:02:27 PM        13      2972     21.42     23.73     26.94         0
      11:02:28 PM       300      2985     21.07     23.62     26.89         0
      11:02:29 PM       741      3160     21.07     23.62     26.89         0
      11:02:30 PM       115      2612     21.07     23.62     26.89         0

       

      SM16va: SECONDARY member mongostat

      insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
         *69    18  *1018    *83       0   120|0  0.9% 44.1%       0 13.0G 10.7G 0|0 1|0  25.2k    170k  153 set01h  SEC Apr  7 23:02:15.245
         *67    23   *935    *79       0   134|0  1.0% 44.1%       0 13.0G 10.7G 0|0 1|0  29.4k    167k  153 set01h  SEC Apr  7 23:02:16.231
         *71    24   *984    *67       0   132|0  1.0% 44.1%       0 13.0G 10.7G 0|0 1|0  29.5k    229k  156 set01h  SEC Apr  7 23:02:17.227
         *67    27  *1061    *72       0   136|0  1.0% 44.0%       0 13.0G 10.7G 0|0 1|0  31.2k    263k  153 set01h  SEC Apr  7 23:02:18.229
         *76    24  *1051    *65       0   136|0  1.0% 44.0%       0 13.0G 10.7G 0|0 1|0  31.0k    170k  156 set01h  SEC Apr  7 23:02:19.226
         *59    28  *1028    *65       0   128|0  1.0% 44.0%       0 13.0G 10.7G 0|0 1|0  29.7k    166k  156 set01h  SEC Apr  7 23:02:20.225
         *67    27   *934    *65       0   123|0  1.1% 43.9%       0 13.0G 10.7G 0|0 1|0  28.7k    157k  153 set01h  SEC Apr  7 23:02:21.232
         *64    28  *1079    *72       0   125|0  1.1% 43.9%       0 13.0G 10.7G 0|0 2|0  29.2k    166k  153 set01h  SEC Apr  7 23:02:22.226
         *70    22   *900    *60       0   118|0  1.1% 43.9%       0 13.0G 10.7G 0|0 1|0  26.1k    158k  153 set01h  SEC Apr  7 23:02:23.249
         *83    29  *1140    *69       0   140|0  1.1% 43.9%       0 13.0G 10.7G 0|1 1|0  32.8k    180k  153 set01h  SEC Apr  7 23:02:24.235
      insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
         *80    35  *1039    *74       0   121|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0  30.9k    169k  153 set01h  SEC Apr  7 23:02:25.240
         *50    28   *993    *78       0   127|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0  29.5k    163k  153 set01h  SEC Apr  7 23:02:26.227
         *85    33  *1113    *81       0   129|0  1.1% 43.8%       0 13.0G 10.7G 0|1 1|0  31.6k    162k  156 set01h  SEC Apr  7 23:02:27.251
         *66    16   *653    *74       0   129|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0  26.3k    212k  156 set01h  SEC Apr  7 23:02:28.223
          *0     5     *0     *0       0   149|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0  27.9k    310k  156 set01h  SEC Apr  7 23:02:29.229
          *0   261     *0     *0       0   126|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0   123k   1.23m  159 set01h  SEC Apr  7 23:02:30.227
         *64   947   *564    *96       0   175|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0   400k   3.81m  171 set01h  SEC Apr  7 23:02:31.238
         *87   251   *933    *82       0   115|0  1.1% 43.8%       0 13.0G 10.7G 0|0 1|0   116k   1.13m  170 set01h  SEC Apr  7 23:02:32.231
        *124    41  *1562   *155       0   128|0  1.2% 43.6%       0 13.0G 10.7G 0|0 1|0  34.0k    198k  170 set01h  SEC Apr  7 23:02:33.227
        *119    33  *1553    *71       0   124|0  1.2% 43.6%       0 13.0G 10.7G 0|0 1|0  31.2k    179k  170 set01h  SEC Apr  7 23:02:34.236
      insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time

       

      15vb:SECONDARY member mongostat

      insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
         *61    29   *952    *77     434   532|0  0.9% 42.7%       0 13.1G 9.96G 0|0 1|0   733k   4.49m  158 set01h  SEC Apr  7 23:02:15.917
         *74    20   *978    *63     369   413|0  0.9% 42.7%       0 13.1G 9.96G 0|0 1|0   540k   7.30m  161 set01h  SEC Apr  7 23:02:16.909
         *64    24  *1010    *77     342   465|0  0.9% 42.8%       0 13.1G 9.96G 0|0 1|0   582k   4.76m  161 set01h  SEC Apr  7 23:02:17.925
         *76    18  *1094    *69     389   483|0  1.0% 42.7%       0 13.1G 9.96G 0|1 1|0   640k   5.01m  161 set01h  SEC Apr  7 23:02:18.916
         *62    24   *962    *60     247   351|0  1.0% 42.5%       0 13.1G 9.96G 0|0 1|0   426k   4.30m  161 set01h  SEC Apr  7 23:02:19.939
         *67    26  *1046    *61     260   356|0  1.0% 42.5%       0 13.1G 9.96G 0|0 1|0   429k   4.66m  158 set01h  SEC Apr  7 23:02:20.925
         *63    22  *1004    *72     317   423|0  1.0% 42.5%       0 13.1G 9.96G 0|0 1|0   543k   4.90m  158 set01h  SEC Apr  7 23:02:21.912
         *73    22   *963    *68     286   374|0  0.9% 42.5%       0 13.1G 9.96G 0|0 1|0   470k   4.37m  158 set01h  SEC Apr  7 23:02:22.924
         *75    33  *1097    *67     221   308|0  1.0% 42.5%       0 13.1G 9.96G 0|0 1|0   358k   4.81m  158 set01h  SEC Apr  7 23:02:23.926
         *86    31  *1079    *63     361   487|0  1.0% 42.6%       0 13.1G 9.96G 0|2 1|0   640k   5.02m  158 set01h  SEC Apr  7 23:02:24.912
      insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
         *48    19   *948    *90     245   328|0  1.1% 42.6%       0 13.1G 9.96G 0|0 1|0   394k   4.18m  158 set01h  SEC Apr  7 23:02:25.934
         *78    41  *1095    *72     255   328|0  1.1% 42.7%       0 13.1G 9.96G 0|0 1|0   397k   7.23m  158 set01h  SEC Apr  7 23:02:26.913
         *80    32  *1015    *82     442   521|0  1.2% 42.6%       0 13.1G 9.96G 0|0 1|0   712k   4.91m  161 set01h  SEC Apr  7 23:02:27.924
         *10     4   *172    *13      91   217|0  1.1% 42.6%       0 13.1G 9.96G 0|0 1|0   162k   1.06m  158 set01h  SEC Apr  7 23:02:28.907
          *0    48     *0     *0       0   201|0  1.1% 42.6%       0 13.1G 9.96G 0|0 1|0  53.3k    408k  150 set01h  SEC Apr  7 23:02:29.910
         *37   895   *333    *86      49   438|0  1.1% 42.6%       0 13.1G 9.96G 0|0 1|0   507k   5.37m  166 set01h  SEC Apr  7 23:02:30.914
         *98   464   *750    *76     253   359|0  1.2% 42.6%       0 13.1G 9.96G 0|0 1|0   609k   5.84m  167 set01h  SEC Apr  7 23:02:31.908
        *122    36  *1411   *149     346   453|0  1.2% 42.6%       0 13.1G 9.96G 0|0 1|0   604k   6.59m  167 set01h  SEC Apr  7 23:02:32.912
        *123    44  *1739    *80     311   411|0  1.2% 42.6%       0 13.1G 9.96G 0|1 1|0   534k   7.54m  167 set01h  SEC Apr  7 23:02:33.917
         *84    45  *1324    *72     325   444|0  1.2% 42.4%       0 13.1G 9.96G 1|0 1|0   573k   5.95m  168 set01h  SEC Apr  7 23:02:34.911
      insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time

       

      16vb:SECONDARY member mongostat

      nsert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
         *75    18  *1046    *66       0   135|0  0.9% 43.4%       0 13.0G 10.7G 0|0 1|0  28.8k    160k  166 set01h  SEC Apr  7 23:02:18.807
         *65    31  *1004    *66       0   131|0  0.9% 43.4%       0 13.0G 10.7G 0|0 1|0  31.6k    177k  166 set01h  SEC Apr  7 23:02:19.784
         *67    37  *1038    *56       0   128|0  0.7% 42.8%       0 13.0G 10.7G 0|1 1|0  33.1k    172k  163 set01h  SEC Apr  7 23:02:20.785
         *59    28   *973    *69       0   122|0  0.8% 42.8%       0 13.0G 10.7G 0|1 1|0  29.1k    157k  163 set01h  SEC Apr  7 23:02:21.794
         *76    30   *991    *73       0   127|0  0.9% 42.8%       0 13.0G 10.7G 0|0 1|0  30.0k    166k  163 set01h  SEC Apr  7 23:02:22.792
         *70    26  *1110    *70       0   124|0  0.9% 42.9%       0 13.0G 10.7G 0|0 1|0  28.5k    156k  163 set01h  SEC Apr  7 23:02:23.783
         *85    22  *1031    *63       0   122|0  1.0% 42.9%       0 13.0G 10.7G 0|0 2|0  26.8k    175k  163 set01h  SEC Apr  7 23:02:24.794
         *58    21   *958    *87       0   132|0  1.1% 43.0%       0 13.0G 10.7G 0|1 1|0  29.3k    209k  166 set01h  SEC Apr  7 23:02:25.787
         *66    31  *1079    *68       0   134|0  1.1% 43.0%       0 13.0G 10.7G 0|0 1|0  32.4k    569k  164 set01h  SEC Apr  7 23:02:26.796
         *85    44  *1058    *81       0   152|0  1.2% 43.1%       0 13.0G 10.7G 0|0 1|0  40.0k    235k  166 set01h  SEC Apr  7 23:02:27.785
      insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn    set repl                time
         *19     5   *215    *26       0   126|0  1.2% 43.1%       0 13.0G 10.7G 0|0 1|0  21.6k    162k  166 set01h  SEC Apr  7 23:02:28.784
          *0    31     *0     *0       0   125|0  1.2% 43.1%       0 13.0G 10.7G 0|0 2|0  32.7k    274k  161 set01h  SEC Apr  7 23:02:29.788
         *13   808   *214    *76       0   148|0  1.2% 43.1%       0 13.0G 10.7G 0|0 1|0   341k   3.32m  170 set01h  SEC Apr  7 23:02:30.788
        *108   592   *792    *72       0   115|0  1.3% 43.1%       0 13.0G 10.7G 0|1 1|0   251k   2.52m  172 set01h  SEC Apr  7 23:02:31.787
        *124    42  *1368   *154       0   128|0  1.4% 43.2%       0 13.0G 10.7G 0|1 1|0  34.7k    430k  172 set01h  SEC Apr  7 23:02:32.784
        *130    47  *1759    *80       0   119|0  1.5% 43.3%       0 13.0G 10.7G 0|1 1|0  34.7k    169k  172 set01h  SEC Apr  7 23:02:33.789
         *74    48  *1345    *74       0   127|0  1.6% 43.3%       0 13.0G 10.7G 0|1 1|0  35.9k    176k  172 set01h  SEC Apr  7 23:02:34.787
         *80    43  *1274    *80       0   130|0  1.6% 43.4%       0 13.0G 10.7G 0|0 1|0  35.9k    210k  175 set01h  SEC Apr  7 23:02:35.786
         *92    29  *1167    *85       0   129|0  1.7% 43.4%       0 13.0G 10.7G 0|0 1|0  30.5k    209k  172 set01h  SEC Apr  7 23:02:36.786
         *75    26   *930    *58       0   150|0  1.7% 43.4%       0 13.0G 10.7G 0|0 1|0  33.9k    227k  175 set01h  SEC Apr  7 23:02:37.783

       

      Since the PRIMARY seems unavailable, infact NO primaries are seen for 1-2 sec, the application(client) queries are getting timedout resulting in bad customer experience.

      We tried to reduce the load(query) to the mongo but it looks the same. We use /etc/hosts for name resolution and do not use DNS at all. 

      We need to know why time to time, mongod hogs.  

       

      Please provide a secure portal, I can share the diagnostics.data and complete mongodb log during this period. 

        1. Screenshot 2023-04-10 at 5.52.06 PM.png
          Screenshot 2023-04-10 at 5.52.06 PM.png
          441 kB
        2. Screenshot 2023-04-11 at 4.31.44 PM.png
          Screenshot 2023-04-11 at 4.31.44 PM.png
          192 kB
        3. mongodb-27717.log
          741 kB
        4. 4.0_Primary_Startup_mongodb-27717.log
          171 kB
        5. 4.0_parameters
          12 kB
        6. 4.2_parameters
          16 kB
        7. mongo_4.0_diag.tar.gz
          105.17 MB
        8. clientlog.log
          406 kB
        9. clientsettings.jpg
          clientsettings.jpg
          231 kB

            Assignee:
            yuan.fang@mongodb.com Yuan Fang
            Reporter:
            veramasu@hcl.com venkataramans rama
            Votes:
            24 Vote for this issue
            Watchers:
            7 Start watching this issue

              Created:
              Updated:
              Resolved: