[SERVER-67217] High query response on primary (3-4 sec range) in mongo 4.2.20 version Created: 11/Jun/22  Updated: 18/Jan/23  Resolved: 01/Sep/22

Status: Closed
Project: Core Server
Component/s: Performance, Querying, WiredTiger
Affects Version/s: 4.2.20
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: KAPIL GUPTA Assignee: Chris Kelly
Resolution: Done Votes: 9
Labels: Bug
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

java client driver on client side - 3.12.9
mongo version - 4.2.20
Storage Engine - WT
Storage Type - tmpfs
Replica set: 7 members (4 non-arbiter and 3 arbiter, all voting members)
One of the member CMD as an example:
mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27021 --dbpath=/data/db/wt-27021 --replSet rs-shard-2 --quiet --slowms 500 --logpath /data/db/mongo-27021.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40


Attachments: PNG File image-2022-06-27-02-18-16-686.png     PNG File image-2022-06-27-02-23-01-877.png     PNG File image-2022-06-27-02-25-35-724.png     PNG File image-2022-06-27-02-26-11-985.png     PNG File image-2022-06-27-02-34-24-202.png     PNG File image-2022-06-27-02-34-36-415.png     PNG File image-2022-06-27-02-50-27-830.png     PNG File image-2022-06-30-06-21-23-457.png     PNG File image-2022-06-30-06-26-01-013.png     PNG File image-2022-06-30-06-28-57-244.png     PNG File image-2022-07-08-09-23-16-585.png     PNG File image-2022-07-08-09-33-45-541.png     PNG File image-2022-07-08-09-43-02-073.png     PNG File image-2022-07-08-10-24-12-094.png     PNG File image-2022-07-08-10-24-15-944.png     PNG File image-2022-09-01-15-09-05-589.png     PNG File image-2022-09-01-15-09-39-664.png     PNG File image-2022-09-01-15-15-42-817.png     Text File rs.status.txt     PNG File screenshot-1.png    
Issue Links:
Related
related to SERVER-72978 Random high query response and very h... Investigating
Operating System: ALL
Steps To Reproduce:

High query response was observed roughly 6 hours after rebooting one of the DB VMs.

Participants:

 Description   

Scenario: We ran tests with the same load on two MongoDB versions (4.0.27 and 4.2.20).
We found high query response times on the primary (in the range of 3-4 seconds) with 4.2.20, while 4.0.27 showed no issue.

High query response issue start time: 2022-06-10 20:03 UTC

A few slow query messages from the primary:

2022-06-10T20:03:52.016+0000 I  WRITE    [conn425] update drasessions_1.drasessions command: { q: { _id: { sessionid: "ClpGx0:172.16.241.111:5021:1654866656:0002395073" }, ts: { $lte: 1654891428485 }, srk: "serverb1.relay" }, u: { _id: { sessionid: "ClpGx0:172.16.241.111:5021:1654866656:0002395073" }, ts: 1654891428485, srk: "serverb1.relay", originHost: "site-b-client-calipers21-gx.pcef.gx1", originRealm: "client-consumer.calipers.pcef.gx", destHost: "site-b-server-calipers81-gx.pcef.b-gx-b1.2", destRealm: "server-consumer.calipers.pcef.gx", apn: "ims.com", imsi: "320000000295072", ipv6: "3101:0000:0004:809f", msisdn: "420000000295072", nextEvalTime: new Date(1654981428485), staleSessionExpiryCount: 3, staleBindingRefreshTime: new Date(1656619428486), sessionid: "ClpGx0:172.16.241.111:5021:1654866656:0002395073", systemId: "vpas-system-2", uuid: "vpas-system-21357665800" }, multi: false, upsert: true } planSummary: IXSCAN { _id: 1 } keysExamined:0 docsExamined:0 nMatched:0 nModified:0 upsert:1 keysInserted:2 numYields:0 queryHash:1380E850 planCacheKey:9313C0BA locks:{ ParallelBatchWriterMode: { acquireCount: { r: 1 } }, ReplicationStateTransition: { acquireCount: { w: 1 } }, Global: { acquireCount: { w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } }, Mutex: { acquireCount: { r: 2 } } } flowControl:{ acquireCount: 1 } storage:{} 3530ms
2022-06-10T20:03:52.016+0000 I  COMMAND  [conn52] command ipv6bindings_1.ipv6bindings command: find { find: "ipv6bindings", filter: { _id: "3101:0000:0009:11c1" }, limit: 1, singleBatch: true, $db: "ipv6bindings_1", $clusterTime: { clusterTime: Timestamp(1654891428, 246), signature: { hash: BinData(0, 1AE068B7041E0DF364A48B3B76192706D425B50D), keyId: 7072624733400858628 } }, lsid: { id: UUID("55caafc8-fbb1-4ede-a1d8-dd788fadb341") }, $readPreference: { mode: "nearest" } } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:520 locks:{ ReplicationStateTransition: { acquireCount: { w: 2 } }, Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } }, Mutex: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 3441ms

Attached files: 
1) rs.status.txt: contains the replica set status (attached to the ticket)
2) mongo-diag-log.tar.gz (needs to be uploaded): kindly provide a support portal link for the upload (it contains mongod logs and diagnostic.data for all data members). The file structure is given below:
root@xyz:~$ sudo tar -tf mongo-diag-log.tar.gz
mongo-27029_PRIMARY_2a.tar.gz
mongo-27029-SECONDARY_2b.tar.gz
mongo-27029-SECONDARY_8a.tar.gz
mongo-27029-SECONDARY_8b.tar.gz
root@xyz:~$ tar -tf mongo-27029_PRIMARY_2a.tar.gz
mongo-27029_PRIMARY.log
diagnostic.data/
diagnostic.data/metrics.interim
diagnostic.data/metrics.2022-06-10T14-51-15Z-00000
diagnostic.data/metrics.2022-06-10T22-51-19Z-00000
root@xyz~$ sudo tar -tf mongo-27029-SECONDARY_2b.tar.gz
mongo-27029-SECONDARY_2b.log
diagnostic.data/
diagnostic.data/metrics.2022-06-10T21-26-19Z-00000
diagnostic.data/metrics.interim
diagnostic.data/metrics.2022-06-10T04-56-24Z-00000
diagnostic.data/metrics.2022-06-11T05-01-19Z-00000
diagnostic.data/metrics.2022-06-10T13-51-31Z-00000
root@xyz:~$ sudo tar -tf mongo-27029-SECONDARY_8a.tar.gz
mongo-27029-SECONDARY_8a.log
diagnostic.data/
diagnostic.data/metrics.2022-06-09T18-40-08Z-00000
diagnostic.data/metrics.2022-06-10T03-25-08Z-00000
diagnostic.data/metrics.interim
diagnostic.data/metrics.2022-06-09T01-21-07Z-00000
diagnostic.data/metrics.2022-06-10T13-01-28Z-00000
diagnostic.data/metrics.2022-06-10T22-36-19Z-00000
diagnostic.data/metrics.2022-06-09T10-00-08Z-00000
root@xyz:~$ sudo tar -tf mongo-27029-SECONDARY_8b.tar.gz
mongo-27029-SECONDARY_8b.log
diagnostic.data/
diagnostic.data/metrics.2022-06-10T20-21-19Z-00000
diagnostic.data/metrics.2022-06-10T10-56-28Z-00000
diagnostic.data/metrics.2022-06-11T06-06-19Z-00000
diagnostic.data/metrics.interim
diagnostic.data/metrics.2022-06-10T01-50-08Z-00000
diagnostic.data/metrics.2022-06-09T06-09-14Z-00000
diagnostic.data/metrics.2022-06-09T15-55-08Z-00000



 Comments   
Comment by KAPIL GUPTA [ 22/Dec/22 ]

Hi Christopher,

We tested with 4.2 and 4.4, and again found the same issue on the 4.2 member.

This time we also collected disk read/write metrics, but we did not find any high disk usage.

Kindly reopen the case and please provide a link to upload the supporting logs.

 

Thanks,

Kapil

Comment by Chris Kelly [ 01/Sep/22 ]

Hi Kapil,

Thank you for your patience. We have looked through your data and have not found information that implicates the mongod in your latency issues. Unfortunately, at this time that means we're unable to determine the cause of the behavior from the given information.

Some points that were interesting:

  • In all cases you presented, there is high, unexplained disk usage (not from the mongod) that eventually leads the mongod to start queueing for a few seconds until it resolves itself. This is unexplained because it does not correlate with activity in the journal (the WT log) or the block manager. 
    • Point A: 2022-07-05T00:45:48.308Z 
  • The data you provided comparing 4.0 and 4.2 was mentioned as being from two separate deployments ("sites" in your case), so it is possible there are differences between the two that could explain the issue
  • In your 4.2 data, I see cyclic high disk usage every 15 minutes from some unexplained process that leads to the mongod queueing. I don't see this in your 4.0 data (and I see multiple extra drives on your 4.0 instance, which further suggests your sites vary in some other way that could explain the problem)

I would suggest an investigation into what may be occurring on this system to cause this, since it is not evident that it is coming from the mongod. As it stands I am going to close this because it seems something else is doing this every 15 minutes.

If you like, you can take this discussion further  by asking our community for help by posting on the MongoDB Developer Community Forums.

If the discussion there leads you to suspect a bug in the MongoDB server, then we'd want to investigate it as a possible bug here in the SERVER project. Specifically in this case, that would also entail figuring out the source of your disk activity here.

Regards,
Christopher

Comment by KAPIL GUPTA [ 16/Aug/22 ]

Hi Christopher,

Just a gentle reminder: please have a look.

 

Regards,

Kapil

Comment by KAPIL GUPTA [ 05/Aug/22 ]

Thanks very much, Christopher, for the quick response and for providing the link.

I have uploaded combinedmongologs.tar.gz to the portal. Kindly let us know your observations.

 

Regards,

Kapil

Comment by Chris Kelly [ 05/Aug/22 ]

Hi Kapil,

I've gone ahead and created another Amphora Portal link for you to upload to. 

As for not being able to collect gdb, I would also suggest the SIGUSR2 signal but I believe this is only supported in newer versions of MongoDB. The backtrace would clarify what is actually being worked on, so without it I'll have to infer what bad behavior is going on. I can take another look at your next set of logs at the timestamp you specified and see if I can discern something interesting.
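For reference, on a version that supports it the SIGUSR2 approach would look roughly like the sketch below. The PID (27401) is taken from your top output and the log path is an assumption based on your file names, so adjust both as needed:

# send SIGUSR2 to the affected mongod (27401 is the PID from your top output);
# on MongoDB versions that support it, mongod writes a backtrace of every thread to its log
kill -USR2 27401

# then pull the backtrace entries out of that member's log (path assumed from your setup)
grep -i backtrace /data/db/mongo-27045.log | tail -n 50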

Christopher

Comment by KAPIL GUPTA [ 05/Aug/22 ]

Hi Christopher,

We did the testing again and got a total of 3 occurrences of the issue (logs collected for 2 of them).

This time, we also collected additional logs (top output, journalctl logs, dmesg, mongotop, and mongostat).

Please find the details of the logs, along with observations, below:

Occurrence 1: the mongo server did not respond around the Aug 3 14:45:36 timestamp, which created multiple problems (high query response in the next few seconds; the client side got disconnect messages and tried to reconnect, which put pressure on mongod and increased its CPU usage)

  • supported logs1: in the mongo logs (name: mongo-27045.log.issue1.gz), we observed that the 3 timestamps (14:45:35, 14:45:36, 14:45:37) were not in order; 14:45:37 came before 14:45:35 and 14:45:36 in some instances, as shown below:
    2022-08-03T14:45:35.971+0000 D2 QUERY    [conn283] Relevant index 0 is kp: { _id: 1 } unique name: '(_id_, )' io: { v: 2, key: { _id: 1 }, name: "_id_", ns: "imsiApnbindings_1.imsiApnbindings" }
    2022-08-03T14:45:37.553+0000 D2 QUERY    [conn283] Only one plan is available; it will be run but will not be cached. ns: imsiApnbindings_1.imsiApnbindings query: { _id: ... } sort: {} projection: {}, planSummary: IXSCAN { _id: 1 }
    2022-08-03T14:45:37.553+0000 I  NETWORK  [conn81] end connection [2606:ae00:3001:8311:172:16:244:ae]:54464 (748 connections now open)
    2022-08-03T14:45:37.553+0000 I  NETWORK  [conn19] end connection [2606:ae00:3001:8311:172:16:244:bf]:46106 (747 connections now open)
    2022-08-03T14:45:35.972+0000 I  COMMAND  [conn43] command admin.$cmd command: isMaster { ismaster: 1, $db: "admin", $clusterTime: { clusterTime: ... } } numYields:0 reslen:1072 locks:{} protocol:op_msg 0ms
    2022-08-03T14:45:35.972+0000 D2 QUERY    [conn1093] Relevant index 0 is kp: { _id: 1 } unique name: '(_id_, )' io: { v: 2, key: { _id: 1 }, name: "_id_", ns: "imsiApnbindings_2.imsiApnbindings" }
    2022-08-03T14:45:35.973+0000 D2 QUERY    [conn1034] Relevant index 0 is kp: { _id: 1 } unique name: '(_id_, )' io: { v: 2, key: { _id: 1 }, name: "_id_", ns: "imsiApnbindings_2.imsiApnbindings" }
    2022-08-03T14:45:35.974+0000 D2 QUERY    [conn1146] Relevant index 0 is kp: { _id: 1 } unique name: '(_id_, )' io: { v: 2, key: { _id: 1 }, name: "_id_", ns: "imsiApnbindings_2.imsiApnbindings" }

2022-08-03T14:45:37.566+0000 I  NETWORK  [conn66] end connection [2606:ae00:3001:8311:172:16:244:c2]:35400 (737 connections now open)
2022-08-03T14:45:36.001+0000 I  COMMAND  [conn21] command admin.$cmd command: isMaster { ismaster: 1, $db: "admin", $clusterTime: { clusterTime: Timestamp(1659537935, 1283), signature: { hash: BinData(0, 04438EF10586291946AB881A538C20917AA3EDB7), keyId: 7126121256901935109 } } } numYields:0 reslen:1072 locks:{} protocol:op_msg 0ms
2022-08-03T14:45:35.998+0000 I  COMMAND  [conn46] command admin.$cmd command: isMaster { ismaster: 1, $db: "admin", $clusterTime: { clusterTime: Timestamp(1659537935, 987), signature: { hash: BinData(0, 04438EF10586291946AB881A538C20917AA3EDB7), keyId: 7126121256901935109 } } } numYields:0 reslen:1072 locks:{} protocol:op_msg 0ms
2022-08-03T14:45:36.001+0000 D2 QUERY    [conn1408] Relevant index 0 is kp: { _id: 1 } unique name: '(_id_, )' io: { v: 2, key: { _id: 1 }, name: "_id_", ns: "imsiApnbindings_2.imsiApnbindings" }
2022-08-03T14:45:36.005+0000 I  COMMAND  [conn45] command admin.$cmd command: isMaster { ismaster: 1, $db: "admin", $clusterTime: { clusterTime: Timestamp(1659537935, 1158), signature: { hash: BinData(0, 04438EF10586291946AB881A538C20917AA3EDB7), keyId: 7126121256901935109 } } } numYields:0 reslen:1072 locks:{} protocol:op_msg 0ms
2022-08-03T14:45:37.566+0000 I  NETWORK  [conn46] end connection

  • supported logs2: name: mongostat27045.txt: the 14:45:36 timestamp is missing
    insert query update delete getmore command dirty  used flushes vsize   res   qrw   arw net_in net_out conn                 set repl                time
        *0     4    749    715      25   147|0  1.7% 77.1%       0 6.07G 3.54G   0|1   1|3   847k   1.98m  764 rs-app_shard-imsi-1  PRI Aug  3 14:45:29.419
        *0     5    633    795      27   149|0  1.7% 77.2%       0 6.07G 3.54G   0|0   1|0   792k   2.16m  764 rs-app_shard-imsi-1  PRI Aug  3 14:45:30.414
        *0     8    747    862      26   150|0  1.7% 77.2%       0 6.07G 3.54G   0|0   1|0   900k   2.32m  764 rs-app_shard-imsi-1  PRI Aug  3 14:45:31.416
        *0     5    789    721      26   175|0  1.8% 77.2%       0 6.07G 3.54G   0|0   1|3   881k   2.26m  764 rs-app_shard-imsi-1  PRI Aug  3 14:45:32.418
        *0     3    763    734      25   149|0  1.8% 77.2%       0 6.07G 3.54G   0|1   1|1   856k   2.11m  764 rs-app_shard-imsi-1  PRI Aug  3 14:45:33.422
        *0     3    535    587      27   150|0  1.8% 77.3%       0 6.07G 3.55G   0|0   1|1   653k   1.80m  764 rs-app_shard-imsi-1  PRI Aug  3 14:45:34.416
        *0     4    782    708      26   186|0  1.8% 77.3%       0 6.07G 3.55G   0|1   1|0   869k   2.19m  764 rs-app_shard-imsi-1  PRI Aug  3 14:45:35.415
        *0     4    360    298       7    65|0  1.8% 77.3%       0 6.07G 3.55G 0|391 5|128   382k    585k  764 rs-app_shard-imsi-1  PRI Aug  3 14:45:37.579
        *0     4    136    113      18  1287|0  1.9% 77.3%       0 5.83G 3.60G   0|4  1|53   523k   4.22m  490 rs-app_shard-imsi-1  PRI Aug  3 14:45:37.798
        *0     6   1378    646      19  3142|0  1.9% 77.3%       0 6.03G 3.62G   0|0  1|43  1.79m   4.92m  724 rs-app_shard-imsi-1  PRI Aug  3 14:45:38.524
  • supported logs3: name: mongotop27045.txt: the 14:45:36 timestamp is missing

                               ns    total    read    write    2022-08-03T14:45:34Z
imsiApnbindings_1.imsiApnbindings    948ms     0ms    948ms
imsiApnbindings_2.imsiApnbindings    801ms     0ms    800ms
                   local.oplog.rs     13ms    13ms      0ms
                admin.system.keys      0ms     0ms      0ms
               admin.system.roles      0ms     0ms      0ms
               admin.system.users      0ms     0ms      0ms
             admin.system.version      0ms     0ms      0ms
           config.system.sessions      0ms     0ms      0ms
              config.transactions      0ms     0ms      0ms
  imsiApnbindings.imsiApnbindings      0ms     0ms      0ms

                               ns    total    read    write    2022-08-03T14:45:37Z
imsiApnbindings_2.imsiApnbindings    644ms     0ms    644ms
imsiApnbindings_1.imsiApnbindings    600ms     0ms    600ms
                   local.oplog.rs     18ms    18ms      0ms
                admin.system.keys      0ms     0ms      0ms
               admin.system.roles      0ms     0ms      0ms
               admin.system.users      0ms     0ms      0ms
             admin.system.version      0ms     0ms      0ms
           config.system.sessions      0ms     0ms      0ms
              config.transactions      0ms     0ms      0ms
  imsiApnbindings.imsiApnbindings      0ms     0ms      0ms

                                 ns       total      read       write    2022-08-03T14:45:38Z
  imsiApnbindings_2.imsiApnbindings    458048ms    1452ms    456595ms
  imsiApnbindings_1.imsiApnbindings    410508ms    5351ms    405157ms
                     local.oplog.rs        25ms      25ms         0ms
msisdnApnbindings.msisdnApnbindings         8ms       8ms         0ms
    imsiApnbindings.imsiApnbindings         6ms       6ms         0ms

  • supported logs4: we executed a script that collects the number of threads and the number of open file descriptors for the mongod primary process. Script details are given below:
    #!/bin/bash
    # Sample the mongod thread count and open file descriptor count once per second
    # (27401 is the PID of the mongod under observation).
    while true; do
     printf '%s -> %s\n' "$(date '+%Y-%m-%d %H:%M:%S:%3N')" "$(ps hH p 27401 | wc -l && ls -l /proc/27401/fd | wc -l)" >> thread.txt 2>&1
     sleep 1
    done

output file name: thread.txt: the 14:45:37 timestamp is missing

2022-08-03 14:45:34 -> 818
1628
2022-08-03 14:45:35 -> 820
1630
2022-08-03 14:45:36 -> 820
1632
2022-08-03 14:45:38 -> 838
1668
2022-08-03 14:45:39 -> 1412
2814
2022-08-03 14:45:40 -> 1412
2814

  • supported logs5: name: top_vm_output.log

top - 14:45:36 up 56 min,  1 user,  load average: 9.93, 8.67, 7.39
Tasks: 379 total,   1 running, 285 sleeping,   0 stopped,   0 zombie
%Cpu(s): 30.0 us,  8.3 sy,  0.0 ni, 40.0 id, 18.9 wa,  0.0 hi,  2.9 si,  0.0 st
KiB Mem : 65969880 total, 36489024 free, 17296064 used, 12184792 buff/cache
KiB Swap:  1046524 total,  1046524 free,        0 used. 40957040 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27401 root       5 -15 6367796 3.549g  40296 S  65.2  5.6  40:41.11 mongod
 3570 root      20   0    4792   1656   1372 D  55.3  0.0   0:28.48 gzip
13775 root       5 -15 5629592 3.483g  39692 S  14.9  5.5  11:11.79 mongod
13792 root       5 -15 5612844 3.473g  39568 S  14.9  5.5  11:35.67 mongod
13800 root       5 -15 5648408 3.509g  39656 S  14.6  5.6  11:06.83 mongod
 1369 root      20   0 2909596 101380  52392 S   9.9  0.2   8:23.87 dockerd
13802 root       5 -15 1976632 138672  38204 S   2.3  0.2   1:17.86 mongod
 8176 root      20   0  687072  16964   9080 S   2.0  0.0   0:42.09 node_exporter
13284 root      20   0  183120  59008  41724 S   2.0  0.1   1:11.44 consul

top - 14:45:40 up 56 min,  1 user,  load average: 9.78, 8.66, 7.39
Tasks: 381 total,   3 running, 283 sleeping,   0 stopped,   0 zombie
%Cpu(s): 39.0 us, 12.1 sy,  0.0 ni, 33.2 id, 10.3 wa,  0.0 hi,  5.5 si,  0.0 st
KiB Mem : 65969880 total, 36228116 free, 17424420 used, 12317344 buff/cache
KiB Swap:  1046524 total,  1046524 free,        0 used. 40724100 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27401 root       5 -15 6968176 3.660g  40296 S 163.9  5.8  40:47.06 mongod
 3570 root      20   0    4792   1656   1372 R  56.2  0.0   0:30.52 gzip
13792 root       5 -15 5612844 3.473g  39568 S  16.0  5.5  11:36.25 mongod
13775 root       5 -15 5629592 3.483g  39692 S  15.7  5.5  11:12.36 mongod
13800 root       5 -15 5648408 3.508g  39656 S  15.2  5.6  11:07.38 mongod
 1369 root      20   0 2909596 101380  52392 S  11.6  0.2   8:24.29 dockerd
13284 root      20   0  183120  59008  41724 S   2.5  0.1   1:11.53 consul
13802 root       5 -15 1976632 138672  38204 S   2.5  0.2   1:17.95 mongod
13779 root       5 -15 1698528 108160  37556 S   1.9  0.2   0:50.10 mongod

  • supported logs6: name: jornalctlissue1.logs
  • supported logs7: kernel logs, name: dmesg.txt
  • supported logs8: diagnostic.data

 

 

Occurrence 2: the same behavior was observed; the server did not respond for the 16:00:36, 16:00:37, and 16:00:38 timestamps

  • logs1: name: mongo-27045.log.issue2.gz : 16:00:39 time stamp came before 16:00:38 and 16:00:36 multiple times:
    2022-08-03T16:00:36.128+0000 D2 QUERY    [conn2189] Relevant index 0 is kp: { _id: 1 } unique name: '(_id_, )' io: { v: 2, key: { _id: 1 }, name: "_id_", ns: "imsiApnbindings_1.imsiApnbindings" }
    2022-08-03T16:00:36.128+0000 D2 QUERY    [conn2189] Only one plan is available; it will be run but will not be cached. ns: imsiApnbindings_1.imsiApnbindings query: { _id: ... } sort: {} projection: {}, planSummary: IXSCAN { _id: 1 }
    2022-08-03T16:00:36.128+0000 D2 QUERY    [conn2557] Relevant index 0 is kp: { _id: 1 } unique name: '(_id_, )' io: { v: 2, key: { _id: 1 }, name: "_id_", ns: "imsiApnbindings_1.imsiApnbindings" }
    2022-08-03T16:00:39.091+0000 D2 QUERY    [conn2557] Only one plan is available; it will be run but will not be cached. ns: imsiApnbindings_1.imsiApnbindings query: { _id: ... } sort: {} projection: {}, planSummary: IXSCAN { _id: 1 }
    2022-08-03T16:00:36.128+0000 D2 QUERY    [conn2209] Relevant index 0 is kp: { _id: 1 } unique name: '(_id_, )' io: { v: 2, key: { _id: 1 }, name: "_id_", ns: "imsiApnbindings_2.imsiApnbindings" }
    2022-08-03T16:00:39.091+0000 D2 QUERY    [conn2209] Only one plan is available; it will be run but will not be cached. ns: imsiApnbindings_2.imsiApnbindings query: { _id: ... } sort: {} projection: {}, planSummary: IXSCAN { _id: 1 }
    2022-08-03T16:00:36.129+0000 D2 QUERY    [conn2212] Relevant index 0 is kp: { _id: 1 } unique name: '(_id_, )' io: { v: 2, key: { _id: 1 }, name: "_id_", ns: "imsiApnbindings_1.imsiApnbindings" }
    2022-08-03T16:00:36.130+0000 D2 QUERY    [conn2462] Relevant index 0 is kp: { _id: 1 } unique name: '(_id_, )' io: { v: 2, key: { _id: 1 }, name: "_id_", ns: "imsiApnbindings_1.imsiApnbindings" }

  • logs2: the 16:00:36, 16:00:37, and 16:00:38 timestamps are missing in mongostat27045.txt
    insert query update delete getmore command dirty  used flushes vsize   res   qrw   arw net_in net_out conn                 set repl                time
        *0     5    705    652      27   168|0  2.4% 55.6%       0 6.42G 3.72G   0|0   1|1   797k   2.07m  762 rs-app_shard-imsi-1  PRI Aug  3 16:00:29.415
        *0     4    731    687      26   158|0  2.4% 55.6%       0 6.42G 3.72G   0|0   1|0   825k   2.05m  762 rs-app_shard-imsi-1  PRI Aug  3 16:00:30.415
        *0     7    747    765      26   142|0  2.5% 55.6%       0 6.42G 3.72G   0|1   1|0   862k   2.27m  762 rs-app_shard-imsi-1  PRI Aug  3 16:00:31.417
        *0     4    730    708      27   145|0  2.5% 55.6%       0 6.42G 3.72G   0|0   1|0   829k   2.16m  762 rs-app_shard-imsi-1  PRI Aug  3 16:00:32.414
        *0     2    719    623      26   146|0  2.5% 55.7%       0 6.42G 3.72G   0|3   1|0   790k   1.93m  762 rs-app_shard-imsi-1  PRI Aug  3 16:00:33.422
        *0     3    713    729      27   212|0  2.5% 55.7%       0 6.42G 3.72G   0|0   1|0   831k   2.27m  762 rs-app_shard-imsi-1  PRI Aug  3 16:00:34.416
        *0     4    761    703      25   168|0  2.6% 55.7%       0 6.42G 3.72G   0|1   1|0   849k   2.10m  762 rs-app_shard-imsi-1  PRI Aug  3 16:00:35.414
        *0     1    196    158       6    37|0  2.6% 55.7%       0 6.42G 3.72G 0|196 4|128   211k    422k  762 rs-app_shard-imsi-1  PRI Aug  3 16:00:39.121
        *0    28   1594    716       0   358|0  2.6% 55.7%       0 6.36G 3.76G 0|118 1|128  1.51m   2.76m  570 rs-app_shard-imsi-1  PRI Aug  3 16:00:39.227
        *0    *0     58     21      31  1610|0  2.6% 55.7%       0 5.85G 3.79G   0|0   1|0   393k   6.58m  210 rs-app_shard-imsi-1  PRI Aug  3 16:00:39.416
  • logs3: the 16:00:36, 16:00:37, and 16:00:38 timestamps are missing in mongotop27045.txt
                                   ns    total    read    write    2022-08-03T16:00:35Z
    imsiApnbindings_1.imsiApnbindings    379ms     0ms    379ms
    imsiApnbindings_2.imsiApnbindings    376ms     0ms    376ms
                       local.oplog.rs     13ms    13ms      0ms
                    admin.system.keys      0ms     0ms      0ms
                   admin.system.roles      0ms     0ms      0ms
                   admin.system.users      0ms     0ms      0ms
                 admin.system.version      0ms     0ms      0ms
               config.system.sessions      0ms     0ms      0ms
                  config.transactions      0ms     0ms      0ms
      imsiApnbindings.imsiApnbindings      0ms     0ms      0ms

                               ns    total    read    write    2022-08-03T16:00:39Z
imsiApnbindings_2.imsiApnbindings     77ms     0ms     77ms
imsiApnbindings_1.imsiApnbindings     53ms     0ms     53ms
                   local.oplog.rs      3ms     3ms      0ms
                admin.system.keys      0ms     0ms      0ms
               admin.system.roles      0ms     0ms      0ms
               admin.system.users      0ms     0ms      0ms
             admin.system.version      0ms     0ms      0ms
           config.system.sessions      0ms     0ms      0ms
              config.transactions      0ms     0ms      0ms
  imsiApnbindings.imsiApnbindings      0ms     0ms      0ms

                               ns       total       read       write    2022-08-03T16:00:40Z
imsiApnbindings_2.imsiApnbindings    805523ms     5389ms    800133ms
imsiApnbindings_1.imsiApnbindings    773690ms    10655ms    763034ms
                   local.oplog.rs        16ms       16ms         0ms
                admin.system.keys         0ms        0ms         0ms

  • logs4: top_vm_output.log
    top - 16:00:36 up  2:11,  3 users,  load average: 10.28, 8.53, 8.06
    Tasks: 393 total,   1 running, 294 sleeping,   3 stopped,   0 zombie
    %Cpu(s): 35.0 us,  7.8 sy,  0.0 ni, 45.7 id,  8.4 wa,  0.0 hi,  3.1 si,  0.0 st
    KiB Mem : 65969880 total, 26656104 free, 19064012 used, 20249764 buff/cache
    KiB Swap:  1046524 total,  1046524 free,        0 used. 37792588 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
27401 root       5 -15 6728284 3.724g  40360 S  74.2  5.9 108:10.39 mongod
 4489 root      20   0    4792   1572   1288 D  70.5  0.0   0:28.74 gzip
13792 root       5 -15 6111488 3.921g  39568 S  59.6  6.2  29:27.32 mongod
13800 root       5 -15 6077460 3.894g  39656 S  14.2  6.2  28:30.81 mongod
13775 root       5 -15 6128068 3.936g  39696 S  13.9  6.3  29:23.52 mongod
 1369 root      20   0 2909852 101944  52524 S   8.3  0.2  20:01.65 dockerd
13802 root       5 -15 1991472 153300  39548 S   2.3  0.2   3:21.02 mongod
 8176 root      20   0  687072  16964   9080 S   2.0  0.0   1:46.28 node_exporter
13284 root      20   0  183120  59240  41792 S   1.7  0.1   3:05.12 consul

top - 16:00:40 up  2:11,  3 users,  load average: 10.28, 8.53, 8.06
Tasks: 395 total,   3 running, 293 sleeping,   3 stopped,   0 zombie
%Cpu(s): 31.9 us,  6.8 sy,  0.0 ni, 35.4 id, 23.2 wa,  0.0 hi,  2.8 si,  0.0 st
KiB Mem : 65969880 total, 26688584 free, 19027816 used, 20253480 buff/cache
KiB Swap:  1046524 total,  1046524 free,        0 used. 37828984 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
13792 root       5 -15 6111488 3.921g  39568 S 104.4  6.2  29:30.86 mongod
27401 root       5 -15 6199976 3.789g  40360 S  33.0  6.0 108:11.51 mongod
13775 root       5 -15 6128068 3.936g  39696 S  22.7  6.3  29:24.29 mongod
 4489 root      20   0    4792   1572   1288 R  20.9  0.0   0:29.45 gzip
13800 root       5 -15 6077460 3.894g  39656 S  19.2  6.2  28:31.46 mongod
 1369 root      20   0 2909852 102104  52524 S   6.8  0.2  20:01.88 dockerd
13284 root      20   0  183120  59240  41792 S   2.7  0.1   3:05.21 consul
13802 root       5 -15 1991472 153300  39548 S   2.7  0.2   3:21.11 mongod
13779 root       5 -15 1698528 108160  37556 S   1.8  0.2   2:10.46 mongod

  • logs5: jornalctlissue2.logs
  • logs6: thread.txt
    2022-08-03 16:00:34 -> 816
    1624
    2022-08-03 16:00:36 -> 816
    1624
    2022-08-03 16:00:37 -> 816
    1619
    2022-08-03 16:00:40 -> 366
    724
    2022-08-03 16:00:41 -> 1207
  • logs 7 and 8: dmesg.txt, diagnostic.data

NOTE: the logs for both captured occurrences are combined as combinedmongologs.tar.gz and will be uploaded to the uploader portal.

Kindly provide the uploader portal link, as the existing one is showing as expired for me.

Thanks,
Kapil

Comment by KAPIL GUPTA [ 29/Jul/22 ]

Hi Christopher,

We tried to run the gdb command several times, but whenever we run it, all connections on the primary are dropped. Please find some mongo logs below as a reference from when we ran the gdb command:

2022-07-27T10:33:30.814+0000 I CONNPOOL [RS] Dropping all pooled connections to [2606:ae00:3001:8311:172:16:244:bb]:27023 due to NetworkInterfaceExceededTimeLimit: Connection pool has been idle for longer than the host timeout

 

2022-07-27T10:35:11.262+0000 I REPL     [replexec-14] Member [2606:ae00:3001:8311:172:16:244:11]:27023 is now in state RS_DOWN
2022-07-27T10:35:11.262+0000 I REPL     [replexec-14] Member [2606:ae00:3001:8311:172:16:244:9]:27023 is now in state RS_DOWN
2022-07-27T10:35:11.262+0000 I REPL     [replexec-14] Member [2606:ae00:3001:8311:172:16:244:57]:27023 is now in state RS_DOWN
2022-07-27T10:35:11.262+0000 I REPL     [replexec-14] Member [2606:ae00:3001:8311:172:16:244:ab]:27023 is now in state RS_DOWN
2022-07-27T10:35:11.262+0000 I REPL     [replexec-14] can't see a majority of the set, relinquishing primary
2022-07-27T10:35:11.262+0000 I REPL     [replexec-14] Stepping down from primary in response to heartbeat

 

2022-07-27T10:35:11.274+0000 I REPL     [replexec-14] transition to SECONDARY from PRIMARY
2022-07-27T10:35:11.277+0000 I NETWORK  [replexec-14] Skip closing connection for connection # 1691
2022-07-27T10:35:11.277+0000 I NETWORK  [replexec-14] Skip closing connection for connection # 1690
2022-07-27T10:35:11.277+0000 I NETWORK  [replexec-14] Skip closing connection for connection # 1689

 

2022-07-27T10:35:11.288+0000 I NETWORK  [conn1431] Error sending response to client: SocketException: Broken pipe. Ending connection from [2606:ae00:3001:8311:172:16:244:bd]:43456 (connection id: 1431)
2022-07-27T10:35:11.288+0000 I NETWORK  [conn1602] Error sending response to client: SocketException: Broken pipe. Ending connection from [2606:ae00:3001:8311:172:16:244:c2]:51016 (connection id: 1602)

 

So we are unable to collect the gdb output because of the above issue; could you please suggest any other alternative?

 

Thanks,

Kapil

Comment by KAPIL GUPTA [ 15/Jul/22 ]

Hi Christopher,

I am still working on the gdb output. In the meantime, we had the following observation: whenever we get high query response times, the arw (active readers/writers) value in mongostat increases (usually its value is 1 or 2), which increases load and CPU on the mongo member (top output, mongostat, and a log message are given below for reference).

Instance : 14:01:23

  • Mongo log message: /data/mongod-node/db/mongo-27039.log:2022-07-12T14:01:23.462+0000 I  COMMAND  [conn2748] command ipv6bindings_1.ipv6bindings command: find { find: "ipv6bindings", filter: { _id: "3201:0000:0000:1452" }, limit: 1, singleBatch: true, $db: "ipv6bindings_1", $clusterTime: { clusterTime: ... } } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:514 locks:{ ReplicationStateTransition: { acquireCount: ... }, Global: { acquireCount: { r: 2 } }, Database: { acquireCount: ... } } storage:{} protocol:op_msg 818ms

 

  • Top output:

top - 14:01:23 up 80 days, 22:06,  1 user,  load average: 29.54, 15.91, 12.82
Tasks: 378 total,   2 running, 278 sleeping,   0 stopped,   0 zombie
%Cpu(s): 53.9 us,  8.1 sy,  0.0 ni, 30.5 id,  0.0 wa,  0.0 hi,  7.4 si,  0.0 st
KiB Mem : 65969880 total, 30794112 free, 22055376 used, 13120392 buff/cache
KiB Swap:  1046524 total,   992300 free,    54224 used. 33317032 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 8540 root       5 -15 6569936 3.722g  42096 S 182.5  5.9   6946:53 mongod
 8536 root       5 -15 6585680 3.783g  41836 S 161.9  6.0   6973:49 mongod
 8538 root       5 -15 6405456 3.947g  41620 S  49.7  6.3   5158:44 mongod
 8537 root       5 -15 6526672 3.983g  42000 S  45.7  6.3   5151:51 mongod
18232 root      20   0  963048 146448   9436 S  16.9  0.2  14147:33 weaver
25927 root      20   0  217616 115828  62588 S  14.9  0.2   1974:22 consul

 

  • Mongostat (for 27039 port):

                                    host insert query update delete getmore command dirty  used flushes vsize   res    qrw    arw net_in net_out conn                   set repl                time
[2606:ae00:3001:8311:172:16:244:e]:27039     *0   448    954    299     376   453|0  1.5% 76.4%       0 6.26G 3.72G    0|0    1|0  1.26m   1.63m  317 rs-sess-ipv6-shard-20  PRI Jul 12 14:01:22.424
                         localhost:27039      0   154    311     53      68   132|0  1.4% 76.3%       0 6.26G 3.72G  *0|52 83|128   307k    393k  317 rs-sess-ipv6-shard-20  PRI Jul 12 14:01:23.374
[2606:ae00:3001:8311:172:16:244:e]:27039      0   141    277     37      41    98|0  1.4% 76.3%       0 6.26G 3.72G *24|77  11|39   253k    338k  317 rs-sess-ipv6-shard-20  PRI Jul 12 14:01:23.529
                         localhost:27039     *0   596   1525    440     305   391|0  1.5% 76.4%       0 6.26G 3.72G    0|0    1|1  1.55m   2.18m  317 rs-sess-ipv6-shard-20  PRI Jul 12 14:01:24.374
[2606:ae00:3001:8311:172:16:244:e]:27039     *0   669   1703    509     363   460|0  1.5% 76.4%       0 6.26G 3.72G    0|1    1|0  1.76m   2.47m  317 rs-sess-ipv6-shard-20  PRI Jul 12 14:01:24.422
                         localhost:27039     *0   574    915    302     355   430|0  1.5% 76.4%       0 6.26G 3.72G    0|0    1|1  1.24m   1.65m  317 rs-sess-ipv6-shard-20  PRI Jul 12 14:01:25.375

 

Information about our architecture:

  • Sharding DB details:

Here we use the sharding DB as a metadata DB that contains bucket and shard information. The application connects to the sharding DB, gets information about the shards, and, based on bucket IDs, stores data uniformly across all the shards.

sharding db member details(running inside container):
mongod --keyFile=/mongodb.key --ipv6 --bind_ip_all --replSet imsi-msisdn-sharddb --port 27019 --dbpath=/data/db/sharddb-27019 --logpath /data/db/mongo-27019.log --logappend --logRotate reopen

 

Actual shards information:

We have 2 sites [each site has 7 VMs (each VM has 4 mongod data members and 3 arbiters running inside a container)]

Each VM RAM: 64 GB; each mongod data member has 7.6 GB of tmpfs as storage and 4.4 GB of WiredTiger cache

ps -efww inside one of the containers [each container has 1 primary and 3 secondaries, and all members are from different replica sets]:

root       993     1  1 Jul01 ?        04:46:57 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27021 --dbpath=/data/db/wt-27021 --replSet rs-shard-2 --quiet --slowms 500 --logpath /data/db/mongo-27021.log --setParameter diagnosticDataCollectionEnabled=false --logappend --oplogSize 100 --logRotate reopen --wiredTigerCacheSizeGB .26 ----arbiter
root      1034     1  1 Jul01 ?        04:37:54 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27022 --dbpath=/data/db/wt-27022 --replSet rs-shard-3 --quiet --slowms 500 --logpath /data/db/mongo-27022.log --setParameter diagnosticDataCollectionEnabled=false --logappend --oplogSize 100 --logRotate reopen --wiredTigerCacheSizeGB .26 ----arbiter
root      1227     1  1 Jul01 ?        04:49:42 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27025 --dbpath=/data/db/wt-27025 --replSet rs-shard-6 --quiet --slowms 500 --logpath /data/db/mongo-27025.log --setParameter diagnosticDataCollectionEnabled=false --logappend --oplogSize 100 --logRotate reopen --wiredTigerCacheSizeGB .26 ----arbiter
root      1265     1 30 Jul01 ?        4-04:14:17 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27028 --dbpath=/data/db/wt-27028 --replSet rs-shard-ipv6-1 --quiet --slowms 500 --logpath /data/db/mongo-27028.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40 ----datamember
root      1601     1 41 Jul01 ?        5-18:43:19 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27030 --dbpath=/data/db/wt-27030 --replSet rs-shard-ipv6-3 --quiet --slowms 500 --logpath /data/db/mongo-27030.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40 ----datamember
root      1734     1 43 Jul01 ?        6-01:01:29 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27031 --dbpath=/data/db/wt-27031 --replSet rs-shard-ipv6-4 --quiet --slowms 500 --logpath /data/db/mongo-27031.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40 ----datamember
root     32116     1 27 Jul04 ?        2-22:35:32 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27029 --dbpath=/data/db/wt-27029 --replSet rs-shard-ipv6-2 --quiet --slowms 500 --logpath /data/db/mongo-27029.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40 ----datamember

df -hT inside one of the containers:

root@mongo-s109:/# df -hT
Filesystem     Type     Size  Used Avail Use% Mounted on
overlay        overlay   97G   24G   70G  26% /
tmpfs          tmpfs     64M     0   64M   0% /dev
tmpfs          tmpfs     32G     0   32G   0% /sys/fs/cgroup
shm            tmpfs     64M     0   64M   0% /dev/shm
/dev/sda3      ext4      97G   24G   70G  26% /w
tmpfs          tmpfs    500M  301M  200M  61% /data/db/wt-27025 ----arbiter
tmpfs          tmpfs    500M  301M  200M  61% /data/db/wt-27022 ----arbiter
tmpfs          tmpfs    7.6G  1.5G  6.2G  20% /data/db/wt-27031 ----datamember
tmpfs          tmpfs    7.6G  1.5G  6.2G  20% /data/db/wt-27029 ----datamember
tmpfs          tmpfs    7.6G  1.5G  6.2G  20% /data/db/wt-27028 ----datamember
tmpfs          tmpfs    7.6G  1.5G  6.2G  20% /data/db/wt-27030 ----datamember
tmpfs          tmpfs    500M  301M  200M  61% /data/db/wt-27021 ----arbiter

NOTE: We are running site 1 on 4.0.27 and site 2 on 4.2.20. No high query response issue was seen on site 1; we are seeing it only on site 2. Out of the 4 data members, some are from site 1 and some from site 2, and primaries are equally distributed across both sites.

 

Thanks,
Kapil

Comment by Chris Kelly [ 08/Jul/22 ]

Kapil,

That issue should be fixed by SERVER-56054, which affects 4.2.15, below the version you're running, so I'm not sure it is fully related. I'd also be confused about it resurfacing if you've already resolved the glibc issue yourself.

Just to revisit some items, could you please clarify your topology? You mentioned you are running a sharded cluster, but are not using a mongos. This isn't a standard/recommended way of setting up a sharded cluster. Please explain:

  • How many shards are in your sharded cluster?
  • If you are not using a mongos, how are you handling routing?
  • You have a lot of memory available on the system running the primary - do you share the tmpfs with this memory pool or with any other services?

Looking at what we have available, it looks like this is being used as a replica set only, so I would want to confirm this before making any assumptions.

In addition, because we are not seeing anything that directly implicates the mongod at the time of latency yet, could you please get us the following information:

Tar up the information in the following directories and upload it:

  • /var/log/dmesg
  • /var/log/messages
  • /var/log/syslog
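For example, something like the following would work (the archive name is just illustrative, and some of these paths may not exist on every distribution):

# archive the requested system logs for upload
sudo tar czf system-logs.tar.gz /var/log/dmesg /var/log/messages /var/log/syslog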

Re-acquire the logs and FTDC without --quiet (and at the same log level as before) - it looks like some diagnostic information may have been omitted as a result.
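For reference, using the member command from your Environment section, that would look like the following (same options as before, just without --quiet):

mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27021 --dbpath=/data/db/wt-27021 --replSet rs-shard-2 --slowms 500 --logpath /data/db/mongo-27021.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40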

Get a stack trace at the time of the latency (this may be difficult, but we're running out of exact options to see what is truly happening at this time)

# collect stack trace samples with gdb
gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' >gdb.txt

Christopher

Comment by KAPIL GUPTA [ 08/Jul/22 ]

Hi Christopher,

I have uploaded the logs from before 00:45:56 as well (file name: mongo-27029.log_before45_56.gz).

Thanks for your analysis. Please find my responses to your recommendations below:

  • consider the use of an additional drive to separate your data and journaling

<R> Here we are using tmpfs (RAM) for both data storage and journaling; we are not using disk. 

  • tune your kernel dirty threshold

<R> We use the same environment for the 4.0.27 site as well, and we tried the same repro scenario (power off and on) 3 times, but we did not get high query response on 4.0.27.

As we are running mongod in a docker environment, is there any specific tuning we need to do for it?

  • check for other processes that may be using your disk at the time before query latency occurs.

<R> We did not find any additional process consuming disk. Is there any possibility that mongod itself is not responding, like the issue we faced in SERVER-63402 (though we fixed that one by applying the glibc fix)?

 

Thanks,

Kapil

Comment by Chris Kelly [ 08/Jul/22 ]

Kapil,

The first thing that seems to be happening in this timeline is that something is using up system memory and maxes out your disk utilization (point A: 2022-07-05T00:45:48.011Z) BEFORE you experience latency/stalling of mongod at point B (2022-07-05T00:45:59.076Z). However, without the logs I can't discern exactly what is happening with the primary node at that time. The logs would indicate if there was a slow query at this point that precipitated this disk utilization and subsequent stalling.

However, given that you run the same query filter at different points and with differing performance, it would be strange if the query was directly responsible here. The CPU usage/stalling of the mongod only happens after the disk receives lots of reads and writes at the same time:

Since the amount of documents scanned doesn't appear to notably change beyond your normal workload, it's strange how it would suddenly occur by re-running queries that normally run OK for you at other times.

One thing you can consider is tuning your kernel dirty threshold, since it appears to correlate with the high disk usage.
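If you go that route, a rough sketch would be something like the following (the values are purely illustrative, not a recommendation for your workload):

# show the current writeback thresholds
sysctl vm.dirty_ratio vm.dirty_background_ratio

# example only: start background writeback earlier and cap dirty memory lower
sudo sysctl -w vm.dirty_background_ratio=5
sudo sysctl -w vm.dirty_ratio=10

# to persist across reboots, put the same keys in /etc/sysctl.conf or /etc/sysctl.d/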

So, your logs begin at 2022-07-05T00:45:56.003+0000 which does not include the time period at which your disk usage begins (point A: 2022-07-05T00:45:48.011Z) so I don't know what precipitated it exactly. If you could supply the logs including the time covering that period just before, that might help see if a specific query triggered it (as it should show up as a slow query). I would also keep an eye on whether there may be other processes that start using your disk around when this happens. 

So, to summarize:

Possible Remediations/Workarounds:

  • tune your kernel dirty threshold
  • check for other processes that may be using your disk at the time before query latency occurs
  • consider colocating your journal, however this likely won't affect this behavior since journaling doesn't appear impacted by the latency.

More information we could use:

  • Provide the logs covering the time when hard drive disk usage increases (in this case, it was just a few seconds before your provided logs)

Regards,

Christopher

Comment by KAPIL GUPTA [ 05/Jul/22 ]

Please find my inline responses, as before:

"Just to clarify - you are running the same workload (and the same find queries) on the primary running 4.0, correct? Just double checking because I do not see output in your logs on 4.0 to confirm that the same queries are running."

<R> Yes, it is running the same workload. Please find mongostat output from the 4.0.27 primary member for reference:

insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn             set repl                time
    *0    59    507    217     996  1207|0  1.6% 75.6%       0 5.17G 3.17G 0|0 1|0  2.44m   2.16m  290 rs-shard-ipv6-1  PRI Jul  5 05:30:01.815
    *0    55    534    199    1071  1291|0  1.6% 75.6%       0 5.17G 3.17G 0|0 1|0  2.62m   2.25m  290 rs-shard-ipv6-1  PRI Jul  5 05:30:02.796
    *0    49    504    207    1007  1198|0  1.6% 75.6%       0 5.17G 3.17G 0|0 1|0  2.42m   2.09m  290 rs-shard-ipv6-1  PRI Jul  5 05:30:03.796
    *0    46    523    172     943  1157|0  1.6% 75.6%       0 5.17G 3.17G 0|0 1|0  2.32m   2.06m  290 rs-shard-ipv6-1  PRI Jul  5 05:30:04.796
    *0    59    547    189     971  1216|0  1.6% 75.6%       0 5.17G 3.17G 0|0 1|0  2.46m   2.18m  290 rs-shard-ipv6-1  PRI Jul  5 05:30:05.792
    *0    56    504    195     969  1175|0  1.7% 75.7%       0 5.17G 3.17G 0|0 1|0  2.37m   2.05m  290 rs-shard-ipv6-1  PRI Jul  5 05:30:06.819
    *0    64    517    215    1030  1255|0  1.7% 75.7%       0 5.17G 3.17G 0|0 1|0  2.52m   2.17m  290 rs-shard-ipv6-1  PRI Jul  5 05:30:07.794
    *0    62    505    208    1030  1249|0  1.7% 75.7%       0 5.17G 3.17G 0|0 1|1  2.50m   2.14m  290 rs-shard-ipv6-1  PRI Jul  5 05:30:08.797
    *0    56    512    197     921  1107|0  1.7% 75.7%       0 5.17G 3.17G 0|0 1|0  2.25m   2.05m  290 rs-shard-ipv6-1  PRI Jul  5 05:30:09.794
    *0    51    530    197     962  1201|0  1.7% 75.7%       0 5.17G 3.17G 0|0 1|0  2.42m   2.12m  290 rs-shard-ipv6-1  PRI Jul  5 05:30:10.794

 

"4.2 Primary logs with mentioned verbosity "

<R> I have uploaded the requested logs; please find the details (tar file) below:

cps@system1:~$ tar -tf mongo-27029_PRIMARY_verbose.tar.gz
mongo-27029-PRIMARY-verbose.log.gz
diagnostic.data/
diagnostic.data/metrics.2022-07-04T23-52-58Z-00000
diagnostic.data/metrics.interim

 

More details about the recreated issue:

  • Repro steps: 

          VM power off time: 2022-07-04T23:51:06

          VM power on time: 2022-07-04T23:52:50

          issue time: 2022-07-05T00:46:05

  • mongostat of 4.2.20 Primary

insert query update delete getmore command dirty  used flushes vsize   res qrw arw net_in net_out conn             set repl                time
    *0    53    513    199     826  1011|0  1.4% 61.1%       0 4.52G 2.41G 0|0 1|0  2.00m   1.99m  312 rs-shard-ipv6-2  PRI Jul  5 01:06:30.956
    *0    48    500    196     902  1075|0  1.4% 61.1%       0 4.52G 2.41G 0|0 1|0  2.09m   2.05m  312 rs-shard-ipv6-2  PRI Jul  5 01:06:31.955
    *0    49    510    200     844  1027|0  1.4% 61.1%       0 4.52G 2.41G 0|0 1|0  2.04m   2.04m  312 rs-shard-ipv6-2  PRI Jul  5 01:06:32.950
    *0    66    468    233     867  1077|0  1.4% 61.1%       0 4.52G 2.41G 0|0 1|1  2.10m   2.03m  312 rs-shard-ipv6-2  PRI Jul  5 01:06:33.961
    *0    53    465    209     853  1023|0  1.4% 61.2%       0 4.52G 2.41G 0|0 1|0  2.00m   1.99m  312 rs-shard-ipv6-2  PRI Jul  5 01:06:34.950
    *0    62    504    207     935  1109|0  1.4% 61.2%       0 4.52G 2.41G 0|0 1|2  2.17m   2.16m  312 rs-shard-ipv6-2  PRI Jul  5 01:06:35.951
    *0    54    472    198     875  1060|0  1.4% 61.2%       0 4.52G 2.41G 0|0 1|0  2.08m   2.00m  312 rs-shard-ipv6-2  PRI Jul  5 01:06:36.950
    *0    40    494    219     864  1038|0  1.5% 61.2%       0 4.52G 2.41G 0|0 1|0  2.04m   2.05m  312 rs-shard-ipv6-2  PRI Jul  5 01:06:37.949
    *0    51    467    204     845   984|0  1.5% 61.2%       0 4.52G 2.41G 0|0 1|0  1.95m   1.94m  312 rs-shard-ipv6-2  PRI Jul  5 01:06:38.950
    *0    47    536    207     934  1116|0  1.5% 61.2%       0 4.52G 2.41G 0|0 1|0  2.21m   2.19m  312 rs-shard-ipv6-2  PRI Jul  5 01:06:39.950

 

  • Suspected messages from the latest logs (as an example):

Slow WT transaction. Lifetime of SnapshotId 19177106 was 6148ms

 

Thanks,

Kapil

Comment by Chris Kelly [ 01/Jul/22 ]

Kapil,

  • Just to clarify - you are running the same workload (and the same find queries) on the primary running 4.0, correct? Just double checking because I do not see output in your logs on 4.0 to confirm that the same queries are running.
  • Can you please resubmit the logs for 4.2 Primary with the following log level? There were uasserts near the end of your slow queries that were not visible in your logs that might better describe what was happening. Once you observe the issue again, we can look at that output.
    • db.adminCommand (will not persist across restarts), the following sets the default verbosity level to 1, the query to 2, the storage to 2, and the storage.journal to 1:

      db.adminCommand( {
         setParameter: 1,
         logComponentVerbosity: {
            verbosity: 1,
            query: { verbosity: 2 },
            storage: {
               verbosity: 2,
               journal: {
                  verbosity: 1
               }
            }
         }
      } )

      If you find this to be too much logging, you can try just setting it to loglevel: 1 for all instead. However, the extra information up front would potentially decrease the amount of back-and-forth on this.
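      If useful, you can confirm the current levels afterwards with something like the following (the port is taken from your example member; add the authentication options your keyFile setup requires):

      # add -u/-p/--authenticationDatabase as needed for your auth setup
      mongo --port 27021 --eval 'printjson(db.adminCommand({ getParameter: 1, logComponentVerbosity: 1 }))'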

So, generally speaking, the FTDC is showing a huge CPU usage spike at the time of your find() queries which affected the entire server. Readers and writers were being queued up until a uassert was output, and then your metrics returned to normal.

Queued readers/writers on 4.2 Primary:

2022-06-10T20:03:52.016+0000 I  COMMAND  [conn52] command ipv6bindings_1.ipv6bindings command: find { find: "ipv6bindings", filter: { _id: "3101:0000:0009:11c1" }, limit: 1, singleBatch: true, $db: "ipv6bindings_1", $clusterTime: { clusterTime: Timestamp(1654891428, 246), signature: { hash: BinData(0, 1AE068B7041E0DF364A48B3B76192706D425B50D), keyId: 7072624733400858628 } }, lsid: { id: UUID("55caafc8-fbb1-4ede-a1d8-dd788fadb341") }, $readPreference: { mode: "nearest" } } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:520 locks:{ ReplicationStateTransition: { acquireCount: { w: 2 } }, Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } }, Mutex: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 3441ms

It doesn't make sense, however, that your find() query would suddenly stop performing as it usually does (given that you ran other find queries with the same filter that ran <700ms).

In fact, it seems most work in general is stopping at this point, and it doesn't appear to be waiting on IO at first glance. The higher log level should help give more context around what is happening at this exact point.

Comment by KAPIL GUPTA [ 29/Jun/22 ]

Please find the inline responses (marked <R>) for the respective items:

  • For your workload, please run explain(true) on the exact query/queries you are experiencing high response time issues with to get additional ExecutionStats information. The default mode is "queryPlanner". MongoDB interprets true as "allPlansExecution" and false as "queryPlanner". This will help narrow down possible reasons for the regression in response time.

<R> Our application functionality immediately updates and deletes records, so those exact documents won't be present if we try to find them manually. We also tested in different setups and the queries are not the same in all cases, but if this information is still required, we can provide it by recreating the issue.

  • The logs FTDC of your PRIMARY node on 4.0 running the workload you are mentioning (so we can compare with the 4.2 data you submitted)

<R> I have uploaded FTDC for the same on upload portal --> file name: mongo-4.0.27-PRIMARY.tar.gz

  • Provide some reproduction steps so we can also verify this independently:

<R> I have already added repro steps in comment section. kindly check "Troubleshooting performed" section in first comment.

    • What is the shape of your current documents in this case? Can you provide a sample document?

<R> Please find the mongostat output for the primary member:

insert query update delete getmore command 
    *0    22    205     58     718   841|0 
    *0    12    219     86     769   880|0 
    *0    21    200     73     651   794|0 
    *0    20    219     74     696   857|0 
    *0    20    215     68     723   848|0 
    *0    19    212     62     741   867|0 
    *0    18    211     57     715   833|0 
    *0    15    215     67     790   902|0 
    *0    17    228     66     740   886|0 
    *0    23    206     66     726   855|0 

Sample document: 

rs-shard-ipv6-2:PRIMARY> db.drasessions.findOne()
{
        "_id" : {
                "sessionid" : "ClpGx0:172.16.241.111:5933:1654866756:0151474820"
        },
        "ts" : NumberLong("1656402314141"),
        "srk" : "server101.sitebstandalone",
        "originHost" : "site-b-client-calipers50-gx.pcef.gx-a",
        "originRealm" : "client-consumer.calipers.pcef.gx",
        "destHost" : "site-b-server-calipers101-gx.pcef.gx-8",
        "destRealm" : "server-consumer.calipers.pcef.gx",
        "apn" : "phone",
        "imsi" : "580034821",
        "ipv4" : "51.0.18.214",
        "ipv6" : "6206:ae00:0000:12d5",
        "msisdn" : "83404821",
        "nextEvalTime" : ISODate("2022-06-29T08:45:14.141Z"),
        "staleSessionExpiryCount" : 3,
        "staleBindingRefreshTime" : ISODate("2022-07-18T07:45:14.141Z"),
        "sessionid" : "ClpGx0:172.16.241.111:5933:1654866756:0151474820",
        "systemId" : "vpas-system-2",
        "uuid" : "vpas-system-2252020078"
}
rs-shard-ipv6-2:PRIMARY> use ipv6bindings_1
switched to db ipv6bindings_1
rs-shard-ipv6-2:PRIMARY> db.ipv6bindings.findOne()
{
        "_id" : "3101:0000:0002:dd53",
        "ts" : NumberLong("1656483427273"),
        "staleBindingExpiryTime" : ISODate("2022-07-09T06:17:07.274Z"),
        "srk" : "serverb1.relay",
        "fqdn" : "site-b-client-calipers21-gx.pcef.gx1",
        "sessionid" : "ClpGx0:172.16.241.111:5021:1654866656:0159487757",
        "uuid" : "vpas-system-21671833548",
        "systemId" : "vpas-system-2"
}

    • Do you have an index on this collection?

<R> yes, please find the index details given below:

Index details for the drasessions collection in the drasessions_1 db:

rs-shard-ipv6-2:PRIMARY> db.drasessions.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "systemId" : 1,
                        "nextEvalTime" : 1
                },
                "name" : "systemId_1_nextEvalTime_1",
                "ns" : "drasessions_1.drasessions"
        },
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "id",
                "ns" : "drasessions_1.drasessions"
        }
]

Index details for the ipv6bindings collection in the ipv6bindings_1 db:

rs-shard-ipv6-2:PRIMARY> db.ipv6bindings.getIndexes()
[
        {
                "v" : 2,
                "key" : {
                        "systemId" : 1,
                        "staleBindingExpiryTime" : 1
                },
                "name" : "systemId_1_staleBindingExpiryTime_1",
                "ns" : "ipv6bindings_1.ipv6bindings"
        },
        {
                "v" : 2,
                "key" : {
                        "source" : 1
                },
                "name" : "source_1",
                "ns" : "ipv6bindings_1.ipv6bindings"
        },
        {
                "v" : 2,
                "key" : {
                        "_id" : 1
                },
                "name" : "id",
                "ns" : "ipv6bindings_1.ipv6bindings"
        }
]

    • Is this running on a sharded collection? If so, is this the primary shard of the collection?

<R> Yes, we are using a sharding DB collection that stores the shard information, and the actual data is stored on the shards (each a replica set), for which we provided logs (primary and secondaries).

    • Do you observe this latency running the command on mongos or directly on the node? (Or both?)

<R> We are not using mongos; we use the sharding DB. Latency is observed on the shards (nodes).

  • I don't notice consistent spikes in query response time beyond the exact timestamp you mentioned (2022-06-10 20:03 UTC). Is this issue sporadic or is it always reproducible with a specific query? I notice you have constant workloads that don't have this latency normally, and I'm curious what the difference is between the queries you are triggering the latency with versus not triggering it with.

<R> We observed 2 more spike instances, but logs for those were not added because the next occurrence came only after 28 hours; the details are added in the first comment, under the "Additional info" section.

The issue is reproducible, but it occurs at random intervals.

The issue is not tied to specific queries; we also checked this on different setups.

Comment by Chris Kelly [ 27/Jun/22 ]

Kapil,

Thanks for your patience. After looking through your FTDC, I can observe the slow queries taking place specifically on your primary. I am not observing anything else really out of the ordinary on your secondary nodes. However, I don't have FTDC to show what your primary was behaving like before you upgraded to 4.2, so I don't really have much to compare against - and there's nothing immediately obvious as a culprit just yet.

It looks like this is happening to multiple queries and not just the one you ran explain() on. I'd like to dive deeper into that to get more information.

2022-06-10T20:03:52.009+0000 I  WRITE    [conn357] remove ipv6bindings_1.ipv6bindings command: { q: { _id: "6206:ae00:0000:b06b", sessionid: "ClpGx1:172.16.241.111:5934:1654866756:0002385157" }, limit: 1 } planSummary: IXSCAN { _id: 1 } keysExamined:0 docsExamined:0 ndeleted:0 numYields:0 queryHash:E3DFA639 planCacheKey:5384A64B locks:{ ParallelBatchWriterMode: { acquireCount: { r: 1 } }, ReplicationStateTransition: { acquireCount: { w: 1 } }, Global: { acquireCount: { w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } }, Mutex: { acquireCount: { r: 1 } } } flowControl:{ acquireCount: 1, timeAcquiringMicros: 1 } storage:{} 3461ms
2022-06-10T20:03:52.009+0000 I  WRITE    [conn264] remove ipv6bindings_1.ipv6bindings command: { q: { _id: "6206:ae00:0000:3b51", sessionid: "ClpGx0:172.16.241.111:5933:1654866756:0002385193" }, limit: 1 } planSummary: IXSCAN { _id: 1 } keysExamined:0 docsExamined:0 ndeleted:0 numYields:0 queryHash:E3DFA639 planCacheKey:5384A64B locks:{ ParallelBatchWriterMode: { acquireCount: { r: 1 } }, ReplicationStateTransition: { acquireCount: { w: 1 } }, Global: { acquireCount: { w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } }, Mutex: { acquireCount: { r: 1 } } } flowControl:{ acquireCount: 1 } storage:{} 3317ms
2022-06-10T20:03:52.016+0000 I  COMMAND  [conn52] command ipv6bindings_1.ipv6bindings command: find { find: "ipv6bindings", filter: { _id: "3101:0000:0009:11c1" }, limit: 1, singleBatch: true, $db: "ipv6bindings_1", $clusterTime: { clusterTime: Timestamp(1654891428, 246), signature: { hash: BinData(0, 1AE068B7041E0DF364A48B3B76192706D425B50D), keyId: 7072624733400858628 } }, lsid: { id: UUID("55caafc8-fbb1-4ede-a1d8-dd788fadb341") }, $readPreference: { mode: "nearest" } } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:520 locks:{ ReplicationStateTransition: { acquireCount: { w: 2 } }, Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } }, Mutex: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 3441ms

To look into what is exactly happening, and to better illustrate the regression taking place, can you please provide the following:

  • For your workload, please run explain(true) on the exact query/queries that show the high response times, so we get the additional executionStats information. The default verbosity is "queryPlanner"; MongoDB interprets true as "allPlansExecution" and false as "queryPlanner". This will help narrow down possible reasons for the regression in response time (see the shell sketch after this list).
  • The logs and FTDC of your PRIMARY node on 4.0 running the workload you mentioned (so we can compare with the 4.2 data you submitted)
  • Provide some reproduction steps so we can also verify this independently:
    • What is the shape of your current documents in this case? Can you provide a sample document?
    • Do you have an index on this collection?
    • Is this running on a sharded collection? If so, is this the primary shard of the collection?
    • Do you observe this latency running the command on mongos or directly on the node? (Or both?)
  • I don't notice consistent spikes in query response time beyond the exact timestamp you mentioned (2022-06-10 20:03 UTC). Is this issue sporadic or is it always reproducible with a specific query? I notice you have constant workloads that don't have this latency normally, and I'm curious what the difference is between the queries you are triggering the latency with versus not triggering it with.
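
For illustration, a minimal mongo shell sketch of what I mean, using the namespaces and _id shapes from the slow-query log lines above (the literal filter values are only placeholders copied from those log entries):

// Run on the PRIMARY; explain("allPlansExecution") is equivalent to explain(true).
use ipv6bindings_1
db.ipv6bindings.find(
    { _id: "6206:ae00:0000:b06b", sessionid: "ClpGx1:172.16.241.111:5934:1654866756:0002385157" }
).explain("allPlansExecution")

use drasessions_1
db.drasessions.find(
    { _id: { sessionid: "ClpGx0:172.16.241.111:5021:1654866656:0002395073" } }
).explain("allPlansExecution")

The executionStats section of that output should show where the time is spent in the winning plan.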

I will also look into this further as well.

Christopher

Comment by KAPIL GUPTA [ 27/Jun/22 ]

Hi Christopher,

Could you please provide us with any temporary workaround (WA) for this issue, as we have a release planned soon (we have also requested this via our organization (Cisco) email)? It would really help us, as we are blocked on the release.

 

Thanks,

Kapil

Comment by KAPIL GUPTA [ 24/Jun/22 ]

Hi Christopher,

  • Mongo logs on 4.0

Logs for 4.0.27 are attached in the same zip file; out of the 4 data-bearing members, 2 are running 4.0.27. Please find the details for the 4.0.27 members below:

mongo-27029-SECONDARY_8a.tar.gz
mongo-27029-SECONDARY_8b.tar.gz

  • query plan

Please find the query plans for both 4.0.27 and 4.2.20 below:

4.0.27:

rs-shard-ipv6-2:SECONDARY> db.drasessions.find().explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "drasessions_1.drasessions",
                "indexFilterSet" : false,
                "parsedQuery" :

{                 },
                "winningPlan" : {                         "stage" : "COLLSCAN",                         "direction" : "forward"                 },
                "rejectedPlans" : [ ]
        },
        "serverInfo" : {                 "host" : "mongo-s109",                 "port" : 27029,                 "version" : "4.0.27",                 "gitVersion" : "d47b151b55f286546e7c7c98888ae0577856ca20"         },
        "ok" : 1,
        "operationTime" : Timestamp(1656045744, 205),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1656045744, 205),
                "signature" : {                         "hash" : BinData(0,"AZMjfFZeJm0AtPqm+P1EKDCoj8Q="),                         "keyId" : NumberLong("7072624733400858628")                 }
        }
}

 

4.2.20:

rs-shard-ipv6-2:PRIMARY> db.drasessions.find().explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "drasessions_1.drasessions",
                "indexFilterSet" : false,
                "parsedQuery" : {                 }

,
                "queryHash" : "8B3D4AB8",
                "planCacheKey" : "8B3D4AB8",
                "winningPlan" :

{                         "stage" : "COLLSCAN",                         "direction" : "forward"                 }

,
                "rejectedPlans" : [ ]
        },
        "serverInfo" :

{                 "host" : "mongo-s109",                 "port" : 27029,                 "version" : "4.2.18",                 "gitVersion" : "f65ce5e25c0b26a00d091a4d24eec1a8b3a4c016"         }

,
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1656045392, 89),
                "signature" :

{                         "hash" : BinData(0,"ZLZuuzhWqgu39YINJXNZ1aspBds="),                         "keyId" : NumberLong("7072624733400858628")                 }

        },
        "operationTime" : Timestamp(1656045392, 89)
}

 

Thanks,

Kapil

Comment by Chris Kelly [ 23/Jun/22 ]

Hi Kapil,

Thank you for your patience! To look into this further, could you also provide the following from your testing:

  • Mongo logs on 4.0 as well as 4.2 (you submitted 4.2 already)
  • Add $explain to your queries so we can examine the query plan being used (see the brief sketch after this list)
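
For reference, a brief sketch of one way to do this from the shell (the filter value is only a placeholder taken from the earlier log lines), using the explain command rather than the $explain query modifier:

// Wrap the find command in an explicit explain; "executionStats" is also a valid verbosity.
use ipv6bindings_1
db.runCommand({
    explain: { find: "ipv6bindings", filter: { _id: "3101:0000:0009:11c1" }, limit: 1, singleBatch: true },
    verbosity: "allPlansExecution"
})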

Christopher

Comment by KAPIL GUPTA [ 21/Jun/22 ]

Hi Chris,

Gentle Reminder! did you get something from logs?

Thanks,

Kapil

Comment by KAPIL GUPTA [ 15/Jun/22 ]

Hi Chris,

I have uploaded file(mongo-diag-log.tar.gz- file structure is provided in description) on upload portal.

Kindly look into it.

 

Thanks,

Kapil

Comment by Chris Kelly [ 15/Jun/22 ]

Hi Kapil,

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Let us know when you've gone ahead and submitted your logs.

Regards,
Christopher
 

Comment by KAPIL GUPTA [ 12/Jun/22 ]

Troubleshooting performed:

  • We stopped and restarted the load; we did not see high query response times.
  • We stopped and started all mongo instances and then ran the load test; we did not see high query response times.
  • More detail on the repro steps: high query responses are seen only from the VM that we rebooted (the error frequency is not fixed). We performed the test on a different setup as well; whichever VM we reboot, high query responses are seen from the mongo primary member on that VM.
  • We performed the same tests with full load, 50% of full load, and 25% of full load, and observed high query responses in all cases.

Additional info:

  • VM reboot time: 2022-06-10T14:50:25
  • The high query response increased (~6 sec) in the second occurrence, given below:

2022-06-11T18:41:56.312+0000 I  COMMAND  [conn59] command ipv6bindings_1.ipv6bindings command: find { find: "ipv6bindings", filter: { _id: "3101:0000:0003:94da" }, limit: 1, singleBatch: true, $db: "ipv6bindings_1", $clusterTime: { clusterTime: Timestamp(1654972910, 279), signature: { hash: BinData(0, 8E670E972822B0FF4FB23BD426B9847B6A7521F4), keyId: 7072624733400858628 } }, lsid: { id: UUID("0aea1d8a-3134-4b7c-8978-6861a3a16626") }, $readPreference: { mode: "nearest" } } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:519 locks:{ ReplicationStateTransition: { acquireCount: { w: 2 } }, Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } }, Mutex: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 5926ms

  • The high query response decreased (~1 sec) in the third occurrence, given below:

2022-06-11T19:51:42.497+0000 I  WRITE    [conn85] update drasessions_1.drasessions command: { q: { _id: { sessionid: "ClpGx1:172.16.241.111:5022:1654866656:0010848441" } }, u: { $set: { nextEvalTime: new Date(1655067101292), staleSessionExpiryCount: 3 } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:1 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2 } }, ReplicationStateTransition: { acquireCount: { w: 2 } }, Global: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 2 } }, Mutex: { acquireCount: { r: 2 } } } flowControl:{ acquireCount: 2, timeAcquiringMicros: 1 } storage:{} 1203ms

Note: the high query responses occur at random intervals, varying from 1 sec to 6 sec.
