|
Hi Christopher,
We tested with 4.2 and 4.4, and we again found the same issue on the 4.2 member.
This time we also collected disk read/write statistics, but we did not find any high disk usage.
Kindly reopen the case, and please provide a link to upload the supporting logs.
Thanks,
Kapil
|
|
Hi Kapil,
Thank you for your patience. We have looked through your data and have not found information that implicates the mongod in your latency issues. Unfortunately, at this time that means we're unable to determine the cause of the behavior from the given information.
Some points that were interesting:
- In all cases you presented, there is high, unexplained disk usage (not from the mongod) that eventually leads the mongod to start queueing for a few seconds until it resolves itself. This is unexplained because it does not correlate with activity in the journal (the WiredTiger log) or the block manager.
- Point A: 2022-07-05T00:45:48.308Z

- The data you provided comparing 4.0 and 4.2 was mentioned as being from two separate deployments ("sites" in your case), so it is possible there are differences between these two that could explain the issue.
- In your 4.2 data, I see cyclic high disk usage every 15 minutes from some unexplained process that leads to the mongod queueing. I don't see this in your 4.0 data (and I see multiple extra drives on your 4.0 instance, which further suggests your sites vary in some other way that could explain the problem).

I would suggest an investigation into what may be occurring on this system to cause this, since it is not evident that it is coming from the mongod. As it stands, I am going to close this, because it appears something else on the system is driving this activity every 15 minutes.
If you like, you can take this discussion further by asking our community for help by posting on the MongoDB Developer Community Forums.
If the discussion there leads you to suspect a bug in the MongoDB server, then we'd want to investigate it as a possible bug here in the SERVER project. Specifically in this case, that would also entail figuring out the source of your disk activity here.
Regards,
Christopher
|
|
Hi Christopher,
Just a gentle reminder, please have a look.
Regards,
Kapil
|
|
Thanks very much, Christopher, for the quick response and for providing the link.
I have uploaded combinedmongologs.tar.gz to the portal. Kindly let us know your observations.
Regards,
Kapil
|
|
Hi Kapil,
I've gone ahead and created another Amphora Portal link for you to upload to.
As for not being able to collect gdb output, I would also suggest the SIGUSR2 signal, but I believe this is only supported in newer versions of MongoDB. The backtrace would clarify what is actually being worked on, so without it I'll have to infer what bad behavior is going on. I can take another look at your next set of logs at the timestamp you specified and see if I can discern something interesting.
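A rough sketch of that approach, assuming a Linux host and a MongoDB version new enough to log backtraces on SIGUSR2 (roughly 4.4 and later); the PID and log path below are placeholders and should be adjusted to the affected member:
# Ask the mongod to write backtraces for all of its threads into its log,
# then look for the backtrace entries around that moment.
kill -USR2 27401                                            # PID of the affected mongod (from your top output)
sleep 2
grep -i backtrace /path/to/mongo-27045.log | tail -n 50     # adjust to that member's --logpath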
Christopher
|
|
Hi Christopher,
We did the testing again and got a total of 3 occurrences of the issue (logs collected for 2).
This time we also collected additional logs (top output, journalctl logs, dmesg, mongotop, and mongostat).
Please find the details of the logs, along with our observations, below:
Occurrence 1: the mongo server did not respond at the Aug 3 14:45:36 timestamp, which created multiple problems (high query response in the next few seconds; the client side got a disconnect message and tried to reconnect, which put pressure on mongo and increased its CPU usage).
- supported logs1: in the mongo logs (name: mongo-27045.log.issue1.gz), we observed that the 3 timestamps (14:45:35, 14:45:36, 14:45:37) were not in order; 14:45:37 came before 14:45:35 and 14:45:36 in some instances, as shown below:
2022-08-03T14:45:35.971+0000 D2 QUERY [conn283] Relevant index 0 is kp: { id: 1 } unique name: '(_id, )' io: { v: 2, key: { _id: 1 }, name: "id", ns: "imsiApnbindings_1.imsiApnbindings" }
2022-08-03T14:45:37.553+0000 D2 QUERY [conn283] Only one plan is available; it will be run but will not be cached. ns: imsiApnbindings_1.imsiApnbindings query: { _id: ... } sort: {} projection: {}, planSummary: IXSCAN { _id: 1 }
2022-08-03T14:45:37.553+0000 I NETWORK [conn81] end connection [2606:ae00:3001:8311:172:16:244:ae]:54464 (748 connections now open)
2022-08-03T14:45:37.553+0000 I NETWORK [conn19] end connection [2606:ae00:3001:8311:172:16:244:bf]:46106 (747 connections now open)
2022-08-03T14:45:35.972+0000 I COMMAND [conn43] command admin.$cmd command: isMaster { ismaster: 1, $db: "admin", $clusterTime: { clusterTime: ... } } numYields:0 reslen:1072 locks:{} protocol:op_msg 0ms
2022-08-03T14:45:35.972+0000 D2 QUERY [conn1093] Relevant index 0 is kp: { id: 1 } unique name: '(_id, )' io: { v: 2, key: { _id: 1 }, name: "id", ns: "imsiApnbindings_2.imsiApnbindings" }
2022-08-03T14:45:35.973+0000 D2 QUERY [conn1034] Relevant index 0 is kp: { id: 1 } unique name: '(_id, )' io: { v: 2, key: { _id: 1 }, name: "id", ns: "imsiApnbindings_2.imsiApnbindings" }
2022-08-03T14:45:35.974+0000 D2 QUERY [conn1146] Relevant index 0 is kp: { id: 1 } unique name: '(_id, )' io: { v: 2, key: { _id: 1 }, name: "id", ns: "imsiApnbindings_2.imsiApnbindings" }
2022-08-03T14:45:37.566+0000 I NETWORK [conn66] end connection [2606:ae00:3001:8311:172:16:244:c2]:35400 (737 connections now open)
2022-08-03T14:45:36.001+0000 I COMMAND [conn21] command admin.$cmd command: isMaster { ismaster: 1, $db: "admin", $clusterTime: { clusterTime: Timestamp(1659537935, 1283), signature: { hash: BinData(0, 04438EF10586291946AB881A538C20917AA3EDB7), keyId: 7126121256901935109 } } } numYields:0 reslen:1072 locks:{} protocol:op_msg 0ms
2022-08-03T14:45:35.998+0000 I COMMAND [conn46] command admin.$cmd command: isMaster { ismaster: 1, $db: "admin", $clusterTime: { clusterTime: Timestamp(1659537935, 987), signature: { hash: BinData(0, 04438EF10586291946AB881A538C20917AA3EDB7), keyId: 7126121256901935109 } } } numYields:0 reslen:1072 locks:{} protocol:op_msg 0ms
2022-08-03T14:45:36.001+0000 D2 QUERY [conn1408] Relevant index 0 is kp: { id: 1 } unique name: '(_id, )' io: { v: 2, key: { _id: 1 }, name: "id", ns: "imsiApnbindings_2.imsiApnbindings" }
2022-08-03T14:45:36.005+0000 I COMMAND [conn45] command admin.$cmd command: isMaster { ismaster: 1, $db: "admin", $clusterTime: { clusterTime: Timestamp(1659537935, 1158), signature: { hash: BinData(0, 04438EF10586291946AB881A538C20917AA3EDB7), keyId: 7126121256901935109 } } } numYields:0 reslen:1072 locks:{} protocol:op_msg 0ms
2022-08-03T14:45:37.566+0000 I NETWORK [conn46] end connection
- supported logs2: name: mongostat27045.txt: the 14:45:36 timestamp is missing
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 4 749 715 25 147|0 1.7% 77.1% 0 6.07G 3.54G 0|1 1|3 847k 1.98m 764 rs-app_shard-imsi-1 PRI Aug 3 14:45:29.419
*0 5 633 795 27 149|0 1.7% 77.2% 0 6.07G 3.54G 0|0 1|0 792k 2.16m 764 rs-app_shard-imsi-1 PRI Aug 3 14:45:30.414
*0 8 747 862 26 150|0 1.7% 77.2% 0 6.07G 3.54G 0|0 1|0 900k 2.32m 764 rs-app_shard-imsi-1 PRI Aug 3 14:45:31.416
*0 5 789 721 26 175|0 1.8% 77.2% 0 6.07G 3.54G 0|0 1|3 881k 2.26m 764 rs-app_shard-imsi-1 PRI Aug 3 14:45:32.418
*0 3 763 734 25 149|0 1.8% 77.2% 0 6.07G 3.54G 0|1 1|1 856k 2.11m 764 rs-app_shard-imsi-1 PRI Aug 3 14:45:33.422
*0 3 535 587 27 150|0 1.8% 77.3% 0 6.07G 3.55G 0|0 1|1 653k 1.80m 764 rs-app_shard-imsi-1 PRI Aug 3 14:45:34.416
*0 4 782 708 26 186|0 1.8% 77.3% 0 6.07G 3.55G 0|1 1|0 869k 2.19m 764 rs-app_shard-imsi-1 PRI Aug 3 14:45:35.415
*0 4 360 298 7 65|0 1.8% 77.3% 0 6.07G 3.55G 0|391 5|128 382k 585k 764 rs-app_shard-imsi-1 PRI Aug 3 14:45:37.579
*0 4 136 113 18 1287|0 1.9% 77.3% 0 5.83G 3.60G 0|4 1|53 523k 4.22m 490 rs-app_shard-imsi-1 PRI Aug 3 14:45:37.798
*0 6 1378 646 19 3142|0 1.9% 77.3% 0 6.03G 3.62G 0|0 1|43 1.79m 4.92m 724 rs-app_shard-imsi-1 PRI Aug 3 14:45:38.524
- supported logs3: name: mongotop27045.txt: the 14:45:36 timestamp is missing
ns total read write 2022-08-03T14:45:34Z
imsiApnbindings_1.imsiApnbindings 948ms 0ms 948ms
imsiApnbindings_2.imsiApnbindings 801ms 0ms 800ms
local.oplog.rs 13ms 13ms 0ms
admin.system.keys 0ms 0ms 0ms
admin.system.roles 0ms 0ms 0ms
admin.system.users 0ms 0ms 0ms
admin.system.version 0ms 0ms 0ms
config.system.sessions 0ms 0ms 0ms
config.transactions 0ms 0ms 0ms
imsiApnbindings.imsiApnbindings 0ms 0ms 0ms
ns total read write 2022-08-03T14:45:37Z
imsiApnbindings_2.imsiApnbindings 644ms 0ms 644ms
imsiApnbindings_1.imsiApnbindings 600ms 0ms 600ms
local.oplog.rs 18ms 18ms 0ms
admin.system.keys 0ms 0ms 0ms
admin.system.roles 0ms 0ms 0ms
admin.system.users 0ms 0ms 0ms
admin.system.version 0ms 0ms 0ms
config.system.sessions 0ms 0ms 0ms
config.transactions 0ms 0ms 0ms
imsiApnbindings.imsiApnbindings 0ms 0ms 0ms
ns total read write 2022-08-03T14:45:38Z
imsiApnbindings_2.imsiApnbindings 458048ms 1452ms 456595ms
imsiApnbindings_1.imsiApnbindings 410508ms 5351ms 405157ms
local.oplog.rs 25ms 25ms 0ms
msisdnApnbindings.msisdnApnbindings 8ms 8ms 0ms
imsiApnbindings.imsiApnbindings 6ms 6ms 0ms
- supported logs3: we executed a script that collects the number of threads and the number of file descriptors for the mongod primary process. Script details are given below:
#!/bin/bash
# Append the current thread count and open file descriptor count of the mongod
# process (PID 27401) to thread.txt once per second.
while true; do
    printf '%s -> %s\n' "$(date '+%Y-%m-%d %H:%M:%S:%3N')" "$(ps hH p 27401 | wc -l && ls -l /proc/27401/fd | wc -l)" >> thread.txt 2>&1
    sleep 1
done
output file name: thread.txt: the 14:45:37 timestamp is missing
2022-08-03 14:45:34 -> 818
1628
2022-08-03 14:45:35 -> 820
1630
2022-08-03 14:45:36 -> 820
1632
2022-08-03 14:45:38 -> 838
1668
2022-08-03 14:45:39 -> 1412
2814
2022-08-03 14:45:40 -> 1412
2814
- supported logs4: name: top_vm_output.log
top - 14:45:36 up 56 min, 1 user, load average: 9.93, 8.67, 7.39
Tasks: 379 total, 1 running, 285 sleeping, 0 stopped, 0 zombie
%Cpu(s): 30.0 us, 8.3 sy, 0.0 ni, 40.0 id, 18.9 wa, 0.0 hi, 2.9 si, 0.0 st
KiB Mem : 65969880 total, 36489024 free, 17296064 used, 12184792 buff/cache
KiB Swap: 1046524 total, 1046524 free, 0 used. 40957040 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27401 root 5 -15 6367796 3.549g 40296 S 65.2 5.6 40:41.11 mongod
3570 root 20 0 4792 1656 1372 D 55.3 0.0 0:28.48 gzip
13775 root 5 -15 5629592 3.483g 39692 S 14.9 5.5 11:11.79 mongod
13792 root 5 -15 5612844 3.473g 39568 S 14.9 5.5 11:35.67 mongod
13800 root 5 -15 5648408 3.509g 39656 S 14.6 5.6 11:06.83 mongod
1369 root 20 0 2909596 101380 52392 S 9.9 0.2 8:23.87 dockerd
13802 root 5 -15 1976632 138672 38204 S 2.3 0.2 1:17.86 mongod
8176 root 20 0 687072 16964 9080 S 2.0 0.0 0:42.09 node_exporter
13284 root 20 0 183120 59008 41724 S 2.0 0.1 1:11.44 consul
–
top - 14:45:40 up 56 min, 1 user, load average: 9.78, 8.66, 7.39
Tasks: 381 total, 3 running, 283 sleeping, 0 stopped, 0 zombie
%Cpu(s): 39.0 us, 12.1 sy, 0.0 ni, 33.2 id, 10.3 wa, 0.0 hi, 5.5 si, 0.0 st
KiB Mem : 65969880 total, 36228116 free, 17424420 used, 12317344 buff/cache
KiB Swap: 1046524 total, 1046524 free, 0 used. 40724100 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27401 root 5 -15 6968176 3.660g 40296 S 163.9 5.8 40:47.06 mongod
3570 root 20 0 4792 1656 1372 R 56.2 0.0 0:30.52 gzip
13792 root 5 -15 5612844 3.473g 39568 S 16.0 5.5 11:36.25 mongod
13775 root 5 -15 5629592 3.483g 39692 S 15.7 5.5 11:12.36 mongod
13800 root 5 -15 5648408 3.508g 39656 S 15.2 5.6 11:07.38 mongod
1369 root 20 0 2909596 101380 52392 S 11.6 0.2 8:24.29 dockerd
13284 root 20 0 183120 59008 41724 S 2.5 0.1 1:11.53 consul
13802 root 5 -15 1976632 138672 38204 S 2.5 0.2 1:17.95 mongod
13779 root 5 -15 1698528 108160 37556 S 1.9 0.2 0:50.10 mongod
- supported logs5: name: jornalctlissue1.logs
- supported logs6: kernel logs name: dmesg.txt
- supported logs7: diagnostic.data
Occurrence 2: the same behavior was observed; the server did not respond for the 16:00:36, 16:00:37, and 16:00:38 timestamps.
- logs1: name: mongo-27045.log.issue2.gz: the 16:00:39 timestamp came before 16:00:38 and 16:00:36 entries multiple times:
2022-08-03T16:00:36.128+0000 D2 QUERY [conn2189] Relevant index 0 is kp: { id: 1 } unique name: '(_id, )' io: { v: 2, key: { _id: 1 }, name: "id", ns: "imsiApnbindings_1.imsiApnbindings" }
2022-08-03T16:00:36.128+0000 D2 QUERY [conn2189] Only one plan is available; it will be run but will not be cached. ns: imsiApnbindings_1.imsiApnbindings query: { _id: ... } sort: {} projection: {}, planSummary: IXSCAN { _id: 1 }
2022-08-03T16:00:36.128+0000 D2 QUERY [conn2557] Relevant index 0 is kp: { id: 1 } unique name: '(_id, )' io: { v: 2, key: { _id: 1 }, name: "id", ns: "imsiApnbindings_1.imsiApnbindings" }
2022-08-03T16:00:39.091+0000 D2 QUERY [conn2557] Only one plan is available; it will be run but will not be cached. ns: imsiApnbindings_1.imsiApnbindings query: { _id: ... } sort: {} projection: {}, planSummary: IXSCAN { _id: 1 }
2022-08-03T16:00:36.128+0000 D2 QUERY [conn2209] Relevant index 0 is kp: { id: 1 } unique name: '(_id, )' io: { v: 2, key: { _id: 1 }, name: "id", ns: "imsiApnbindings_2.imsiApnbindings" }
2022-08-03T16:00:39.091+0000 D2 QUERY [conn2209] Only one plan is available; it will be run but will not be cached. ns: imsiApnbindings_2.imsiApnbindings query: { _id: ... } sort: {} projection: {}, planSummary: IXSCAN { _id: 1 }
2022-08-03T16:00:36.129+0000 D2 QUERY [conn2212] Relevant index 0 is kp: { id: 1 } unique name: '(_id, )' io: { v: 2, key: { _id: 1 }, name: "id", ns: "imsiApnbindings_1.imsiApnbindings" }
2022-08-03T16:00:36.130+0000 D2 QUERY [conn2462] Relevant index 0 is kp: { id: 1 } unique name: '(_id, )' io: { v: 2, key: { _id: 1 }, name: "id", ns: "imsiApnbindings_1.imsiApnbindings" }
- logs2: the 16:00:36, 16:00:37, and 16:00:38 timestamps are missing in mongostat27045.txt
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 5 705 652 27 168|0 2.4% 55.6% 0 6.42G 3.72G 0|0 1|1 797k 2.07m 762 rs-app_shard-imsi-1 PRI Aug 3 16:00:29.415
*0 4 731 687 26 158|0 2.4% 55.6% 0 6.42G 3.72G 0|0 1|0 825k 2.05m 762 rs-app_shard-imsi-1 PRI Aug 3 16:00:30.415
*0 7 747 765 26 142|0 2.5% 55.6% 0 6.42G 3.72G 0|1 1|0 862k 2.27m 762 rs-app_shard-imsi-1 PRI Aug 3 16:00:31.417
*0 4 730 708 27 145|0 2.5% 55.6% 0 6.42G 3.72G 0|0 1|0 829k 2.16m 762 rs-app_shard-imsi-1 PRI Aug 3 16:00:32.414
*0 2 719 623 26 146|0 2.5% 55.7% 0 6.42G 3.72G 0|3 1|0 790k 1.93m 762 rs-app_shard-imsi-1 PRI Aug 3 16:00:33.422
*0 3 713 729 27 212|0 2.5% 55.7% 0 6.42G 3.72G 0|0 1|0 831k 2.27m 762 rs-app_shard-imsi-1 PRI Aug 3 16:00:34.416
*0 4 761 703 25 168|0 2.6% 55.7% 0 6.42G 3.72G 0|1 1|0 849k 2.10m 762 rs-app_shard-imsi-1 PRI Aug 3 16:00:35.414
*0 1 196 158 6 37|0 2.6% 55.7% 0 6.42G 3.72G 0|196 4|128 211k 422k 762 rs-app_shard-imsi-1 PRI Aug 3 16:00:39.121
*0 28 1594 716 0 358|0 2.6% 55.7% 0 6.36G 3.76G 0|118 1|128 1.51m 2.76m 570 rs-app_shard-imsi-1 PRI Aug 3 16:00:39.227
*0 *0 58 21 31 1610|0 2.6% 55.7% 0 5.85G 3.79G 0|0 1|0 393k 6.58m 210 rs-app_shard-imsi-1 PRI Aug 3 16:00:39.416
- logs3: the 16:00:36, 16:00:37, and 16:00:38 timestamps are missing in mongotop27045.txt
ns total read write 2022-08-03T16:00:35Z
imsiApnbindings_1.imsiApnbindings 379ms 0ms 379ms
imsiApnbindings_2.imsiApnbindings 376ms 0ms 376ms
local.oplog.rs 13ms 13ms 0ms
admin.system.keys 0ms 0ms 0ms
admin.system.roles 0ms 0ms 0ms
admin.system.users 0ms 0ms 0ms
admin.system.version 0ms 0ms 0ms
config.system.sessions 0ms 0ms 0ms
config.transactions 0ms 0ms 0ms
imsiApnbindings.imsiApnbindings 0ms 0ms 0ms
ns total read write 2022-08-03T16:00:39Z
imsiApnbindings_2.imsiApnbindings 77ms 0ms 77ms
imsiApnbindings_1.imsiApnbindings 53ms 0ms 53ms
local.oplog.rs 3ms 3ms 0ms
admin.system.keys 0ms 0ms 0ms
admin.system.roles 0ms 0ms 0ms
admin.system.users 0ms 0ms 0ms
admin.system.version 0ms 0ms 0ms
config.system.sessions 0ms 0ms 0ms
config.transactions 0ms 0ms 0ms
imsiApnbindings.imsiApnbindings 0ms 0ms 0ms
ns total read write 2022-08-03T16:00:40Z
imsiApnbindings_2.imsiApnbindings 805523ms 5389ms 800133ms
imsiApnbindings_1.imsiApnbindings 773690ms 10655ms 763034ms
local.oplog.rs 16ms 16ms 0ms
admin.system.keys 0ms 0ms 0ms
- logs4: top_vm_output.log
top - 16:00:36 up 2:11, 3 users, load average: 10.28, 8.53, 8.06
Tasks: 393 total, 1 running, 294 sleeping, 3 stopped, 0 zombie
%Cpu(s): 35.0 us, 7.8 sy, 0.0 ni, 45.7 id, 8.4 wa, 0.0 hi, 3.1 si, 0.0 st
KiB Mem : 65969880 total, 26656104 free, 19064012 used, 20249764 buff/cache
KiB Swap: 1046524 total, 1046524 free, 0 used. 37792588 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
27401 root 5 -15 6728284 3.724g 40360 S 74.2 5.9 108:10.39 mongod
4489 root 20 0 4792 1572 1288 D 70.5 0.0 0:28.74 gzip
13792 root 5 -15 6111488 3.921g 39568 S 59.6 6.2 29:27.32 mongod
13800 root 5 -15 6077460 3.894g 39656 S 14.2 6.2 28:30.81 mongod
13775 root 5 -15 6128068 3.936g 39696 S 13.9 6.3 29:23.52 mongod
1369 root 20 0 2909852 101944 52524 S 8.3 0.2 20:01.65 dockerd
13802 root 5 -15 1991472 153300 39548 S 2.3 0.2 3:21.02 mongod
8176 root 20 0 687072 16964 9080 S 2.0 0.0 1:46.28 node_exporter
13284 root 20 0 183120 59240 41792 S 1.7 0.1 3:05.12 consul
–
top - 16:00:40 up 2:11, 3 users, load average: 10.28, 8.53, 8.06
Tasks: 395 total, 3 running, 293 sleeping, 3 stopped, 0 zombie
%Cpu(s): 31.9 us, 6.8 sy, 0.0 ni, 35.4 id, 23.2 wa, 0.0 hi, 2.8 si, 0.0 st
KiB Mem : 65969880 total, 26688584 free, 19027816 used, 20253480 buff/cache
KiB Swap: 1046524 total, 1046524 free, 0 used. 37828984 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13792 root 5 -15 6111488 3.921g 39568 S 104.4 6.2 29:30.86 mongod
27401 root 5 -15 6199976 3.789g 40360 S 33.0 6.0 108:11.51 mongod
13775 root 5 -15 6128068 3.936g 39696 S 22.7 6.3 29:24.29 mongod
4489 root 20 0 4792 1572 1288 R 20.9 0.0 0:29.45 gzip
13800 root 5 -15 6077460 3.894g 39656 S 19.2 6.2 28:31.46 mongod
1369 root 20 0 2909852 102104 52524 S 6.8 0.2 20:01.88 dockerd
13284 root 20 0 183120 59240 41792 S 2.7 0.1 3:05.21 consul
13802 root 5 -15 1991472 153300 39548 S 2.7 0.2 3:21.11 mongod
13779 root 5 -15 1698528 108160 37556 S 1.8 0.2 2:10.46 mongod
- logs5: jornalctlissue2.logs
- logs6: thread.txt
2022-08-03 16:00:34 -> 816
1624
2022-08-03 16:00:36 -> 816
1624
2022-08-03 16:00:37 -> 816
1619
2022-08-03 16:00:40 -> 366
724
2022-08-03 16:00:41 -> 1207
- logs7 and 8: dmesg.txt,diagnostic.data
NOTE: the logs for both occurrences are combined as combinedmongologs.tar.gz and will be uploaded to the uploader portal.
Kindly provide a new uploader portal link, as the existing one is showing as expired for me.
Thanks,
Kapil
|
|
Hi Christopher,
We tried to run the gdb command several times, but whenever we run it, all connections on the primary are dropped. Please find some mongo logs below as a reference from when we ran the gdb command:
2022-07-27T10:33:30.814+0000 I CONNPOOL [RS] Dropping all pooled connections to [2606:ae00:3001:8311:172:16:244:bb]:27023 due to NetworkInterfaceExceededTimeLimit: Connection pool has been idle for longer than the host timeout
2022-07-27T10:35:11.262+0000 I REPL [replexec-14] Member [2606:ae00:3001:8311:172:16:244:11]:27023 is now in state RS_DOWN
2022-07-27T10:35:11.262+0000 I REPL [replexec-14] Member [2606:ae00:3001:8311:172:16:244:9]:27023 is now in state RS_DOWN
2022-07-27T10:35:11.262+0000 I REPL [replexec-14] Member [2606:ae00:3001:8311:172:16:244:57]:27023 is now in state RS_DOWN
2022-07-27T10:35:11.262+0000 I REPL [replexec-14] Member [2606:ae00:3001:8311:172:16:244:ab]:27023 is now in state RS_DOWN
2022-07-27T10:35:11.262+0000 I REPL [replexec-14] can't see a majority of the set, relinquishing primary
2022-07-27T10:35:11.262+0000 I REPL [replexec-14] Stepping down from primary in response to heartbeat
2022-07-27T10:35:11.274+0000 I REPL [replexec-14] transition to SECONDARY from PRIMARY
2022-07-27T10:35:11.277+0000 I NETWORK [replexec-14] Skip closing connection for connection # 1691
2022-07-27T10:35:11.277+0000 I NETWORK [replexec-14] Skip closing connection for connection # 1690
2022-07-27T10:35:11.277+0000 I NETWORK [replexec-14] Skip closing connection for connection # 1689
2022-07-27T10:35:11.288+0000 I NETWORK [conn1431] Error sending response to client: SocketException: Broken pipe. Ending connection from [2606:ae00:3001:8311:172:16:244:bd]:43456 (connection id: 1431)
2022-07-27T10:35:11.288+0000 I NETWORK [conn1602] Error sending response to client: SocketException: Broken pipe. Ending connection from [2606:ae00:3001:8311:172:16:244:c2]:51016 (connection id: 1602)
So we are unable to collect gdb output because of the above issue. Could you please suggest an alternative?
Thanks,
Kapil
|
|
Hi Christopher,
I am still working on the gdb output. In the meantime, we made the observation below: whenever we get high query response times, the arw value in mongostat increases (usually its value is 1 or 2), which increases load and CPU on the mongo member (top output, mongostat, and a log message are given below for reference).
Instance: 14:01:23
- Mongo log message: /data/mongod-node/db/mongo-27039.log:2022-07-12T14:01:23.462+0000 I COMMAND [conn2748] command ipv6bindings_1.ipv6bindings command: find { find: "ipv6bindings", filter: { _id: "3201:0000:0000:1452" }, limit: 1, singleBatch: true, $db: "ipv6bindings_1", $clusterTime: { clusterTime: ... } } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:514 locks:{ ReplicationStateTransition: { acquireCount: ... }, Global: { acquireCount: { r: 2 } }, Database: { acquireCount: ... } } storage:{} protocol:op_msg 818ms
top - 14:01:23 up 80 days, 22:06, 1 user, load average: 29.54, 15.91, 12.82
Tasks: 378 total, 2 running, 278 sleeping, 0 stopped, 0 zombie
%Cpu(s): 53.9 us, 8.1 sy, 0.0 ni, 30.5 id, 0.0 wa, 0.0 hi, 7.4 si, 0.0 st
KiB Mem : 65969880 total, 30794112 free, 22055376 used, 13120392 buff/cache
KiB Swap: 1046524 total, 992300 free, 54224 used. 33317032 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8540 root 5 -15 6569936 3.722g 42096 S 182.5 5.9 6946:53 mongod
8536 root 5 -15 6585680 3.783g 41836 S 161.9 6.0 6973:49 mongod
8538 root 5 -15 6405456 3.947g 41620 S 49.7 6.3 5158:44 mongod
8537 root 5 -15 6526672 3.983g 42000 S 45.7 6.3 5151:51 mongod
18232 root 20 0 963048 146448 9436 S 16.9 0.2 14147:33 weaver
25927 root 20 0 217616 115828 62588 S 14.9 0.2 1974:22 consul
- Mongostat(for 27039 port):
host insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
[2606:ae00:3001:8311:172:16:244:e]:27039 *0 448 954 299 376 453|0 1.5% 76.4% 0 6.26G 3.72G 0|0 1|0 1.26m 1.63m 317 rs-sess-ipv6-shard-20 PRI Jul 12 14:01:22.424
localhost:27039 0 154 311 53 68 132|0 1.4% 76.3% 0 6.26G 3.72G *0|52 83|128 307k 393k 317 rs-sess-ipv6-shard-20 PRI Jul 12 14:01:23.374
[2606:ae00:3001:8311:172:16:244:e]:27039 0 141 277 37 41 98|0 1.4% 76.3% 0 6.26G 3.72G *24|77 11|39 253k 338k 317 rs-sess-ipv6-shard-20 PRI Jul 12 14:01:23.529
localhost:27039 *0 596 1525 440 305 391|0 1.5% 76.4% 0 6.26G 3.72G 0|0 1|1 1.55m 2.18m 317 rs-sess-ipv6-shard-20 PRI Jul 12 14:01:24.374
[2606:ae00:3001:8311:172:16:244:e]:27039 *0 669 1703 509 363 460|0 1.5% 76.4% 0 6.26G 3.72G 0|1 1|0 1.76m 2.47m 317 rs-sess-ipv6-shard-20 PRI Jul 12 14:01:24.422
localhost:27039 *0 574 915 302 355 430|0 1.5% 76.4% 0 6.26G 3.72G 0|0 1|1 1.24m 1.65m 317 rs-sess-ipv6-shard-20 PRI Jul 12 14:01:25.375
Information about our architecture:
Here we use the sharding db as a metadata db that contains bucket and shard information. The application connects to the sharding db, gets information about the shards, and, based on bucket ids, stores data in a uniform manner across all the shards.
Sharding db member details (running inside a container):
mongod --keyFile=/mongodb.key --ipv6 --bind_ip_all --replSet imsi-msisdn-sharddb --port 27019 --dbpath=/data/db/sharddb-27019 --logpath /data/db/mongo-27019.log --logappend --logRotate reopen
Actual shard information:
We have 2 sites; each site has 7 VMs, and each VM has 4 mongod data members and 3 arbiters running inside a container.
Each VM has 64 GB RAM; each mongod data member has 7.6 GB of tmpfs as storage and 4.4 GB as WiredTiger cache.
ps -efww inside one of the containers (each container has 1 primary and 3 secondaries, and all members are from different replica sets):
root 993 1 1 Jul01 ? 04:46:57 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27021 --dbpath=/data/db/wt-27021 --replSet rs-shard-2 --quiet --slowms 500 --logpath /data/db/mongo-27021.log --setParameter diagnosticDataCollectionEnabled=false --logappend --oplogSize 100 --logRotate reopen --wiredTigerCacheSizeGB .26 ----arbiter
root 1034 1 1 Jul01 ? 04:37:54 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27022 --dbpath=/data/db/wt-27022 --replSet rs-shard-3 --quiet --slowms 500 --logpath /data/db/mongo-27022.log --setParameter diagnosticDataCollectionEnabled=false --logappend --oplogSize 100 --logRotate reopen --wiredTigerCacheSizeGB .26 ----arbiter
root 1227 1 1 Jul01 ? 04:49:42 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27025 --dbpath=/data/db/wt-27025 --replSet rs-shard-6 --quiet --slowms 500 --logpath /data/db/mongo-27025.log --setParameter diagnosticDataCollectionEnabled=false --logappend --oplogSize 100 --logRotate reopen --wiredTigerCacheSizeGB .26 ----arbiter
root 1265 1 30 Jul01 ? 4-04:14:17 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27028 --dbpath=/data/db/wt-27028 --replSet rs-shard-ipv6-1 --quiet --slowms 500 --logpath /data/db/mongo-27028.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40 ----datamember
root 1601 1 41 Jul01 ? 5-18:43:19 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27030 --dbpath=/data/db/wt-27030 --replSet rs-shard-ipv6-3 --quiet --slowms 500 --logpath /data/db/mongo-27030.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40 ----datamember
root 1734 1 43 Jul01 ? 6-01:01:29 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27031 --dbpath=/data/db/wt-27031 --replSet rs-shard-ipv6-4 --quiet --slowms 500 --logpath /data/db/mongo-27031.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40 ----datamember
root 32116 1 27 Jul04 ? 2-22:35:32 mongod --keyFile=/mongodb.key --enableMajorityReadConcern false --ipv6 --bind_ip_all --port 27029 --dbpath=/data/db/wt-27029 --replSet rs-shard-ipv6-2 --quiet --slowms 500 --logpath /data/db/mongo-27029.log --setParameter diagnosticDataCollectionEnabled=true --logappend --oplogSize 3221 --logRotate reopen --wiredTigerCacheSizeGB 4.40 ----datamember
df -hT inside one of the container:
root@mongo-s109:/# df -hT
Filesystem Type Size Used Avail Use% Mounted on
overlay overlay 97G 24G 70G 26% /
tmpfs tmpfs 64M 0 64M 0% /dev
tmpfs tmpfs 32G 0 32G 0% /sys/fs/cgroup
shm tmpfs 64M 0 64M 0% /dev/shm
/dev/sda3 ext4 97G 24G 70G 26% /w
tmpfs tmpfs 500M 301M 200M 61% /data/db/wt-27025 ----arbiter
tmpfs tmpfs 500M 301M 200M 61% /data/db/wt-27022 ----arbiter
tmpfs tmpfs 7.6G 1.5G 6.2G 20% /data/db/wt-27031 ----datamember
tmpfs tmpfs 7.6G 1.5G 6.2G 20% /data/db/wt-27029 ----datamember
tmpfs tmpfs 7.6G 1.5G 6.2G 20% /data/db/wt-27028 ----datamember
tmpfs tmpfs 7.6G 1.5G 6.2G 20% /data/db/wt-27030 ----datamember
tmpfs tmpfs 500M 301M 200M 61% /data/db/wt-27021 ----arbiter
NOTE: We are running site 1 on 4.0.27 and site 2 on 4.2.20. No high query response issue was seen on site 1; we are seeing it only on site 2. Out of the 4 members, some are from site 1 and some are from site 2, and the primaries are distributed equally across both sites.
Thanks,
Kapil
|
|
Kapil,
That issue should be fixed by SERVER-56054, which affects 4.2.15, below the version you're running, so I'm not sure it is fully related. I'd also be surprised to see it resurface if you've already resolved the glibc issue yourself.
Just to revisit some items, could you please clarify your topology? You mentioned you are running a sharded cluster, but are not using a mongos. This isn't a standard/recommended way of setting up a sharded cluster. Please explain:
- How many shards are in your sharded cluster?
- If you are not using a mongos, how are you handling routing?
- You have a lot of memory available on the system running the primary - do you share the tmpfs with this memory pool or with any other services?
Looking at what we have available, it looks like this is being used as a replica set only, so I would want to confirm this before making any assumptions.
In addition, because we are not seeing anything that directly implicates the mongod at the time of latency yet, could you please get us the following information:
Tar up the information in the following directories and upload it (an example command follows this list):
- /var/log/dmesg
- /var/log/messages
- /var/log/syslog
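For example, something along these lines (GNU tar assumed; paths that do not exist on your distribution can simply be dropped):
# Bundle the system-level logs for upload; --ignore-failed-read lets tar skip
# any of the listed paths that are absent on this host.
tar czf system-logs.tar.gz --ignore-failed-read /var/log/dmesg /var/log/messages /var/log/syslog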
Re-acquire the logs and FTDC without --quiet (and at the same log level as before) - it looks like some diagnostic information may have been omitted as a result.
Get a stack trace at the time of the latency (this may be difficult, but we're running out of options to see what is truly happening at that time):
# collect stack trace samples with gdb
|
gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' >gdb.txt
|
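If timing a single snapshot is difficult, one possible variation (just a sketch; note that gdb briefly pauses the process each time it attaches) is to take a few samples in a loop around the window where the latency usually appears:
# Collect three stack-trace samples a few seconds apart; set MONGOD_PID to the
# PID of the affected mongod first (pidof may return several PIDs on this host).
MONGOD_PID=12345    # placeholder
for i in 1 2 3; do
    gdb -p "$MONGOD_PID" -batch -ex 'thread apply all bt' > "gdb-sample-$i.txt" 2>&1
    sleep 5
done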
Christopher
|
|
Hi Christopher,
I have also uploaded logs from before 00:45:56 (file name: mongo-27029.log_before45_56.gz).
Thanks for your analysis. I have the queries below regarding your recommendations:
- consider the use of an additional drive to separate your data and journaling
<R> Here we are using tmpfs (RAM) for both data storage and journaling; we are not using disk.
- tune your kernel dirty threshold
<R> We are using the same environment for the 4.0.27 site as well, and we tried the same repro scenario (power off and on) 3 times, but we did not get high query response on 4.0.27.
As we are running mongod in a Docker environment, is there any specific tuning we need to do for it?
- check for other processes that may be using your disk at the time before query latency occurs.
<R> We did not find any additional process consuming disk. Is there any possibility that mongo itself is not responding, like the issue we faced in SERVER-63402 (though we fixed that by applying the glibc fix)?
Thanks,
Kapil
|
|
Kapil,

The first thing that seems to be happening in this timeline is that something is using up system memory and maxing out your disk utilization (point A: 2022-07-05T00:45:48.011Z) BEFORE you experience latency/stalling of mongod at point B (2022-07-05T00:45:59.076Z). However, without the logs I can't discern exactly what is happening with the primary node at that time. The logs would indicate whether there was a slow query at this point that precipitated the disk utilization and subsequent stalling.
However, given that you run the same query filter at different points and with differing performance, it would be strange if the query was directly responsible here. The CPU usage/stalling of the mongod only happens after the disk receives lots of reads and writes at the same time:

Since the amount of documents scanned doesn't appear to notably change beyond your normal workload, it's strange how it would suddenly occur by re-running queries that normally run OK for you at other times.

One thing you can consider is tuning your kernel dirty threshold, since it appears to correlate with the high disk usage.
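As a rough illustration of that tuning (the values below are only examples and would need to be validated against your workload; persist them via /etc/sysctl.conf or /etc/sysctl.d/ only if they prove helpful):
# Show the current writeback thresholds, then lower them so the kernel starts
# flushing dirty pages earlier and in smaller bursts.
sysctl vm.dirty_ratio vm.dirty_background_ratio
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10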

So, your logs begin at 2022-07-05T00:45:56.003+0000 which does not include the time period at which your disk usage begins (point A: 2022-07-05T00:45:48.011Z) so I don't know what precipitated it exactly. If you could supply the logs including the time covering that period just before, that might help see if a specific query triggered it (as it should show up as a slow query). I would also keep an eye on whether there may be other processes that start using your disk around when this happens.
So, to summarize:
Possible Remediations/Workarounds:
- tune your kernel dirty threshold
- check for other processes that may be using your disk at the time before query latency occurs (a monitoring sketch follows this list)
- consider colocating your journal; however, this likely won't affect this behavior since journaling doesn't appear to be impacted by the latency.
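For the disk-activity check above, one option (assuming the sysstat package is available; iotop -b -o is an alternative) is to leave a per-process I/O logger running across the window where the latency occurs:
# Record per-process disk reads/writes once per second so the samples around
# the stall can be reviewed afterwards.
pidstat -d 1 > pidstat-disk.log 2>&1 &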
More information we could use:
- Provide the logs covering the time when hard drive disk usage increases (in this case, it was just a few seconds before your provided logs)
Regards,
Christopher
|
|
Please find my inline responses, as earlier:
"Just to clarify - you are running the same workload (and the same find queries) on the primary running 4.0, correct? Just double checking because I do not see output in your logs on 4.0 to confirm that the same queries are running."
<R> Yes, it is running the same workload. Please find mongostat output for the 4.0.27 primary member for reference:
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 59 507 217 996 1207|0 1.6% 75.6% 0 5.17G 3.17G 0|0 1|0 2.44m 2.16m 290 rs-shard-ipv6-1 PRI Jul 5 05:30:01.815
*0 55 534 199 1071 1291|0 1.6% 75.6% 0 5.17G 3.17G 0|0 1|0 2.62m 2.25m 290 rs-shard-ipv6-1 PRI Jul 5 05:30:02.796
*0 49 504 207 1007 1198|0 1.6% 75.6% 0 5.17G 3.17G 0|0 1|0 2.42m 2.09m 290 rs-shard-ipv6-1 PRI Jul 5 05:30:03.796
*0 46 523 172 943 1157|0 1.6% 75.6% 0 5.17G 3.17G 0|0 1|0 2.32m 2.06m 290 rs-shard-ipv6-1 PRI Jul 5 05:30:04.796
*0 59 547 189 971 1216|0 1.6% 75.6% 0 5.17G 3.17G 0|0 1|0 2.46m 2.18m 290 rs-shard-ipv6-1 PRI Jul 5 05:30:05.792
*0 56 504 195 969 1175|0 1.7% 75.7% 0 5.17G 3.17G 0|0 1|0 2.37m 2.05m 290 rs-shard-ipv6-1 PRI Jul 5 05:30:06.819
*0 64 517 215 1030 1255|0 1.7% 75.7% 0 5.17G 3.17G 0|0 1|0 2.52m 2.17m 290 rs-shard-ipv6-1 PRI Jul 5 05:30:07.794
*0 62 505 208 1030 1249|0 1.7% 75.7% 0 5.17G 3.17G 0|0 1|1 2.50m 2.14m 290 rs-shard-ipv6-1 PRI Jul 5 05:30:08.797
*0 56 512 197 921 1107|0 1.7% 75.7% 0 5.17G 3.17G 0|0 1|0 2.25m 2.05m 290 rs-shard-ipv6-1 PRI Jul 5 05:30:09.794
*0 51 530 197 962 1201|0 1.7% 75.7% 0 5.17G 3.17G 0|0 1|0 2.42m 2.12m 290 rs-shard-ipv6-1 PRI Jul 5 05:30:10.794
"4.2 Primary logs with mentioned verbosity "
<R> I have uploaded the requested logs. Please find the details (tar file) below:
cps@system1:~$ tar -tf mongo-27029_PRIMARY_verbose.tar.gz
mongo-27029-PRIMARY-verbose.log.gz
diagnostic.data/
diagnostic.data/metrics.2022-07-04T23-52-58Z-00000
diagnostic.data/metrics.interim
More details about the recreated issue:
VM power off time: 2022-07-04T23:51:06
VM power on time: 2022-07-04T23:52:50
issue time: 2022-07-05T00:46:05
- mongostat of 4.2.20 Primary
insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time
*0 53 513 199 826 1011|0 1.4% 61.1% 0 4.52G 2.41G 0|0 1|0 2.00m 1.99m 312 rs-shard-ipv6-2 PRI Jul 5 01:06:30.956
*0 48 500 196 902 1075|0 1.4% 61.1% 0 4.52G 2.41G 0|0 1|0 2.09m 2.05m 312 rs-shard-ipv6-2 PRI Jul 5 01:06:31.955
*0 49 510 200 844 1027|0 1.4% 61.1% 0 4.52G 2.41G 0|0 1|0 2.04m 2.04m 312 rs-shard-ipv6-2 PRI Jul 5 01:06:32.950
*0 66 468 233 867 1077|0 1.4% 61.1% 0 4.52G 2.41G 0|0 1|1 2.10m 2.03m 312 rs-shard-ipv6-2 PRI Jul 5 01:06:33.961
*0 53 465 209 853 1023|0 1.4% 61.2% 0 4.52G 2.41G 0|0 1|0 2.00m 1.99m 312 rs-shard-ipv6-2 PRI Jul 5 01:06:34.950
*0 62 504 207 935 1109|0 1.4% 61.2% 0 4.52G 2.41G 0|0 1|2 2.17m 2.16m 312 rs-shard-ipv6-2 PRI Jul 5 01:06:35.951
*0 54 472 198 875 1060|0 1.4% 61.2% 0 4.52G 2.41G 0|0 1|0 2.08m 2.00m 312 rs-shard-ipv6-2 PRI Jul 5 01:06:36.950
*0 40 494 219 864 1038|0 1.5% 61.2% 0 4.52G 2.41G 0|0 1|0 2.04m 2.05m 312 rs-shard-ipv6-2 PRI Jul 5 01:06:37.949
*0 51 467 204 845 984|0 1.5% 61.2% 0 4.52G 2.41G 0|0 1|0 1.95m 1.94m 312 rs-shard-ipv6-2 PRI Jul 5 01:06:38.950
*0 47 536 207 934 1116|0 1.5% 61.2% 0 4.52G 2.41G 0|0 1|0 2.21m 2.19m 312 rs-shard-ipv6-2 PRI Jul 5 01:06:39.950
- Suspected messages from the latest logs (as an example):
Slow WT transaction. Lifetime of SnapshotId 19177106 was 6148ms
Thanks,
Kapil
|
|
Kapil,
- Just to clarify - you are running the same workload (and the same find queries) on the primary running 4.0, correct? Just double checking because I do not see output in your logs on 4.0 to confirm that the same queries are running.
- Can you please resubmit the logs for 4.2 Primary with the following log level? There were uasserts near the end of your slow queries that were not visible in your logs that might better describe what was happening. Once you observe the issue again, we can look at that output.
- Using db.adminCommand (this will not persist across restarts), the following sets the default verbosity level to 1, query to 2, storage to 2, and storage.journal to 1:
db.adminCommand( {
    setParameter: 1,
    logComponentVerbosity: {
        verbosity: 1,
        query: { verbosity: 2 },
        storage: {
            verbosity: 2,
            journal: {
                verbosity: 1
            }
        }
    }
} )
If you find this to be too much logging, you can try just setting it to loglevel: 1 for all instead. However, the extra information up front would potentially decrease the amount of back-and-forth on this.
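As a sketch of that simpler alternative (the port here is copied from your earlier excerpts and may differ per member; add credentials if authentication is required):
# Set the default log verbosity to 1 for all components on this node;
# like the adminCommand above, this does not persist across restarts.
mongo --port 27029 admin --eval 'db.setLogLevel(1)'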
So, generally speaking, the FTDC is showing a huge CPU usage spike at the time of your find() queries which affected the entire server. Readers and writers were being queued up until a uassert was output, and then your metrics returned to normal.
Queued readers/writers on 4.2 Primary:
2022-06-10T20:03:52.016+0000 I COMMAND [conn52] command ipv6bindings_1.ipv6bindings command: find { find: "ipv6bindings", filter: { _id: "3101:0000:0009:11c1" }, limit: 1, singleBatch: true, $db: "ipv6bindings_1", $clusterTime: { clusterTime: ... }, $readPreference: { mode: "nearest" } } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:520 locks:{ ReplicationStateTransition: { acquireCount: { w: 2 } }, Global: { acquireCount: ... } } storage:{} protocol:op_msg 3441ms

It doesn't make sense, however, that your find() query would suddenly stop performing as it usually does (given that you ran other find queries with the same filter that ran <700ms).

In fact, it seems most work in general is stopping at this point, and it doesn't appear to be waiting on IO at first glance. The higher log level should help give more context around what is happening at this exact point.
|
|
Please find the inline responses (marked <R>) for the respective details:
- For your workload, please run explain(true) on the exact query/queries you are experiencing high response time issues with to get additional ExecutionStats information. The default mode is "queryPlanner". MongoDB interprets true as "allPlansExecution" and false as "queryPlanner". This will help narrow down possible reasons for the regression in response time.
<R> As our application functionality immediately updates and deletes records, those queries won't be present if we try to find them manually, and we did testing in different setups where the queries are not the same in all cases. If this information is still required, we can provide it by recreating the issue.
- The logs and FTDC of your PRIMARY node on 4.0 running the workload you are mentioning (so we can compare with the 4.2 data you submitted)
<R> I have uploaded the FTDC for the same to the upload portal --> file name: mongo-4.0.27-PRIMARY.tar.gz
- Provide some reproduction steps so we can also verify this independently:
<R> I have already added the repro steps in the comment section. Kindly check the "Troubleshooting performed" section in the first comment.
- What is the shape of your current documents in this case? Can you provide a sample document?
<R> Please find the mongostat output for the primary member.
insert query update delete getmore command
*0 22 205 58 718 841|0
*0 12 219 86 769 880|0
*0 21 200 73 651 794|0
*0 20 219 74 696 857|0
*0 20 215 68 723 848|0
*0 19 212 62 741 867|0
*0 18 211 57 715 833|0
*0 15 215 67 790 902|0
*0 17 228 66 740 886|0
*0 23 206 66 726 855|0
Sample document:
rs-shard-ipv6-2:PRIMARY> db.drasessions.findOne()
{
"_id" : {
"sessionid" : "ClpGx0:172.16.241.111:5933:1654866756:0151474820"
},
"ts" : NumberLong("1656402314141"),
"srk" : "server101.sitebstandalone",
"originHost" : "site-b-client-calipers50-gx.pcef.gx-a",
"originRealm" : "client-consumer.calipers.pcef.gx",
"destHost" : "site-b-server-calipers101-gx.pcef.gx-8",
"destRealm" : "server-consumer.calipers.pcef.gx",
"apn" : "phone",
"imsi" : "580034821",
"ipv4" : "51.0.18.214",
"ipv6" : "6206:ae00:0000:12d5",
"msisdn" : "83404821",
"nextEvalTime" : ISODate("2022-06-29T08:45:14.141Z"),
"staleSessionExpiryCount" : 3,
"staleBindingRefreshTime" : ISODate("2022-07-18T07:45:14.141Z"),
"sessionid" : "ClpGx0:172.16.241.111:5933:1654866756:0151474820",
"systemId" : "vpas-system-2",
"uuid" : "vpas-system-2252020078"
}
rs-shard-ipv6-2:PRIMARY> use ipv6bindings_1
switched to db ipv6bindings_1
rs-shard-ipv6-2:PRIMARY> db.ipv6bindings.findOne()
{
"_id" : "3101:0000:0002:dd53",
"ts" : NumberLong("1656483427273"),
"staleBindingExpiryTime" : ISODate("2022-07-09T06:17:07.274Z"),
"srk" : "serverb1.relay",
"fqdn" : "site-b-client-calipers21-gx.pcef.gx1",
"sessionid" : "ClpGx0:172.16.241.111:5021:1654866656:0159487757",
"uuid" : "vpas-system-21671833548",
"systemId" : "vpas-system-2"
}
- Do you have an index on this collection?
<R> yes, please find the index details given below:
Index details for the drasessions collection in the drasessions_1 db:
rs-shard-ipv6-2:PRIMARY> db.drasessions.getIndexes()
[
{
"v" : 2,
"key" : {
"systemId" : 1,
"nextEvalTime" : 1
},
"name" : "systemId_1_nextEvalTime_1",
"ns" : "drasessions_1.drasessions"
},
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "id",
"ns" : "drasessions_1.drasessions"
}
]
Index details for the ipv6bindings collection in the ipv6bindings_1 db:
rs-shard-ipv6-2:PRIMARY> db.ipv6bindings.getIndexes()
[
{
"v" : 2,
"key" : {
"systemId" : 1,
"staleBindingExpiryTime" : 1
},
"name" : "systemId_1_staleBindingExpiryTime_1",
"ns" : "ipv6bindings_1.ipv6bindings"
},
{
"v" : 2,
"key" : {
"source" : 1
},
"name" : "source_1",
"ns" : "ipv6bindings_1.ipv6bindings"
},
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "id",
"ns" : "ipv6bindings_1.ipv6bindings"
}
]
- Is this running on a sharded collection? If so, is this the primary shard of the collection?
<R> Yes, we are using sharded collections; the sharding db stores the shard information, and the actual data is stored on the shards (as replica sets), for which we provided logs (primary and secondaries).
- Do you observe this latency running the command on mongos or directly on the node? (Or both?)
<R> We are not using mongos; we use the sharding db. Latency is observed directly on the shard nodes.
- I don't notice consistent spikes in query response time beyond the exact timestamp you mentioned (2022-06-10 20:03 UTC). Is this issue sporadic or is it always reproducible with a specific query? I notice you have constant workloads that don't have this latency normally, and I'm curious what the difference is between the queries you are triggering the latency with versus not triggering it with.
<R> We observed 2 more instances of spikes, but logs were not added for those instances as we got the next occurrence after 28 hours; details are added in the "Additional info" section of the first comment.
The issue is reproducible, but it occurs at random intervals.
The issue is not tied to specific queries, as we also checked in different setups.
|
|
Kapil,
Thanks for your patience. After looking through your FTDC, I can observe the slow queries taking place specifically on your primary. I am not observing anything else really out of the ordinary on your secondary nodes. However, I don't have FTDC to show what your primary was behaving like before you upgraded to 4.2, so I don't really have much to compare against - and there's nothing immediately obvious as a culprit just yet.
It looks like this is happening to multiple queries and not just the one you ran explain() on. I'd like to dive deeper into that to get more information.
2022-06-10T20:03:52.009+0000 I WRITE [conn357] remove ipv6bindings_1.ipv6bindings command: { q: { _id: "6206:ae00:0000:b06b", sessionid: "ClpGx1:172.16.241.111:5934:1654866756:0002385157" }, limit: 1 } planSummary: IXSCAN { _id: 1 } keysExamined:0 docsExamined:0 ndeleted:0 numYields:0 queryHash:E3DFA639 planCacheKey:5384A64B locks:{ ParallelBatchWriterMode: { acquireCount: { r: 1 } }, ReplicationStateTransition: { acquireCount: { w: 1 } }, Global: { acquireCount: { w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } }, Mutex: { acquireCount: { r: 1 } } } flowControl:{ acquireCount: 1, timeAcquiringMicros: 1 } storage:{} 3461ms
2022-06-10T20:03:52.009+0000 I WRITE [conn264] remove ipv6bindings_1.ipv6bindings command: { q: { _id: "6206:ae00:0000:3b51", sessionid: "ClpGx0:172.16.241.111:5933:1654866756:0002385193" }, limit: 1 } planSummary: IXSCAN { _id: 1 } keysExamined:0 docsExamined:0 ndeleted:0 numYields:0 queryHash:E3DFA639 planCacheKey:5384A64B locks:{ ParallelBatchWriterMode: { acquireCount: { r: 1 } }, ReplicationStateTransition: { acquireCount: { w: 1 } }, Global: { acquireCount: { w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } }, Mutex: { acquireCount: { r: 1 } } } flowControl:{ acquireCount: 1 } storage:{} 3317ms
2022-06-10T20:03:52.016+0000 I COMMAND [conn52] command ipv6bindings_1.ipv6bindings command: find { find: "ipv6bindings", filter: { _id: "3101:0000:0009:11c1" }, limit: 1, singleBatch: true, $db: "ipv6bindings_1", $clusterTime: { clusterTime: Timestamp(1654891428, 246), signature: { hash: BinData(0, 1AE068B7041E0DF364A48B3B76192706D425B50D), keyId: 7072624733400858628 } }, lsid: { id: UUID("55caafc8-fbb1-4ede-a1d8-dd788fadb341") }, $readPreference: { mode: "nearest" } } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:520 locks:{ ReplicationStateTransition: { acquireCount: { w: 2 } }, Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } }, Mutex: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 3441ms
To look into what is exactly happening, and to better illustrate the regression taking place, can you please provide the following:
- For your workload, please run explain(true) on the exact query/queries you are experiencing high response time issues with to get additional ExecutionStats information. The default mode is "queryPlanner". MongoDB interprets true as "allPlansExecution" and false as "queryPlanner". This will help narrow down possible reasons for the regression in response time.
- The logs and FTDC of your PRIMARY node on 4.0 running the workload you are mentioning (so we can compare with the 4.2 data you submitted)
- Provide some reproduction steps so we can also verify this independently:
- What is the shape of your current documents in this case? Can you provide a sample document?
- Do you have an index on this collection?
- Is this running on a sharded collection? If so, is this the primary shard of the collection?
- Do you observe this latency running the command on mongos or directly on the node? (Or both?)
- I don't notice consistent spikes in query response time beyond the exact timestamp you mentioned (2022-06-10 20:03 UTC). Is this issue sporadic or is it always reproducible with a specific query? I notice you have constant workloads that don't have this latency normally, and I'm curious what the difference is between the queries you are triggering the latency with versus not triggering it with.
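For the first item, a sketch of the shape of that explain(true) command (the port, database, and _id value are copied from your earlier log excerpts purely as placeholders; substitute the exact query you see regressing, and add credentials if authentication is required):
# Run the problematic find with full execution statistics ("allPlansExecution").
mongo --port 27029 ipv6bindings_1 --eval 'printjson(db.ipv6bindings.find({ _id: "3101:0000:0009:11c1" }).limit(1).explain(true))'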
I will also look into this further as well.
Christopher
|
|
Hi Christopher,
Could you please provide us with any temporary workaround for this issue, as we have a release planned soon (we have also requested this via our organization (Cisco) email)? It would really help us, as we are blocked on the release.
Thanks,
Kapil
|
|
Hi Christopher,
Logs for 4.0.27 are attached in the same zip file, as out of the 4 data-bearing members, 2 are on 4.0.27. Please find the details for the 4.0.27 members below:
mongo-27029-SECONDARY_8a.tar.gz
mongo-27029-SECONDARY_8b.tar.gz
Please find the query plans for both 4.0.27 and 4.2.20 below:
4.0.27:
rs-shard-ipv6-2:SECONDARY> db.drasessions.find().explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "drasessions_1.drasessions",
"indexFilterSet" : false,
"parsedQuery" :
{
},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "mongo-s109",
"port" : 27029,
"version" : "4.0.27",
"gitVersion" : "d47b151b55f286546e7c7c98888ae0577856ca20"
},
"ok" : 1,
"operationTime" : Timestamp(1656045744, 205),
"$clusterTime" : {
"clusterTime" : Timestamp(1656045744, 205),
"signature" : {
"hash" : BinData(0,"AZMjfFZeJm0AtPqm+P1EKDCoj8Q="),
"keyId" : NumberLong("7072624733400858628")
}
}
}
4.2.20:
rs-shard-ipv6-2:PRIMARY> db.drasessions.find().explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "drasessions_1.drasessions",
"indexFilterSet" : false,
"parsedQuery" : {
}
,
"queryHash" : "8B3D4AB8",
"planCacheKey" : "8B3D4AB8",
"winningPlan" :
{
"stage" : "COLLSCAN",
"direction" : "forward"
}
,
"rejectedPlans" : [ ]
},
"serverInfo" :
{
"host" : "mongo-s109",
"port" : 27029,
"version" : "4.2.18",
"gitVersion" : "f65ce5e25c0b26a00d091a4d24eec1a8b3a4c016"
}
,
"ok" : 1,
"$clusterTime" : {
"clusterTime" : Timestamp(1656045392, 89),
"signature" :
{
"hash" : BinData(0,"ZLZuuzhWqgu39YINJXNZ1aspBds="),
"keyId" : NumberLong("7072624733400858628")
}
},
"operationTime" : Timestamp(1656045392, 89)
}
Thanks,
Kapil
|
|
Hi Kapil,
Thank you for your patience! To look into this further, could you also provide the following from your testing:
- Mongo logs on 4.0 as well as 4.2 (you submitted 4.2 already)
- Add $explain to your queries so we can examine the query plan being used
Christopher
|
|
Hi Chris,
Gentle reminder! Did you find anything in the logs?
Thanks,
Kapil
|
|
Hi Chris,
I have uploaded the file (mongo-diag-log.tar.gz; the file structure is provided in the description) to the upload portal.
Kindly look into it.
Thanks,
Kapil
|
|
Hi Kapil,
I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.
For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:
- the mongod logs
- the $dbpath/diagnostic.data directory (the contents are described here)
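For example, for one member, a command along these lines would do (the paths shown are placeholders; use each member's --logpath and $dbpath):
# Archive the mongod log(s) and the diagnostic.data directory for one member;
# repeat for each node and upload the resulting archives.
tar czf mongo-member1-diagnostics.tar.gz /path/to/mongod.log* /path/to/dbpath/diagnostic.data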
Let us know when you've gone ahead and submitted your logs.
Regards,
Christopher
|
|
Troubleshooting performed:
- We stopped and restarted the load; we did not find high query response.
- We stopped and started all mongo instances and then did load testing; we did not find high query response.
- More details on the repro steps: high query responses are seen only on the VM that we rebooted (the error frequency is not fixed). We performed the test in a different setup as well; whichever VM we reboot, high query responses are seen from that VM's mongo primary member.
- We performed the same tests with full load, 50% of full load, and 25% of full load, and observed high query response in all cases.
Additional info:
- VM reboot time: 2022-06-10T14:50:25
- High query response increased (~6 sec) in the second occurrence, given below:
2022-06-11T18:41:56.312+0000 I COMMAND [conn59] command ipv6bindings_1.ipv6bindings command: find { find: "ipv6bindings", filter: { _id: "3101:0000:0003:94da" }, limit: 1, singleBatch: true, $db: "ipv6bindings_1", $clusterTime: { clusterTime: Timestamp(1654972910, 279), signature: { hash: BinData(0, 8E670E972822B0FF4FB23BD426B9847B6A7521F4), keyId: 7072624733400858628 } }, lsid: { id: UUID("0aea1d8a-3134-4b7c-8978-6861a3a16626") }, $readPreference: { mode: "nearest" } } planSummary: IDHACK keysExamined:1 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:519 locks:{ ReplicationStateTransition: { acquireCount: { w: 2 } }, Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } }, Mutex: { acquireCount: { r: 1 } } } storage:{} protocol:op_msg 5926ms
- High query response decreased (~1 sec) in the third occurrence, given below:
2022-06-11T19:51:42.497+0000 I WRITE [conn85] update drasessions_1.drasessions command: { q: { _id: { sessionid: "ClpGx1:172.16.241.111:5022:1654866656:0010848441" } }, u: { $set: { nextEvalTime: new Date(1655067101292), staleSessionExpiryCount: 3 } }, multi: false, upsert: false } planSummary: IDHACK keysExamined:1 docsExamined:1 nMatched:1 nModified:1 keysInserted:1 keysDeleted:1 numYields:1 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 2 } }, ReplicationStateTransition: { acquireCount: { w: 2 } }, Global: { acquireCount: { w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 2 } }, Mutex: { acquireCount: { r: 2 } } } flowControl:{ acquireCount: 2, timeAcquiringMicros: 1 } storage:{} 1203ms
Note: high query responses come at random intervals, varying from 1 sec to 6 sec.
|