[SERVER-63402] High query response time for find operations in MongoDB 4.0.27 with the MMAPv1 storage engine at random intervals (5/7/12/20 hours) Created: 08/Feb/22  Updated: 28/Feb/22  Resolved: 28/Feb/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.27
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: KAPIL GUPTA Assignee: Edwin Zhou
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2022-02-22 at 3.12.10 PM.png     PNG File Screen Shot 2022-02-23 at 11.39.56 AM.png     Text File mogostat.txt     Text File rs.status.txt    
Issue Links:
Duplicate
duplicates SERVER-56054 Change minThreads value for replicati... Closed
Operating System: ALL
Participants:

 Description   

Scenario - We have an application that performs CRUD operations on the database. For the past two years we have been running MongoDB 3.6.9 and the system has worked fine without any issues. After upgrading the database from 3.6.9 to 4.0.27, we started getting high query response times for find operations. The behavior is not consistent; sometimes the high response times appear after 5 hours, 7 hours, or even 11 hours.



 Comments   
Comment by Edwin Zhou [ 28/Feb/22 ]

Hi kg3634@gmail.com,

Thank you for following up to confirm that you've experienced this issue more frequently after applying the patch.

I will now close this ticket as a duplicate of SERVER-56054, which works around the glibc 2.27 bug. I recommend upgrading to a version of glibc that is not affected by this bug.
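For reference, a minimal sketch of how to confirm which glibc build a node is running before deciding whether it is affected (assumes an Ubuntu/Debian host where the package is named libc6; adjust for other distributions):

# Report the glibc version used by the dynamic loader
ldd --version | head -n 1

# On Ubuntu/Debian, show the installed and candidate libc6 package versions
apt-cache policy libc6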

Best,
Edwin

Comment by KAPIL GUPTA [ 25/Feb/22 ]

Hi Edwin,

In our case, after applying the patch for glibc 2.27 (rebuilding and reinstalling the same package with the fix), the timeout frequency actually increased. On Ubuntu 18.04 we are not able to downgrade/upgrade glibc to a different version, because doing so requires downgrading/upgrading other dependencies in parallel, which could disturb other running services.

Could we reduce the value of maxIdleThreadAge? If yes, what would be the minimum recommended value?

 

Thanks,

Kapil

 

Comment by Edwin Zhou [ 24/Feb/22 ]

Hi kg3634@gmail.com,

Thank you for your patience. This appears to be the same glibc 2.27 bug that SERVER-56054 attempts to work around. Prior to SERVER-56054, the server would have stalled indefinitely and we would have seen the oplog buffer completely fill and oplog application stall. After SERVER-56054, oplog application threads will time out due to maxIdleThreadAge, which defaults to 30 seconds, the same duration as the stalls you observed.

According to the thread that you linked, the glibc patch you applied for 2.27 doesn't fix this bug, but reduces the likelihood of it happening on Ubuntu 18.04. Have you observed that the hangs occur less frequently after applying the patch?

You may be able to work around this by updating or downgrading glibc to a version that is not affected by this bug.

Best,
Edwin

Comment by Edwin Zhou [ 23/Feb/22 ]

Hi kg3634@gmail.com,

Thank you for uploading the mongod.log files and FTDC. This gives us good context around the gdb samples.

Best,
Edwin

Comment by KAPIL GUPTA [ 23/Feb/22 ]

Hi Edwin,

I have uploaded the mongod logs for the unresponsive node to the support uploader. Please find the details below:

File Name: mongolog20_02_2022_13_25.tar.gz

Unresponsive time duration - from 2022-02-20_13-25-11 to 2022-02-20_13-25-40 UTC

Thanks,
Kapil

Comment by Edwin Zhou [ 22/Feb/22 ]

Hi kg3634@gmail.com,

Thank you for uploading the gdb output. Can you please also upload an archive (tar or zip) of the mongod.log files and the $dbpath/diagnostic.data directory covering the timestamps where you collected stack traces? It will aid in the investigation. In the meantime, I will take a look at the stack traces.
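For reference, one way to package these files for a single node before uploading (a sketch only; the archive name is a placeholder, and the log and dbpath locations are taken from the example startup command in the 08/Feb/22 comment below, so adjust them per node):

# Archive the mongod log and the FTDC diagnostic.data directory for one node
tar czf node1-diagnostics.tar.gz /data/db/mongo-27035.log /mmapv1-tmpfs-27035/diagnostic.data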

Best,
Edwin

Comment by KAPIL GUPTA [ 22/Feb/22 ]

Hi Edwin,

Gentle Reminder!

This is just a gentle reminder for any workaround (WA), as this issue is badly impacting our product and we need to validate this release to move forward.

 

Thanks,

Kapil

Comment by KAPIL GUPTA [ 20/Feb/22 ]

Hi Edwin,

I have uploaded the gdb output for the unresponsive node to the support uploader. Please find the details below:

File Name: gdboutput.tar.gz

Unresponsive time duration: from 2022-02-20_13-25-11 to 2022-02-20_13-25-40 UTC

The list of gdb output files for the 5 timestamps is given below:

gdboutput:
total 4724
27031-gdb_2022-02-20_13-25-16.txt
27031-gdb_2022-02-20_13-25-22.txt
27031-gdb_2022-02-20_13-25-27.txt
27031-gdb_2022-02-20_13-25-33.txt
27031-gdb_2022-02-20_13-25-38.txt

Please let me know if any other information is required.

Thanks,
Kapil

Comment by Edwin Zhou [ 17/Feb/22 ]

Hi kg3634@gmail.com,

Thank you for the update. Unfortunately, I'm not able to provide any workarounds since there's not enough information about the issue so far. I look forward to receiving the gdb output.

Best,
Edwin

Comment by KAPIL GUPTA [ 15/Feb/22 ]

Hi Edwin,

Thank you for the analysis. 

As part of 4.0.26, the replWriterMinThreadCount value is showing zero, as given below:

rs-app_shardAB-ipv6-5:SECONDARY> db.adminCommand({ getParameter: 1, replWriterMinThreadCount: 1 })
{
    "replWriterMinThreadCount" : 0,
    "ok" : 1,
    "operationTime" : Timestamp(1644901703, 7),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1644901703, 7),
        "signature" : {
            "hash" : BinData(0,"B5XppE+gur52j/qqWJxPbMIk/gQ="),
            "keyId" : NumberLong("7042384427015471107")
        }
    }
}
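For context, a sketch of how the parameter above could be raised at startup. This assumes replWriterMinThreadCount is settable via --setParameter when mongod starts (it is reported by getParameter above) and that 16 matches the replication writer pool's maximum thread count on this deployment; neither point is confirmed in this ticket, so treat this as illustrative rather than a recommended workaround.

# Hypothetical startup flag: keep the replication writer pool's minimum thread count
# raised so idle writer threads are not repeatedly torn down and recreated
mongod --setParameter replWriterMinThreadCount=16 <other existing startup options>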

For the glibc 2.27 issue, we have already tried the workaround, but we could not get the issue resolved with it.

Could you please recommend any workaround for the issue in the meanwhile?

I will provide the gdb command output requested above in some time.

 

Thanks,
Kapil

Comment by Edwin Zhou [ 14/Feb/22 ]

Hi kg3634@gmail.com,

Thank you for providing initial diagnostic data across the cluster. I suspect this might be a recurrence of SERVER-56054, which fixes secondary hangs caused by a bug in glibc v2.27. However, SERVER-56054 was backported to 4.0.26, so seeing this issue on 4.0.27 would be unexpected.

To help confirm the behavior that's occurring, can you capture gdb stack traces on the unresponsive node the next time this behavior occurs?

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt
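If a single sample is hard to catch, a minimal sketch for collecting several stack-trace samples a few seconds apart (the gdb command itself is the one above; the loop count and 5-second interval are illustrative assumptions):

# Capture 5 gdb samples of all mongod threads, roughly 5 seconds apart
for i in 1 2 3 4 5; do
  gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt
  sleep 5
done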

Best,
Edwin

Comment by KAPIL GUPTA [ 12/Feb/22 ]

Hi Edwin,

I have uploaded the required logs to the support uploader. Please find the details below:

File Name: MongoQueryHighReponseLogs.tar.gz

HighQueryReponse Issue Time: 2022-02-11T22:51 UTC

The directory structure of the uploaded log files is given below:

QueryHighReponseLogs/:
Arbiter1 Arbiter2 Arbiter3 Primary rs.status Secondary1withHighReponse Secondary2 Secondary3

QueryHighReponseLogs/Arbiter1:
mongo-27035.log

QueryHighReponseLogs/Arbiter2:
mongo-27035.log

QueryHighReponseLogs/Arbiter3:
mongo-27035.log

QueryHighReponseLogs/Primary:
diagnostic.data mongo-27035.log

QueryHighReponseLogs/Primary/diagnostic.data:
metrics.2022-02-11T12-50-49Z-00000 metrics.2022-02-12T05-00-49Z-00000 metrics.interim

QueryHighReponseLogs/Secondary1withHighReponse:
diagnostic.data mongo-27035.log

QueryHighReponseLogs/Secondary1withHighReponse/diagnostic.data:
metrics.2022-02-11T12-35-12Z-00000 metrics.interim

QueryHighReponseLogs/Secondary2:
diagnostic.data mongo-27035.log

QueryHighReponseLogs/Secondary2/diagnostic.data:
metrics.2022-02-11T12-35-45Z-00000 metrics.interim

QueryHighReponseLogs/Secondary3:
diagnostic.data mongo-27035.log

QueryHighReponseLogs/Secondary3/diagnostic.data:
metrics.2022-02-11T12-50-14Z-00000 metrics.interim

Note: For the arbiters, only mongod logs are present (diagnostics are disabled, as they are voting-only members and do not hold actual data).

Please let me know if any other info is required for the analysis.

Thanks,

Kapil

Comment by Edwin Zhou [ 11/Feb/22 ]

Hi kg3634@gmail.com,

Thank you for your report. Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) for each node in the replica set and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Best,
Edwin

Comment by KAPIL GUPTA [ 08/Feb/22 ]

Additional Missing Information:

Testing Environment:
Java client driver (application side): 3.12.9
MongoDB version: 4.0.27
Storage engine: MMAPv1
Replica set: 7 members (4 non-arbiter and 3 arbiter, all voting members)
Startup command for one of the members, as an example:
mongod --keyFile=/mongodb.key --storageEngine mmapv1 --nojournal --noprealloc --smallfiles --ipv6 --bind_ip_all --port 27035 --dbpath=/mmapv1-tmpfs-27035 --replSet rs-app_shardAB-ipv6-7 --quiet --slowms 500 --logpath /data/db/mongo-27035.log --oplogSize 3221 --setParameter diagnosticDataCollectionEnabled=true --logappend --logRotate reopen

 

Messages:
As a sample, we got this kind of message in the MongoDB secondary logs:

2022-02-02T02:55:54.392+0000 I COMMAND [conn554] command drasessions_1.drasessions command: find { find: "drasessions", filter: { _id: { sessionid: "ClpGx3:172.16.241.40:15124:1643368779:0080300316" } }, limit: 1, singleBatch: true, $db: "drasessions_1", $clusterTime: { clusterTime: Timestamp(1643770525, 464), signature: { hash: BinData(0, A9E0739EB1E3BBA9EF776A9FCEC9342E9457D221), keyId: 7042384422720503811 } }, lsid: { id: UUID("8b321501-be08-4fa8-ada5-367cc1eb555e") }, $readPreference: { mode: "nearest" } } planSummary: IDHACK keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:0 nreturned:0 reslen:239 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 28911648 } }, MMAPV1Journal: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { R: 1 } } } protocol:op_msg 28911ms
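A quick, non-authoritative way to pull similar slow entries out of a mongod log (the log path comes from the startup command above; the 5-digit filter is only an illustrative threshold for operations taking 10 seconds or more):

# List COMMAND log lines whose total duration is 10,000 ms or more
grep " I COMMAND " /data/db/mongo-27035.log | grep -E " [0-9]{5,}ms$"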

 

Troubleshooting performed so far:

  • We have already checked network latency, CPU, RAM, and disk space at the VM and blade level, and found no issues. We also tested MongoDB 3.6.9 in the same environment and configuration, and no issue was observed there.
  • We enabled mongostat (output attached as a reference) and found the following suspect points:
  1. For one of the secondaries there are no CRUD operations for 26 seconds (*0 *0 *0 *0 observed in mongostat), then suddenly a burst of CRUD operations (*2444 110 *5781 *1816 observed in mongostat) appears on that secondary. No connection-lost message is found in mongostat for that secondary. This pattern occurs whenever we get the high response time (28 seconds) on that secondary.
  2. For the Primary, mongostat always shows *0 for insert operations, yet replication of inserts is happening on the secondaries, as shown below:

host insert query update delete getmore command flushes mapped vsize res faults qrw arw net_in net_out conn set repl time
[2606:ae00:3001:8311:172:16:244:59]:27032 *245 36 *595 *155 0 65|0 0 6.63G 4.49G 0 0|0 0|0 24.9k 124k 144 rs-app_shardAB-ipv6-4 SEC Feb 3 17:06:46.063
[2606:ae00:3001:8311:172:16:244:60]:27032 *0 *0 *0 *0 0 63|0 0 6.61G 4.49G 0 0|0 0|0 11.3k 71.9k 184 rs-app_shardAB-ipv6-4 SEC Feb 3 17:06:46.068
[2606:ae00:3001:8311:172:16:244:b]:27032 *227 40 *589 *155 0 64|0 0 6.75G 4.48G 0 0|0 0|0 26.6k 157k 144 rs-app_shardAB-ipv6-4 SEC Feb 3 17:06:46.075
[2606:ae00:3001:8311:172:16:244:c]:27032 *0 39 827 150 68 119|0 0 6.79G 4.49G 0 0|0 0|0 592k 1.49m 371 rs-app_shardAB-ipv6-4 PRI Feb 3 17:06:45.627
localhost:27032 *0 41 795 162 70 120|0 0 6.79G 4.49G 0 0|0 1|0 589k 1.50m 371 rs-app_shardAB-ipv6-4 PRI Feb 3 17:06:47.584
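For completeness, a sketch of how output like the above can be collected against a replica set. The exact invocation used for this ticket is not recorded, so the seed host, 1-second poll interval, and omitted credentials are assumptions; mongostat, --host/--port, and --discover are standard tool options.

# Poll every second and discover all replica-set members from the seed host
# (add -u/-p/--authenticationDatabase as required for keyFile/auth-enabled deployments)
mongostat --host localhost --port 27032 --discover 1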

Queries:

  • Why is there no CRUD activity on one of the secondaries for 26 seconds, even though no connection is lost between the primary and that secondary?
  • Why does mongostat always show 0 insert operations on the Primary, even though replication of inserts is happening on the secondaries?
  • Why is this behavior not observed in 3.6.9, where the same configuration and environment were used?

Kindly reply to the above queries; it would really help us move forward, as this is becoming a big blocker for using MongoDB in our environment.

Attachments:

  • We have attached the mongostat output for one of the occurrences where we observed the high query response time.
  • rs.status output for one of the replica sets.