[SERVER-54805] Mongo becomes unresponsive, Spike in Connections and FDs Created: 26/Feb/21 Updated: 13/Oct/21 Resolved: 24/Jun/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | MMAPv1 |
| Affects Version/s: | 3.6.17, 4.0.23 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | venkataramans rama | Assignee: | Lingzhi Deng |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Backport Requested: | v5.0 |
| Sprint: | Repl 2021-06-14, Repl 2021-06-28 |
| Participants: | |
| Description |
|
System Configuration
OS Version: CentOS 8
Server Configuration: free -g

Problem Description
We recently upgraded from 3.6.9 to 3.6.17, as 3.6.17 is the release that supports CentOS 8. Since this migration we have been hitting frequent issues where, after 1-2 days of running, the mongod members become unresponsive (to both the mongo shell and the Java API). So far only the secondary members are affected. While debugging, we found that lsof for the hanging mongod process reports a huge number of entries, around 32K+. netstat does not show a similar number, but once lsof reaches around 35K the member crashes. RAM and CPU are not the bottleneck, as we have plenty of free memory, and we are running the dbPath on tmpfs. Based on the Production/Operations Checklist, ulimit is not a concern either. However, we had vm.max_map_count set to 65530 while MongoDB recommends 128000; even when we increase the value, the replica member recovers immediately, but the connection count and the lsof count do not come down, so we are only postponing the crash by perhaps another week. We are therefore not sure how this kernel parameter would help.
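As a rough illustration (an editor's sketch, not part of the original report), the file-descriptor and memory-map counts discussed above can also be sampled directly from /proc instead of via lsof and netstat. The file name fdwatch.cpp and the idea of passing a mongod PID on the command line are assumptions for the example; reading another user's /proc entries may require root.

```cpp
// fdwatch.cpp -- editor's sketch: sample the per-process fd count, the
// memory-map count, and the vm.max_map_count limit straight from /proc.
// Build with: g++ -std=c++17 fdwatch.cpp -o fdwatch
#include <filesystem>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main(int argc, char** argv) {
    const std::string pid = (argc > 1) ? argv[1] : "self";

    // Open file descriptors -- roughly what `lsof -p <pid> | wc -l` reports.
    const auto fdCount = std::distance(
        std::filesystem::directory_iterator("/proc/" + pid + "/fd"),
        std::filesystem::directory_iterator{});

    // Memory mappings -- these, not fds, are what vm.max_map_count limits.
    long mapCount = 0;
    std::ifstream maps("/proc/" + pid + "/maps");
    for (std::string line; std::getline(maps, line);) ++mapCount;

    // The kernel limit the production notes recommend raising to 128000.
    long maxMapCount = 0;
    std::ifstream limit("/proc/sys/vm/max_map_count");
    limit >> maxMapCount;

    std::cout << "fds=" << fdCount << " maps=" << mapCount
              << " vm.max_map_count=" << maxMapCount << '\n';
    return 0;
}
```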
We have seen similar issues reported in JIRA, but those cases were closed due to unresponsiveness from the submitter:
https://jira.mongodb.org/browse/SERVER-46701
https://jira.mongodb.org/browse/SERVER-40625

The mongod logs show:
2021-02-24T22:50:02.603+0000 I - [listener] pthread_create failed: Resource temporarily unavailable
2021-02-24T22:50:02.603+0000 W EXECUTOR [conn480777] Terminating session due to error: InternalError: failed to create service entry worker thread
2021-02-24T22:50:03.686+0000 I - [listener] pthread_create failed: Resource temporarily unavailable
2021-02-24T22:50:03.686+0000 W EXECUTOR [conn480778] Terminating session due to error: InternalError: failed to create service entry worker thread
2021-02-24T22:50:03.690+0000 I - [listener] pthread_create failed: Resource temporarily unavailable
2021-02-24T22:50:03.690+0000 W EXECUTOR [conn480779] Terminating session due to error: InternalError: failed to create service entry worker thread
2021-02-24T22:50:03.709+0000 I - [listener] pthread_create failed: Resource temporarily unavailable
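For context, "Resource temporarily unavailable" is the strerror() text for EAGAIN, which pthread_create() returns when the process cannot create another thread (for example when thread-count or address-space limits are exhausted). Below is a minimal sketch of that failure path, assuming a plain pthreads program rather than MongoDB's actual listener code.

```cpp
// Editor's sketch (not MongoDB source): the failure mode behind the log
// lines above. pthread_create() returns EAGAIN when no more threads can be
// created, and strerror(EAGAIN) is "Resource temporarily unavailable".
#include <cstring>
#include <iostream>
#include <pthread.h>

void* worker(void*) {
    return nullptr;  // stands in for a worker servicing one connection
}

int main() {
    pthread_t tid;
    // Unlike most libc calls, pthread_create returns the error code directly.
    int rc = pthread_create(&tid, nullptr, worker, nullptr);
    if (rc != 0) {
        std::cerr << "pthread_create failed: " << std::strerror(rc) << '\n';
        return 1;  // mongod reacts by terminating the session instead of exiting
    }
    pthread_join(tid, nullptr);
    std::cout << "worker thread created and joined\n";
    return 0;
}
```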
Attached are the diagnostic.data archive, systemctl output, rs.status(), rs.conf(), and the mongod logs. |
| Comments |
| Comment by Lingzhi Deng [ 24/Jun/21 ] | ||||||||||||
|
| ||||||||||||
| Comment by Edwin Zhou [ 19/Apr/21 ] | ||||||||||||
|
Hi veramasu@hcl.com,
Thank you for following up, and I understand the security implications that come with downgrading. While we can't provide a specific timeline for when this fix will be available, we encourage you to follow
Best, |
| Comment by venkataramans rama [ 19/Apr/21 ] | ||||||||||||
|
Thank you, Edwin, for the confirmation. Unfortunately, there is no lower glibc version supported on CentOS 8, and we can't downgrade to CentOS 7 due to security vulnerabilities. May I know when the fix will be available? |
| Comment by Edwin Zhou [ 19/Apr/21 ] | ||||||||||||
|
Hi veramasu@hcl.com,
I apologize for the delay in my follow-up and greatly appreciate your patience. We suspect this behavior stems from a glibc bug in v2.27 linked by neanton@gmail.com above. You can find additional details about the effects of this bug on
To work around this issue at this time, you may safely continue to restart any hung mongod instances. If possible, we recommend downgrading to a safe version of glibc (which we believe to be a version older than v2.27) or upgrading once this particular bug is patched. You may also consider downgrading to CentOS 7.
Best, |
| Comment by Andy Schwerin [ 14/Apr/21 ] | ||||||||||||
|
After a fair amount of research, I've filed | ||||||||||||
| Comment by venkataramans rama [ 12/Apr/21 ] | ||||||||||||
|
Hi Edwin, do you have any new findings that you can share with us? |
| Comment by Edwin Zhou [ 06/Apr/21 ] | ||||||||||||
|
Hi veramasu@hcl.com, We highly appreciate you taking the time to upgrade to MongoDB 4.0! We're continuing our investigation on this issue and will update this ticket with our findings. Best, | ||||||||||||
| Comment by venkataramans rama [ 06/Apr/21 ] | ||||||||||||
|
Yes @Anton, we had been running fine with CentOS 7 for years without any issues. |
| Comment by Anton Neznaienko [ 06/Apr/21 ] | ||||||||||||
|
If that's a glibc issue (https://sourceware.org/bugzilla/show_bug.cgi?id=25847), you can try downgrading to CentOS 7 and see if that helps. |
| Comment by venkataramans rama [ 05/Apr/21 ] | ||||||||||||
|
Hi Edwin and Team,
As suggested, we have migrated MongoDB from 3.6.17 to 4.0.23, but we are observing the same issue in 4.0.23 as well. I've uploaded two sets of logs from all four members of the replica set, as well as the GDB output of the failed secondary. Could you please re-open the case? |
| Comment by Edwin Zhou [ 24/Mar/21 ] | ||||||||||||
|
Hi veramasu@hcl.com,
The ldd output that you provided doesn't make clear why we would be running into this issue. Since we don't know exactly what's causing the oplog application thread to get stuck, and we don't have a working reproduction right now, we're not able to diagnose the issue. Given that MongoDB 3.6 is approaching EOL, I recommend upgrading to MongoDB 4.0 so we can see whether the issue persists in a later version. In the meantime, I will close this ticket as incomplete. If the issue persists once you are able to upgrade to MongoDB 4.0, please let us know and we will be happy to investigate further!
Best, |
| Comment by venkataramans rama [ 24/Mar/21 ] | ||||||||||||
|
Hi Edwin, Can you please comment on the ldd difference between 3.6.9 and 3.6.17? | ||||||||||||
| Comment by Edwin Zhou [ 23/Mar/21 ] | ||||||||||||
|
Hi veramasu@hcl.com,
We don't know whether this issue will persist after upgrading to 4.0, but you will be able to continue using MMAPv1 after the upgrade. If you choose to upgrade to 4.2 or later, you will need to change your storage engine to WiredTiger.
Thanks, | ||||||||||||
| Comment by venkataramans rama [ 23/Mar/21 ] | ||||||||||||
|
Hi Edwin,
Thanks again for your response. For us it would take time to migrate to WiredTiger; even if we upgrade to 4.0, we would like to keep the storage engine as MMAPv1.
Also, one minor note on the difference in dependent libraries between CentOS 7 and CentOS 8. The following libraries are not shown in the ldd output of 3.6.17 on CentOS 8. Will that make any difference?

ldd /usr/bin/mongod (3.6.17 on CentOS 8)
    linux-vdso.so.1 (0x00007ffde4188000)
    libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fcf497df000)
    libcrypto.so.1.1 => /lib64/libcrypto.so.1.1 (0x00007fcf492f9000)
    libssl.so.1.1 => /lib64/libssl.so.1.1 (0x00007fcf49065000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007fcf48e61000)
    librt.so.1 => /lib64/librt.so.1 (0x00007fcf48c59000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fcf488d7000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fcf486bf000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fcf4849f000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fcf480dc000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fcf4cb36000)
    libz.so.1 => /lib64/libz.so.1 (0x00007fcf47ec5000)

ldd /usr/bin/mongod (3.6.9 on CentOS 7)
    linux-vdso.so.1 => (0x00007ffc636fd000)
    libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fafb7105000)
    libcrypto.so.10 => /lib64/libcrypto.so.10 (0x00007fafb6ca3000)
    libssl.so.10 => /lib64/libssl.so.10 (0x00007fafb6a31000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007fafb682d000)
    librt.so.1 => /lib64/librt.so.1 (0x00007fafb6625000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fafb6323000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fafb610d000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fafb5ef1000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fafb5b24000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fafba42a000)
    libz.so.1 => /lib64/libz.so.1 (0x00007fafb590e000)
    libgssapi_krb5.so.2 => /lib64/libgssapi_krb5.so.2 (0x00007fafb56c1000)
    libkrb5.so.3 => /lib64/libkrb5.so.3 (0x00007fafb53d8000)
    libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007fafb51d4000)
    libk5crypto.so.3 => /lib64/libk5crypto.so.3 (0x00007fafb4fa1000)
    libkrb5support.so.0 => /lib64/libkrb5support.so.0 (0x00007fafb4d91000)
    libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007fafb4b8d000)
    libselinux.so.1 => /lib64/libselinux.so.1 (0x00007fafb4966000)
    libpcre.so.1 => /lib64/libpcre.so.1 (0x00007fafb4704000)
|
| Comment by Edwin Zhou [ 19/Mar/21 ] | ||||||||||||
|
Hi veramasu@hcl.com, We identified the behavior that occurs, but we don't exactly know why it happens. The MMAPv1 storage engine has been deprecated in favor of WiredTiger since MongoDB 4.0, which correlates with when this issue went away. You can find more information on the storage engine, as well as migrating to WiredTiger from MMAPv1 as a replica set in our docs. We'd love to hear back from you if this issue persists after you've upgraded to MongoDB 4.0. Best, | ||||||||||||
| Comment by venkataramans rama [ 17/Mar/21 ] | ||||||||||||
|
Hi Edwin, Thank you for the analysis.
Thanks, Venkataraman
| ||||||||||||
| Comment by Edwin Zhou [ 17/Mar/21 ] | ||||||||||||
|
Hi veramasu@hcl.com,
Thank you for your patience while we investigate this issue. After taking a look at the gdb dump, it appears many threads on the secondary nodes are stuck waiting for the main oplog application thread. The main oplog application thread is waiting in SyncTail::multiApply.
This means it has divided up a batch of oplog entries and handed them to the pool of 16 parallel application threads, and is now waiting for that work to complete in ThreadPool::waitForIdle. However, none of those 16 worker threads is active, so they should have signaled the main oplog application thread that they had finished their work, but for some reason didn't. This issue has been identified in the past but we haven't resolved it; in more recent versions it seems to have gone away. Since MongoDB 3.6 is scheduled to go EOL next month as stated in our support policy, can you upgrade to 4.0 to see if this problem persists?
Thanks, |
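Below is a minimal sketch of the wait-for-idle pattern described above, using a simplified pool rather than MongoDB's actual ThreadPool class: the batch leader hands work to the workers and then blocks on a condition variable until the in-progress count reaches zero. If the final notification is lost (as can happen with the glibc 2.27 condition-variable bug referenced later in this ticket), the leader stays blocked even though every worker is idle.

```cpp
// Editor's sketch (not MongoDB source): a simplified pool showing the
// waitForIdle() pattern. The leader depends entirely on the workers'
// final notify to wake up again.
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

class MiniPool {
public:
    void run(const std::vector<std::function<void()>>& tasks) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _inProgress = tasks.size();
        }
        for (const auto& task : tasks) {
            _threads.emplace_back([this, task] {
                task();
                std::lock_guard<std::mutex> lk(_mutex);
                if (--_inProgress == 0) {
                    _idleCV.notify_all();  // the wakeup the leader is waiting for
                }
            });
        }
    }

    // Analogous to ThreadPool::waitForIdle(): block until the whole batch is
    // done. A lost wakeup here leaves the leader stuck even though all
    // workers are idle.
    void waitForIdle() {
        std::unique_lock<std::mutex> lk(_mutex);
        _idleCV.wait(lk, [this] { return _inProgress == 0; });
    }

    ~MiniPool() {
        for (auto& t : _threads) t.join();
    }

private:
    std::mutex _mutex;
    std::condition_variable _idleCV;
    std::size_t _inProgress = 0;
    std::vector<std::thread> _threads;
};

int main() {
    MiniPool pool;
    // 16 tasks stand in for the 16 parallel oplog application threads.
    std::vector<std::function<void()>> batch(16, [] { /* apply one chunk of oplog entries */ });
    pool.run(batch);
    pool.waitForIdle();  // the main applier resumes only after this returns
    std::cout << "batch applied\n";
    return 0;
}
```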
| Comment by venkataramans rama [ 15/Mar/21 ] | ||||||||||||
|
Hi Edwin,
While I wait for your investigation, here is an observation from currentOp whenever we hit the issue: rsSync shows the Global lock held in mode W, and it may be stuck on that lock. All other operations, including listDatabases, are waiting for the lock.

listDatabases in currentOp:
{
    "host" : "SM10:27717",
    "desc" : "conn204498",
    "connectionId" : 204498,
    "client" : "127.0.0.1:59912",
    "appName" : "MongoDB Shell",
    "clientMetadata" : {
        "application" : { "name" : "MongoDB Shell" },
        "driver" : { "name" : "MongoDB Internal Client", "version" : "3.6.17" },
        "os" : {
            "type" : "Linux",
            "name" : "CentOS Linux release 8.1.1911 (Core) ",
            "architecture" : "x86_64",
            "version" : "Kernel 4.18.0-193.14.2.el8_2.x86_64"
        }
    },
    "active" : true,
    "currentOpTime" : "2021-03-11T00:24:12.761+0000",
    "opid" : 736216208,
    "killPending" : true,
    "secs_running" : NumberLong(6535),
    "microsecs_running" : NumberLong("6535554063"),
    "op" : "command",
    "ns" : "admin.$cmd",
    "command" : {
        "listDatabases" : 1,
        "lsid" : { "id" : UUID("f3d38946-56de-497c-bacb-f8cf77357ca8") },
        "$readPreference" : { "mode" : "secondaryPreferred" },
        "$db" : "admin"
    },
    "numYields" : 0,
    "locks" : { "Global" : "r" },
    "waitingForLock" : true,
    "lockStats" : {
        "Global" : {
            "acquireCount" : { "r" : NumberLong(1) },
            "acquireWaitCount" : { "r" : NumberLong(1) },
            "timeAcquiringMicros" : { "r" : NumberLong("6535434929") }
        }
    }
}

rsSync in currentOp:
{
    "host" : "SM10:27717",
    "desc" : "rsSync",
    "active" : true,
    "currentOpTime" : "2021-03-11T00:24:12.760+0000",
    "opid" : 735821910,
    "op" : "none",
    "ns" : "",
    "command" : { },
    "numYields" : 0,
    "locks" : { "Global" : "W" },
    "waitingForLock" : false,
    "lockStats" : {
        "Global" : { "acquireCount" : { "r" : NumberLong(3), "w" : NumberLong(3), "W" : NumberLong(1) } },
        "MMAPV1Journal" : { "acquireCount" : { "r" : NumberLong(2), "w" : NumberLong(6) } },
        "Database" : { "acquireCount" : { "r" : NumberLong(2), "w" : NumberLong(3) } },
        "Collection" : { "acquireCount" : { "R" : NumberLong(1), "W" : NumberLong(3) } }
    }
}
|
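As an analogy for the lock state shown above (an editor's sketch using a plain reader-writer lock, not MongoDB's lock manager): rsSync holds the Global lock in exclusive mode "W" while listDatabases asks for shared mode "r", so the reader cannot proceed until the exclusive holder releases, and every other operation reports waitingForLock in the meantime.

```cpp
// Editor's analogy: an exclusive holder ("W", like rsSync) blocks a shared
// acquirer ("r", like listDatabases) until it releases the lock.
#include <chrono>
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <thread>

std::shared_mutex globalLock;

int main() {
    // Stand-in for rsSync: takes the exclusive (W) lock and sits on it.
    std::thread writer([] {
        std::unique_lock<std::shared_mutex> w(globalLock);
        std::cout << "rsSync-like thread holds Global W\n";
        std::this_thread::sleep_for(std::chrono::seconds(3));  // simulate being stuck
        std::cout << "Global W released\n";
    });

    std::this_thread::sleep_for(std::chrono::milliseconds(100));

    // Stand-in for listDatabases: needs a shared (r) lock, so it reports
    // waitingForLock until the writer lets go.
    std::thread reader([] {
        std::cout << "listDatabases-like thread waitingForLock...\n";
        std::shared_lock<std::shared_mutex> r(globalLock);
        std::cout << "listDatabases-like thread acquired Global r\n";
    });

    writer.join();
    reader.join();
    return 0;
}
```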
| Comment by Edwin Zhou [ 04/Mar/21 ] | ||||||||||||
|
Hi veramasu@hcl.com, I appreciate your patience while we investigate this issue. We are still reviewing the diagnostics that you've provided us and will post updates to this ticket as we further understand this issue. Best, | ||||||||||||
| Comment by venkataramans rama [ 04/Mar/21 ] | ||||||||||||
|
Hi Edwin, | ||||||||||||
| Comment by venkataramans rama [ 02/Mar/21 ] | ||||||||||||
|
Thank you for setting up the secure portal. I have uploaded GDB output from 2 members. | ||||||||||||
| Comment by Dmitry Agranat [ 02/Mar/21 ] | ||||||||||||
|
Hi veramasu@hcl.com, you can upload everything into this secure portal | ||||||||||||
| Comment by venkataramans rama [ 02/Mar/21 ] | ||||||||||||
|
Hi Edwin,
Thanks for your initial analysis and response. FYI, we quickly tried the latest MongoDB 3.6.22 as well and ran into the same issue. The GDB dump and logs are around 250 MB; is there another place where I can upload such large files?
Thanks, Venkataraman |
| Comment by Edwin Zhou [ 01/Mar/21 ] | ||||||||||||
|
Hi veramasu@hcl.com,
Thank you for reporting this issue and providing the logs and FTDC data; it really helps with beginning this investigation. To help identify why a particular node is unresponsive, would you please collect gdb output from the unresponsive node the next time this issue occurs and upload the file?
Best, |