Priority: Major - P3
Affects Version/s: 3.6.17, 4.0.23
Fix Version/s: None
Sprint: Repl 2021-06-14, Repl 2021-06-28
OS Version: CentOS 8
CentOS Linux release 8.1.1911 (Core)
Mongo Version: 3.6.17
Storage Engine: mmapV1
Storage Type: Data Path in tmpfs
RAM: 160 GB
              total        used        free      shared  buff/cache   available
Mem:            157           7          32          51         117          96
Swap:             3           0           3
We recently upgraded from 3.6.9 to 3.6.17, because 3.6.17 is the release that supports CentOS 8. Since the upgrade, the mongod members frequently become unresponsive after 1-2 days of uptime, both from the mongo shell and through the Java API. So far only secondary members have been affected. While debugging, we found that lsof for the hung mongod process reports a very large number of entries, around 32K or more. netstat does not show a comparable number, but once the count reaches around 35K the process crashes. RAM and CPU are not the bottleneck, as we have enough free memory. The dbPath is on tmpfs.
Based on the Production/Operation Checklist, ulimit is not a concern either. However, vm.max_map_count is set to 65530, whereas MongoDB recommends 128000. When we increase the value, the replica member recovers immediately, but the connection count and the lsof count do not go down. We are probably only postponing the crash by another week or so, so we are not sure how this kernel parameter is supposed to help.
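To tell whether the process is approaching vm.max_map_count or a thread limit before it hangs, the counts can be sampled from /proc. The following is a minimal sketch, not a confirmed diagnosis: the function names (`count_mappings`, `thread_count`, `headroom`, `report`) are hypothetical, and the idea that mappings or threads are the exhausted resource is an assumption based on the pthread_create errors in the logs. It relies only on the standard Linux files /proc/<pid>/maps, /proc/<pid>/status, /proc/sys/vm/max_map_count and /proc/sys/kernel/threads-max.

```python
import sys


def count_mappings(maps_text: str) -> int:
    """Count memory mappings: /proc/<pid>/maps has one line per mapping."""
    return sum(1 for line in maps_text.splitlines() if line.strip())


def thread_count(status_text: str) -> int:
    """Return the value of the 'Threads:' field from /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    raise ValueError("no 'Threads:' field found")


def headroom(used: int, limit: int) -> float:
    """Fraction of the limit still unused; 0.0 means the limit is reached."""
    return max(0.0, 1.0 - used / limit)


def read_text(path: str) -> str:
    with open(path) as handle:
        return handle.read()


def report(pid: int) -> dict:
    """Sample mapping and thread usage for a running process (Linux only)."""
    mappings = count_mappings(read_text(f"/proc/{pid}/maps"))
    threads = thread_count(read_text(f"/proc/{pid}/status"))
    max_map_count = int(read_text("/proc/sys/vm/max_map_count"))
    threads_max = int(read_text("/proc/sys/kernel/threads-max"))
    return {
        "mappings": mappings,
        "max_map_count": max_map_count,
        "mapping_headroom": headroom(mappings, max_map_count),
        "threads": threads,
        "threads_max": threads_max,
        "thread_headroom": headroom(threads, threads_max),
    }


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 check_limits.py <mongod pid>  (script name is illustrative)
    for key, value in report(int(sys.argv[1])).items():
        print(f"{key}: {value}")
```

Running this periodically against the mongod PID on a secondary would show whether the mapping count or the thread count climbs steadily toward its limit in the days before the hang. Reading /proc/<pid>/maps for another user's process usually requires running as that user or as root.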
We have seen similar issues reported in JIRA, but those cases were closed because the submitters stopped responding.
The mongo logs show:
2021-02-24T22:50:02.603+0000 I - [listener] pthread_create failed: Resource temporarily unavailable
2021-02-24T22:50:02.603+0000 W EXECUTOR [conn480777] Terminating session due to error: InternalError: failed to create service entry worker thread
2021-02-24T22:50:03.686+0000 I - [listener] pthread_create failed: Resource temporarily unavailable
2021-02-24T22:50:03.686+0000 W EXECUTOR [conn480778] Terminating session due to error: InternalError: failed to create service entry worker thread
2021-02-24T22:50:03.690+0000 I - [listener] pthread_create failed: Resource temporarily unavailable
2021-02-24T22:50:03.690+0000 W EXECUTOR [conn480779] Terminating session due to error: InternalError: failed to create service entry worker thread
2021-02-24T22:50:03.709+0000 I - [listener] pthread_create failed: Resource temporarily unavailable
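To see when the thread-creation failures start and how quickly they repeat, the log can be tallied per second. This is a rough sketch under stated assumptions: `split_entries` and `failures_per_second` are hypothetical helper names, and the only things taken from the log itself are the timestamp layout and the "pthread_create failed" message. It splits on timestamps instead of newlines, so it works whether or not line breaks survive when the log is pasted.

```python
import re
import sys
from collections import Counter

# Timestamp layout as it appears in the excerpt: 2021-02-24T22:50:02.603+0000
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4}")


def split_entries(raw: str) -> list:
    """Split raw log text into entries, each starting at a timestamp."""
    starts = [match.start() for match in TIMESTAMP.finditer(raw)]
    ends = starts[1:] + [len(raw)]
    return [raw[begin:end].strip() for begin, end in zip(starts, ends)]


def failures_per_second(entries: list) -> Counter:
    """Count 'pthread_create failed' entries, keyed by timestamp to the second."""
    counts = Counter()
    for entry in entries:
        if "pthread_create failed" in entry:
            counts[entry[:19]] += 1  # first 19 chars: YYYY-MM-DDTHH:MM:SS
    return counts


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python3 tally_failures.py <path to mongod log>  (name is illustrative)
    with open(sys.argv[1]) as handle:
        tally = failures_per_second(split_entries(handle.read()))
    for second, count in sorted(tally.items()):
        print(second, count)
```

Comparing the first second in this tally with the time the lsof count crossed its peak would indicate whether the failures begin exactly when a limit is hit or only afterwards.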
Attached: diagnostic.data, systemctl output, rs.status(), rs.conf(), and mongo logs.