Hi,
We have a sharded cluster on 8 servers, organized as 4 replica sets with this structure:
- replicaset1: server01a / server01b
- replicaset2: server02a / server02b
- replicaset3: server03a / server03b
- replicaset4: server04a / server04b
The servers are physical machines with SSDs, 32 threads and 256 GB of RAM.
The MongoDB config on each node is similar to this one:
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      configString: "session_max=102400"
      cacheSizeGB: 200
setParameter:
  cursorTimeoutMillis: 120000
operationProfiling:
  mode: slowOp
  slowOpThresholdMs: 300
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
net:
  port: 27017
  bindIp: 0.0.0.0
  maxIncomingConnections: 102400
replication:
  replSetName: rsmmhad03
sharding:
  clusterRole: shardsvr
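To put the numbers in this config side by side, here is a rough memory budget as a small Python sketch. The RAM, cache size and connection figures come from this report (the 32627 figure is the "connections now open" count in the log excerpt further down). The 1 MB of stack per connection thread is an assumption used as a worst-case upper bound, not a measured value, and the helper names are made up for illustration.

```python
# Rough memory budget for one shard node, using figures from the report.
RAM_GB = 256                   # physical RAM per server
WT_CACHE_GB = 200              # storage.wiredTiger.engineConfig.cacheSizeGB
MAX_CONNECTIONS = 102400       # net.maxIncomingConnections
OBSERVED_CONNECTIONS = 32627   # "connections now open" when the crash happened
STACK_MB_PER_CONN = 1.0        # ASSUMPTION: worst-case stack per connection thread


def connection_stack_gb(connections, stack_mb=STACK_MB_PER_CONN):
    """Upper bound, in GB, on stack memory for one thread per connection."""
    return connections * stack_mb / 1024


def headroom_gb(connections):
    """RAM left after the WiredTiger cache and the connection thread stacks."""
    return RAM_GB - WT_CACHE_GB - connection_stack_gb(connections)


if __name__ == "__main__":
    for label, count in (("observed", OBSERVED_CONNECTIONS),
                         ("configured max", MAX_CONNECTIONS)):
        print(f"{label}: {count} connections, "
              f"stacks <= {connection_stack_gb(count):.1f} GB, "
              f"headroom {headroom_gb(count):.1f} GB")
```

Under that assumption, the configured connection maximum alone would exceed what is left after the 200 GB cache, which may be worth keeping in mind when reading the tcmalloc allocation failures.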
sysctl file:
net.ipv4.ip_local_port_range = 1024 65535
kernel.shmmax = 1073741824
fs.file-max=5000000
vm.swappiness = 1
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096
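The sysctl file above sets none of the kernel tunables that, as far as I know, can cap thread creation system-wide (`kernel.threads-max`, `kernel.pid_max`, and possibly `vm.max_map_count`). A minimal Python sketch for reading their current values follows; the function name and the choice of tunables are my own, and whether any of them is the actual limit being hit here is unverified.

```python
from pathlib import Path

# Tunables that may limit how many threads can exist (assumed relevant here).
THREAD_TUNABLES = ("kernel/threads-max", "kernel/pid_max", "vm/max_map_count")


def read_tunables(names=THREAD_TUNABLES, root="/proc/sys"):
    """Return {sysctl_name: int or None} read from files under `root`."""
    values = {}
    for name in names:
        key = name.replace("/", ".")
        try:
            values[key] = int((Path(root) / name).read_text().split()[0])
        except (OSError, ValueError, IndexError):
            values[key] = None  # missing, unreadable, or not a number
    return values


if __name__ == "__main__":
    for key, value in read_tunables().items():
        print(f"{key} = {value}")
```

Comparing these values with the connection count at the time of the crash would show whether a kernel-wide limit, rather than a per-user one, is in play.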
/etc/security/limits.d/mongod.conf
mongod soft nproc 128000
mongod hard nproc 128000
mongod soft nofile 128000
mongod hard nofile 128000
/lib/systemd/system/mongod.service
[Unit]
Description=High-performance, schema-free document-oriented database
After=network.target
Documentation=https://docs.mongodb.org/manual

[Service]
User=mongodb
Group=mongodb
ExecStart=/usr/bin/numactl --interleave=all /usr/bin/mongod --config /etc/mongod.conf
PIDFile=/var/run/mongodb/mongod.pid
# file size
LimitFSIZE=infinity
# cpu time
LimitCPU=infinity
# virtual memory size
LimitAS=infinity
# open files
LimitNOFILE=128000
# processes/threads
LimitNPROC=128000
# locked memory
LimitMEMLOCK=infinity
# total threads (user+kernel)
TasksMax=infinity
TasksAccounting=false
# Recommended limits for for mongod as specified in
# http://docs.mongodb.org/manual/reference/ulimit/#recommended-settings

[Install]
WantedBy=multi-user.target
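Since limits are set in two places above (limits.d for user `mongod`, and the unit file, which runs as `mongodb`), it can help to check what the running process actually received. As far as I know, limits.d is applied by pam_limits to login sessions, so a service started by systemd would take its limits from the unit file. A minimal Python sketch that parses `/proc/<pid>/limits` is below; the function names are hypothetical.

```python
import re


def parse_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}."""
    limits = {}
    for line in text.splitlines()[1:]:            # first line is the header
        fields = re.split(r"\s{2,}", line.strip())  # columns are space-padded
        if len(fields) < 3:
            continue
        limits[fields[0]] = (fields[1], fields[2])
    return limits


def read_limits(pid):
    """Read and parse the effective limits of a running process."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_limits(fh.read())
```

For example, `read_limits(pid)["Max processes"]` with the mongod PID (from `/var/run/mongodb/mongod.pid`) should return `('128000', '128000')` if the unit's `LimitNPROC` took effect.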
The cluster holds millions of documents and serves more than 100,000,000 queries per day.
The problem is that, at random, we get errors like the following:
2018-07-17T15:57:17.978+0200 I - [thread1] pthread_create failed: Resource temporarily unavailable
2018-07-17T15:57:17.978+0200 I - [thread1] failed to create service entry worker thread for 10.3.16.1:56153
2018-07-17T15:57:17.978+0200 I COMMAND [conn16910] command had.hadCompressed command: find { find: "hadCompressed", filter: { chkin: "2018-08-10", n: 4, occ: "1::3-0/", nid: { $in: [ 0, 30115 ] }, rtype: { $in: [ 1, null ] }, hid: { $in: [ 435179, 231562, 38468, 330644, 307226, 359353, 352215, 88059, 321458, 307181, 85590, 87268, 385303, 252432, 242030, 231596, 307182, 172732, 577889, 38743, 38621, 199946, 435167, 149852, 244963, 391702, 260891, 150236, 307227, 307202, 38730, 156100, 297051, 257466, 498152, 174201, 174250, 577903, 424804, 435152, 197357, 242026, 385251, 205997, 330638, 154974, 37600, 38021, 160751, 435137, 86520, 37217, 363892, 375650, 244960, 252441, 261988, 432659, 609717, 156152, 363893, 149696, 149490, 232726, 87413, 252958, 315863, 219739, 231563, 388212, 412850, 501130, 388772, 231607, 369178, 164246, 38029, 330636, 260877, 38156, 236389, 38068, 257418, 282221, 307186, 299255, 199164, 231575, 88191, 199162, 80373, 200283, 246961, 195476, 424809, 286709, 193058, 208323, 435142, 318242 ] }, lchg: { $gte: new Date(1531749437000) } }, shardVersion: [ Timestamp 22129000|0, ObjectId('5af1c64abeee30df3be9f7db') ] } planSummary: IXSCAN { chkin: 1, n: 1, occ: 1, nid: 1, rtype: 1, hid: 1 } keysExamined:117 docsExamined:41 cursorExhausted:1 numYields:1 nreturned:0 reslen:202 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_command 547ms
2018-07-17T15:57:17.978+0200 I NETWORK [thread1] connection accepted from 10.3.102.1:53260 #42127 (32627 connections now open)
2018-07-17T15:57:17.978+0200 I - [thread1] pthread_create failed: Resource temporarily unavailable
2018-07-17T15:57:17.978+0200 I - [thread1] failed to create service entry worker thread for 10.3.102.1:53260
2018-07-17T15:57:17.978+0200 I NETWORK [thread1] connection accepted from 10.3.9.1:47587 #42128 (32627 connections now open)
2018-07-17T15:57:17.978+0200 F - [conn14595] Got signal: 6 (Aborted).
0x562cd6379171 0x562cd6378389 0x562cd637886d 0x7f49ce038890 0x7f49cdcb3067 0x7f49cdcb4448 0x562cd561a341 0x562cd607e01b 0x562cd607edf0 0x562cd607b18d 0x562cd607bccd 0x562cd607bf30 0x562cd6056ef7 0x562cd5a64478 0x562cd5994b68 0x562cd599508f 0x562cd59a55c3 0x562cd5983d0e 0x562cd59a55c3 0x562cd59b56e7 0x562cd59a55c3 0x562cd5977338 0x562cd5cae7a2 0x562cd5cb0b48 0x562cd5cb17fc 0x562cd5c6ac42 0x562cd5c6b79b 0x562cd58917a0 0x562cd58689af 0x562cd586a0aa 0x562cd5e85480 0x562cd5a89540 0x562cd568a97d 0x562cd568b2ad 0x562cd62df0d1 0x7f49ce031064 0x7f49cdd6662d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"562CD4DFE000","o":"157B171","s":"_ZN5mongo15printStackTraceERSo"},{"b":"562CD4DFE000","o":"157A389"},{"b":"562CD4DFE000","o":"157A86D"},{"b":"7F49CE029000","o":"F890"},{"b":"7F49CDC7E000","o":"35067","s":"gsignal"},{"b":"7F49CDC7E000","o":"36448","s":"abort"},{"b":"562CD4DFE000","o":"81C341","s":"_ZN5mongo25fassertFailedWithLocationEiPKcj"},{"b":"562CD4DFE000","o":"128001B","s":"_ZN5mongo17WiredTigerSessionC1EP15__wt_connectionPNS_22WiredTigerSessionCacheEmm"},{"b":"562CD4DFE000","o":"1280DF0","s":"_ZN5mongo22WiredTigerSessionCache10getSessionEv"},{"b":"562CD4DFE000","o":"127D18D"},{"b":"562CD4DFE000","o":"127DCCD","s":"_ZN5mongo22WiredTigerRecoveryUnit8_txnOpenEPNS_16OperationContextE"},{"b":"562CD4DFE000","o":"127DF30","s":"_ZN5mongo16WiredTigerCursorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmbPNS_16OperationContextE"},{"b":"562CD4DFE000","o":"1258EF7","s":"_ZNK5mongo23WiredTigerIndexStandard9newCursorEPNS_16OperationContextEb"},{"b":"562CD4DFE000","o":"C66478","s":"_ZNK5mongo17IndexAccessMethod9newCursorEPNS_16OperationContextEb"},{"b":"562CD4DFE000","o":"B96B68","s":"_ZN5mongo9IndexScan13initIndexScanEv"},{"b":"562CD4DFE000","o":"B9708F","s":"_ZN5mongo9IndexScan6doWorkEPm"},{"b":"562CD4DFE000","o":"BA75C3","s":"_ZN5mongo9PlanStage4workEPm"},{"b":"562CD4DFE000","o":"B85D0E","s":"_ZN5mongo10FetchStage6doWorkEPm"},{"b":"562CD4DFE000","o":"BA75C3","s":"_ZN5mongo9PlanStage4workEPm"},{"b":"562CD4DFE000","o":"BB76E7","s":"_ZN5mongo16ShardFilterStage6doWorkEPm"},{"b":"562CD4DFE000","o":"BA75C3","s":"_ZN5mongo9PlanStage4workEPm"},{"b":"562CD4DFE000","o":"B79338","s":"_ZN5mongo15CachedPlanStage12pickBestPlanEPNS_15PlanYieldPolicyE"},{"b":"562CD4DFE000","o":"EB07A2","s":"_ZN5mongo12PlanExecutor12pickBestPlanENS0_11YieldPolicyEPKNS_10CollectionE"},{"b":"562CD4DFE000","o":"EB2B48","s":"_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionE
S5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_11YieldPolicyE"},{"b":"562CD4DFE000","o":"EB37FC","s":"_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionENS0_11YieldPolicyE"},{"b":"562CD4DFE000","o":"E6CC42","s":"_ZN5mongo11getExecutorEPNS_16OperationContextEPNS_10CollectionESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS5_EENS_12PlanExecutor11YieldPolicyEm"},{"b":"562CD4DFE000","o":"E6D79B","s":"_ZN5mongo15getExecutorFindEPNS_16OperationContextEPNS_10CollectionERKNS_15NamespaceStringESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS8_EENS_12PlanExecutor11YieldPolicyE"},{"b":"562CD4DFE000","o":"A937A0","s":"_ZN5mongo7FindCmd3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_7BSONObjEiRS8_RNS_14BSONObjBuilderE"},{"b":"562CD4DFE000","o":"A6A9AF","s":"_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE"},{"b":"562CD4DFE000","o":"A6C0AA","s":"_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE"},{"b":"562CD4DFE000","o":"1087480","s":"_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE"},{"b":"562CD4DFE000","o":"C8B540","s":"_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE"},{"b":"562CD4DFE000","o":"88C97D","s":"_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE"},{"b":"562CD4DFE000","o":"88D2AD"},{"b":"562CD4DFE000","o":"14E10D1"},{"b":"7F49CE029000","o":"8064"},{"b":"7F49CDC7E000","o":"E862D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.16", "gitVersion" : "0d6a9242c11b99ddadcfb6e86a850b6ba487530a", "compiledModules" : [], 
"uname" : { "sysname" : "Linux", "release" : "3.16.0-6-amd64", "version" : "#1 SMP Debian 3.16.56-1+deb8u1 (2018-05-08)", "machine" : "x86_64" }, "somap" : [ { "b" : "562CD4DFE000", "elfType" : 3, "buildId" : "36452F27FE7A41D0E57DDE38A17B3FAE9980B0BE" }, { "b" : "7FFD853E8000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "90F495E259305E7C4F498541D91C9E1240057F52" }, { "b" : "7F49CEF66000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "EDE40F0BC2115063088BF442E0F2ED84BF76B11E" }, { "b" : "7F49CEB69000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "0C9DA403601A5EEA627AF96E1EB63DD22B8DC28B" }, { "b" : "7F49CE961000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "A63C95FB33CCA970E141D2E13774B997C1CF0565" }, { "b" : "7F49CE75D000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D70B531D672A34D71DB42EB32B68E63F2DCC5B6A" }, { "b" : "7F49CE45C000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "152C93BA3E8590F7ED0BCDDF868600D55EC4DD6F" }, { "b" : "7F49CE246000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "BAC839560495859598E8515CBAED73C7799AE1FF" }, { "b" : "7F49CE029000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9DA9387A60FFC196AEDB9526275552AFEF499C44" }, { "b" : "7F49CDC7E000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "48C48BC6ABB794461B8A558DD76B29876A0551F0" }, { "b" : "7F49CF1C7000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "1D98D41FBB1EABA7EC05D0FD7624B85D6F51C03C" } ] }}
mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x562cd6379171]
mongod(+0x157A389) [0x562cd6378389]
mongod(+0x157A86D) [0x562cd637886d]
libpthread.so.0(+0xF890) [0x7f49ce038890]
libc.so.6(gsignal+0x37) [0x7f49cdcb3067]
libc.so.6(abort+0x148) [0x7f49cdcb4448]
mongod(_ZN5mongo25fassertFailedWithLocationEiPKcj+0x0) [0x562cd561a341]
mongod(_ZN5mongo17WiredTigerSessionC1EP15__wt_connectionPNS_22WiredTigerSessionCacheEmm+0xBB) [0x562cd607e01b]
mongod(_ZN5mongo22WiredTigerSessionCache10getSessionEv+0xE0) [0x562cd607edf0]
mongod(+0x127D18D) [0x562cd607b18d]
mongod(_ZN5mongo22WiredTigerRecoveryUnit8_txnOpenEPNS_16OperationContextE+0x19D) [0x562cd607bccd]
mongod(_ZN5mongo16WiredTigerCursorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmbPNS_16OperationContextE+0x90) [0x562cd607bf30]
mongod(_ZNK5mongo23WiredTigerIndexStandard9newCursorEPNS_16OperationContextEb+0x157) [0x562cd6056ef7]
mongod(_ZNK5mongo17IndexAccessMethod9newCursorEPNS_16OperationContextEb+0x28) [0x562cd5a64478]
mongod(_ZN5mongo9IndexScan13initIndexScanEv+0x58) [0x562cd5994b68]
mongod(_ZN5mongo9IndexScan6doWorkEPm+0x14F) [0x562cd599508f]
mongod(_ZN5mongo9PlanStage4workEPm+0x63) [0x562cd59a55c3]
mongod(_ZN5mongo10FetchStage6doWorkEPm+0x29E) [0x562cd5983d0e]
mongod(_ZN5mongo9PlanStage4workEPm+0x63) [0x562cd59a55c3]
mongod(_ZN5mongo16ShardFilterStage6doWorkEPm+0x77) [0x562cd59b56e7]
mongod(_ZN5mongo9PlanStage4workEPm+0x63) [0x562cd59a55c3]
mongod(_ZN5mongo15CachedPlanStage12pickBestPlanEPNS_15PlanYieldPolicyE+0x198) [0x562cd5977338]
mongod(_ZN5mongo12PlanExecutor12pickBestPlanENS0_11YieldPolicyEPKNS_10CollectionE+0xF2) [0x562cd5cae7a2]
mongod(_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_11YieldPolicyE+0x2D8) [0x562cd5cb0b48]
mongod(_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionENS0_11YieldPolicyE+0xEC) [0x562cd5cb17fc]
mongod(_ZN5mongo11getExecutorEPNS_16OperationContextEPNS_10CollectionESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS5_EENS_12PlanExecutor11YieldPolicyEm+0x132) [0x562cd5c6ac42]
mongod(_ZN5mongo15getExecutorFindEPNS_16OperationContextEPNS_10CollectionERKNS_15NamespaceStringESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS8_EENS_12PlanExecutor11YieldPolicyE+0x8B) [0x562cd5c6b79b]
mongod(_ZN5mongo7FindCmd3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_7BSONObjEiRS8_RNS_14BSONObjBuilderE+0xC90) [0x562cd58917a0]
mongod(_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE+0x4FF) [0x562cd58689af]
mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE+0xF6A) [0x562cd586a0aa]
mongod(_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE+0x240) [0x562cd5e85480]
mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xD30) [0x562cd5a89540]
mongod(_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE+0x1FD) [0x562cd568a97d]
mongod(+0x88D2AD) [0x562cd568b2ad]
mongod(+0x14E10D1) [0x562cd62df0d1]
libpthread.so.0(+0x8064) [0x7f49ce031064]
libc.so.6(clone+0x6D) [0x7f49cdd6662d]
----- END BACKTRACE -----
2018-07-17T15:57:17.978+0200 I - [thread1] pthread_create failed: Resource temporarily unavailable
2018-07-17T15:57:17.978+0200 I - [thread1] failed to create service entry worker thread for 10.3.9.1:47587
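To see how connection counts line up with the thread-creation failures across a whole log file, something like the following Python sketch could be used. The regular expression is written against the "connection accepted" lines shown above; the function name and the returned keys are made up for illustration.

```python
import re

# Matches e.g. "connection accepted from 10.3.102.1:53260 #42127 (32627 connections now open)"
CONN_RE = re.compile(
    r"connection accepted from \S+ #\d+ \((\d+) connections? now open\)"
)
FAIL_MARK = "pthread_create failed"


def summarize(lines):
    """Return the peak open-connection count and the number of pthread_create failures."""
    peak = 0
    failures = 0
    for line in lines:
        match = CONN_RE.search(line)
        if match:
            peak = max(peak, int(match.group(1)))
        if FAIL_MARK in line:
            failures += 1
    return {"peak_connections": peak, "thread_create_failures": failures}
```

It accepts any iterable of lines, so it can be fed an open file handle on `/var/log/mongodb/mongod.log` directly.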
In syslog we get the following:
Jul 17 15:57:15 mmhad03b kernel: [78725.202597] TCP: TCP: Possible SYN flooding on port 27017. Sending cookies. Check SNMP counters.
Jul 17 15:57:40 mmhad03b systemd[1]: mongod.service: main process exited, code=killed, status=6/ABRT
Jul 17 15:57:40 mmhad03b systemd[1]: Unit mongod.service entered failed state.
At random, we also get this error in syslog:
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 8192
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 8192
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 12288
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 8192
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 12288
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 8192
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 8192
We have raised all the server limits and applied them, but there does not appear to be any improvement.
The MongoDB version is 3.4.16 on the shards and also on mongos.
I'm attaching diagnostic data as well.