- Type: Bug
- Resolution: Done
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Sharding, WiredTiger
- Labels: None
- ALL
Hi,
We have a sharded cluster of 8 servers, organized as 4 replica sets with this structure:
- replicaset1: server01a / server01b
- replicaset2: server02a / server02b
- replicaset3: server03a / server03b
- replicaset4: server04a / server04b
The servers are physical machines with SSDs, 32 threads, and 256 GB of RAM.
The mongodb config on each node is similar to this one:
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      configString: "session_max=102400"
      cacheSizeGB: 200
setParameter:
  cursorTimeoutMillis: 120000
operationProfiling:
  mode: slowOp
  slowOpThresholdMs: 300
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
net:
  port: 27017
  bindIp: 0.0.0.0
  maxIncomingConnections: 102400
replication:
  replSetName: rsmmhad03
sharding:
  clusterRole: shardsvr
sysctl file:
net.ipv4.ip_local_port_range = 1024 65535
kernel.shmmax = 1073741824
fs.file-max=5000000
vm.swappiness = 1
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
net.core.somaxconn = 4096
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_intvl = 30
net.ipv4.tcp_keepalive_time = 120
net.ipv4.tcp_max_syn_backlog = 4096
/etc/security/limits.d/mongod.conf
mongod soft nproc 128000
mongod hard nproc 128000
mongod soft nofile 128000
mongod hard nofile 128000
/lib/systemd/system/mongod.service
[Unit]
Description=High-performance, schema-free document-oriented database
After=network.target
Documentation=https://docs.mongodb.org/manual

[Service]
User=mongodb
Group=mongodb
ExecStart=/usr/bin/numactl --interleave=all /usr/bin/mongod --config /etc/mongod.conf
PIDFile=/var/run/mongodb/mongod.pid
# file size
LimitFSIZE=infinity
# cpu time
LimitCPU=infinity
# virtual memory size
LimitAS=infinity
# open files
LimitNOFILE=128000
# processes/threads
LimitNPROC=128000
# locked memory
LimitMEMLOCK=infinity
# total threads (user+kernel)
TasksMax=infinity
TasksAccounting=false

# Recommended limits for mongod as specified in
# http://docs.mongodb.org/manual/reference/ulimit/#recommended-settings

[Install]
WantedBy=multi-user.target
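Since mongod is started by systemd here, the limits the process actually runs with should be the ones from the unit file; entries under /etc/security/limits.d are applied through PAM sessions and generally do not reach systemd services. One way to confirm what the running process really has is to read /proc/<pid>/limits. The following is a minimal sketch, not part of the original setup; the function name `parse_proc_limits` is hypothetical.

```python
import re


def parse_proc_limits(text):
    """Parse the contents of /proc/<pid>/limits into {name: (soft, hard)}.

    Values are kept as strings because a limit may be "unlimited".
    """
    limits = {}
    for line in text.splitlines():
        # Columns are separated by runs of two or more spaces.
        parts = re.split(r"\s{2,}", line.strip())
        if len(parts) < 3 or parts[0] == "Limit":
            continue  # skip blank lines and the header row
        limits[parts[0]] = (parts[1], parts[2])
    return limits
```

For example, passing the text of /proc/<pid of mongod>/limits and looking up "Max processes" and "Max open files" shows whether the 128000 values configured above are in effect for the running mongod.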
The cluster holds millions of documents and serves millions of queries (more than 100,000,000 queries per day).
The problem is that, at random times, we get an error like the following:
2018-07-17T15:57:17.978+0200 I - [thread1] pthread_create failed: Resource temporarily unavailable
2018-07-17T15:57:17.978+0200 I - [thread1] failed to create service entry worker thread for 10.3.16.1:56153
2018-07-17T15:57:17.978+0200 I COMMAND [conn16910] command had.hadCompressed command: find { find: "hadCompressed", filter: { chkin: "2018-08-10", n: 4, occ: "1::3-0/", nid: { $in: [ 0, 30115 ] }, rtype: { $in: [ 1, null ] }, hid: { $in: [ 435179, 231562, 38468, 330644, 307226, 359353, 352215, 88059, 321458, 307181, 85590, 87268, 385303, 252432, 242030, 231596, 307182, 172732, 577889, 38743, 38621, 199946, 435167, 149852, 244963, 391702, 260891, 150236, 307227, 307202, 38730, 156100, 297051, 257466, 498152, 174201, 174250, 577903, 424804, 435152, 197357, 242026, 385251, 205997, 330638, 154974, 37600, 38021, 160751, 435137, 86520, 37217, 363892, 375650, 244960, 252441, 261988, 432659, 609717, 156152, 363893, 149696, 149490, 232726, 87413, 252958, 315863, 219739, 231563, 388212, 412850, 501130, 388772, 231607, 369178, 164246, 38029, 330636, 260877, 38156, 236389, 38068, 257418, 282221, 307186, 299255, 199164, 231575, 88191, 199162, 80373, 200283, 246961, 195476, 424809, 286709, 193058, 208323, 435142, 318242 ] }, lchg: { $gte: new Date(1531749437000) } }, shardVersion: [ Timestamp 22129000|0, ObjectId('5af1c64abeee30df3be9f7db') ] } planSummary: IXSCAN { chkin: 1, n: 1, occ: 1, nid: 1, rtype: 1, hid: 1 } keysExamined:117 docsExamined:41 cursorExhausted:1 numYields:1 nreturned:0 reslen:202 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } protocol:op_command 547ms
2018-07-17T15:57:17.978+0200 I NETWORK [thread1] connection accepted from 10.3.102.1:53260 #42127 (32627 connections now open)
2018-07-17T15:57:17.978+0200 I - [thread1] pthread_create failed: Resource temporarily unavailable
2018-07-17T15:57:17.978+0200 I - [thread1] failed to create service entry worker thread for 10.3.102.1:53260
2018-07-17T15:57:17.978+0200 I NETWORK [thread1] connection accepted from 10.3.9.1:47587 #42128 (32627 connections now open) 2018-07-17T15:57:17.978+0200 F - [conn14595] Got signal: 6 (Aborted). 0x562cd6379171 0x562cd6378389 0x562cd637886d 0x7f49ce038890 0x7f49cdcb3067 0x7f49cdcb4448 0x562cd561a341 0x562cd607e01b 0x562cd607edf0 0x562cd607b18d 0x562cd607bccd 0x562cd607bf30 0x562cd6056ef7 0x562cd5a64478 0x562cd5994b68 0x562cd599508f 0x562cd59a55c3 0x562cd5983d0e 0x562cd59a55c3 0x562cd59b56e7 0x562cd59a55c3 0x562cd5977338 0x562cd5cae7a2 0x562cd5cb0b48 0x562cd5cb17fc 0x562cd5c6ac42 0x562cd5c6b79b 0x562cd58917a0 0x562cd58689af 0x562cd586a0aa 0x562cd5e85480 0x562cd5a89540 0x562cd568a97d 0x562cd568b2ad 0x562cd62df0d1 0x7f49ce031064 0x7f49cdd6662d ----- BEGIN BACKTRACE ----- {"backtrace":[{"b":"562CD4DFE000","o":"157B171","s":"_ZN5mongo15printStackTraceERSo"},{"b":"562CD4DFE000","o":"157A389"},{"b":"562CD4DFE000","o":"157A86D"},{"b":"7F49CE029000","o":"F890"},{"b":"7F49CDC7E000","o":"35067","s":"gsignal"},{"b":"7F49CDC7E000","o":"36448","s":"abort"},{"b":"562CD4DFE000","o":"81C341","s":"_ZN5mongo25fassertFailedWithLocationEiPKcj"},{"b":"562CD4DFE000","o":"128001B","s":"_ZN5mongo17WiredTigerSessionC1EP15__wt_connectionPNS_22WiredTigerSessionCacheEmm"},{"b":"562CD4DFE000","o":"1280DF0","s":"_ZN5mongo22WiredTigerSessionCache10getSessionEv"},{"b":"562CD4DFE000","o":"127D18D"},{"b":"562CD4DFE000","o":"127DCCD","s":"_ZN5mongo22WiredTigerRecoveryUnit8_txnOpenEPNS_16OperationContextE"},{"b":"562CD4DFE000","o":"127DF30","s":"_ZN5mongo16WiredTigerCursorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmbPNS_16OperationContextE"},{"b":"562CD4DFE000","o":"1258EF7","s":"_ZNK5mongo23WiredTigerIndexStandard9newCursorEPNS_16OperationContextEb"},{"b":"562CD4DFE000","o":"C66478","s":"_ZNK5mongo17IndexAccessMethod9newCursorEPNS_16OperationContextEb"},{"b":"562CD4DFE000","o":"B96B68","s":"_ZN5mongo9IndexScan13initIndexScanEv"},{"b":"562CD4DFE000","o":"B9708F","s":"_ZN5mongo9IndexScan6do
WorkEPm"},{"b":"562CD4DFE000","o":"BA75C3","s":"_ZN5mongo9PlanStage4workEPm"},{"b":"562CD4DFE000","o":"B85D0E","s":"_ZN5mongo10FetchStage6doWorkEPm"},{"b":"562CD4DFE000","o":"BA75C3","s":"_ZN5mongo9PlanStage4workEPm"},{"b":"562CD4DFE000","o":"BB76E7","s":"_ZN5mongo16ShardFilterStage6doWorkEPm"},{"b":"562CD4DFE000","o":"BA75C3","s":"_ZN5mongo9PlanStage4workEPm"},{"b":"562CD4DFE000","o":"B79338","s":"_ZN5mongo15CachedPlanStage12pickBestPlanEPNS_15PlanYieldPolicyE"},{"b":"562CD4DFE000","o":"EB07A2","s":"_ZN5mongo12PlanExecutor12pickBestPlanENS0_11YieldPolicyEPKNS_10CollectionE"},{"b":"562CD4DFE000","o":"EB2B48","s":"_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_11YieldPolicyE"},{"b":"562CD4DFE000","o":"EB37FC","s":"_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionENS0_11YieldPolicyE"},{"b":"562CD4DFE000","o":"E6CC42","s":"_ZN5mongo11getExecutorEPNS_16OperationContextEPNS_10CollectionESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS5_EENS_12PlanExecutor11YieldPolicyEm"},{"b":"562CD4DFE000","o":"E6D79B","s":"_ZN5mongo15getExecutorFindEPNS_16OperationContextEPNS_10CollectionERKNS_15NamespaceStringESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS8_EENS_12PlanExecutor11YieldPolicyE"},{"b":"562CD4DFE000","o":"A937A0","s":"_ZN5mongo7FindCmd3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_7BSONObjEiRS8_RNS_14BSONObjBuilderE"},{"b":"562CD4DFE000","o":"A6A9AF","s":"_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE"},{"b":"562CD4DFE000","o":"A6C0AA","s":"_ZN5mongo7Command11execCommandEPNS_
16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE"},{"b":"562CD4DFE000","o":"1087480","s":"_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE"},{"b":"562CD4DFE000","o":"C8B540","s":"_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE"},{"b":"562CD4DFE000","o":"88C97D","s":"_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE"},{"b":"562CD4DFE000","o":"88D2AD"},{"b":"562CD4DFE000","o":"14E10D1"},{"b":"7F49CE029000","o":"8064"},{"b":"7F49CDC7E000","o":"E862D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.16", "gitVersion" : "0d6a9242c11b99ddadcfb6e86a850b6ba487530a", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.16.0-6-amd64", "version" : "#1 SMP Debian 3.16.56-1+deb8u1 (2018-05-08)", "machine" : "x86_64" }, "somap" : [ { "b" : "562CD4DFE000", "elfType" : 3, "buildId" : "36452F27FE7A41D0E57DDE38A17B3FAE9980B0BE" }, { "b" : "7FFD853E8000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "90F495E259305E7C4F498541D91C9E1240057F52" }, { "b" : "7F49CEF66000", "path" : "/usr/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "EDE40F0BC2115063088BF442E0F2ED84BF76B11E" }, { "b" : "7F49CEB69000", "path" : "/usr/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "0C9DA403601A5EEA627AF96E1EB63DD22B8DC28B" }, { "b" : "7F49CE961000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "A63C95FB33CCA970E141D2E13774B997C1CF0565" }, { "b" : "7F49CE75D000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D70B531D672A34D71DB42EB32B68E63F2DCC5B6A" }, { "b" : "7F49CE45C000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "152C93BA3E8590F7ED0BCDDF868600D55EC4DD6F" }, { "b" : "7F49CE246000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : 
"BAC839560495859598E8515CBAED73C7799AE1FF" }, { "b" : "7F49CE029000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9DA9387A60FFC196AEDB9526275552AFEF499C44" }, { "b" : "7F49CDC7E000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "48C48BC6ABB794461B8A558DD76B29876A0551F0" }, { "b" : "7F49CF1C7000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "1D98D41FBB1EABA7EC05D0FD7624B85D6F51C03C" } ] }} mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x562cd6379171] mongod(+0x157A389) [0x562cd6378389] mongod(+0x157A86D) [0x562cd637886d] libpthread.so.0(+0xF890) [0x7f49ce038890] libc.so.6(gsignal+0x37) [0x7f49cdcb3067] libc.so.6(abort+0x148) [0x7f49cdcb4448] mongod(_ZN5mongo25fassertFailedWithLocationEiPKcj+0x0) [0x562cd561a341] mongod(_ZN5mongo17WiredTigerSessionC1EP15__wt_connectionPNS_22WiredTigerSessionCacheEmm+0xBB) [0x562cd607e01b] mongod(_ZN5mongo22WiredTigerSessionCache10getSessionEv+0xE0) [0x562cd607edf0] mongod(+0x127D18D) [0x562cd607b18d] mongod(_ZN5mongo22WiredTigerRecoveryUnit8_txnOpenEPNS_16OperationContextE+0x19D) [0x562cd607bccd] mongod(_ZN5mongo16WiredTigerCursorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmbPNS_16OperationContextE+0x90) [0x562cd607bf30] mongod(_ZNK5mongo23WiredTigerIndexStandard9newCursorEPNS_16OperationContextEb+0x157) [0x562cd6056ef7] mongod(_ZNK5mongo17IndexAccessMethod9newCursorEPNS_16OperationContextEb+0x28) [0x562cd5a64478] mongod(_ZN5mongo9IndexScan13initIndexScanEv+0x58) [0x562cd5994b68] mongod(_ZN5mongo9IndexScan6doWorkEPm+0x14F) [0x562cd599508f] mongod(_ZN5mongo9PlanStage4workEPm+0x63) [0x562cd59a55c3] mongod(_ZN5mongo10FetchStage6doWorkEPm+0x29E) [0x562cd5983d0e] mongod(_ZN5mongo9PlanStage4workEPm+0x63) [0x562cd59a55c3] mongod(_ZN5mongo16ShardFilterStage6doWorkEPm+0x77) [0x562cd59b56e7] mongod(_ZN5mongo9PlanStage4workEPm+0x63) [0x562cd59a55c3] mongod(_ZN5mongo15CachedPlanStage12pickBestPlanEPNS_15PlanYieldPolicyE+0x198) [0x562cd5977338] 
mongod(_ZN5mongo12PlanExecutor12pickBestPlanENS0_11YieldPolicyEPKNS_10CollectionE+0xF2) [0x562cd5cae7a2] mongod(_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEENS0_11YieldPolicyE+0x2D8) [0x562cd5cb0b48] mongod(_ZN5mongo12PlanExecutor4makeEPNS_16OperationContextESt10unique_ptrINS_10WorkingSetESt14default_deleteIS4_EES3_INS_9PlanStageES5_IS8_EES3_INS_13QuerySolutionES5_ISB_EES3_INS_14CanonicalQueryES5_ISE_EEPKNS_10CollectionENS0_11YieldPolicyE+0xEC) [0x562cd5cb17fc] mongod(_ZN5mongo11getExecutorEPNS_16OperationContextEPNS_10CollectionESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS5_EENS_12PlanExecutor11YieldPolicyEm+0x132) [0x562cd5c6ac42] mongod(_ZN5mongo15getExecutorFindEPNS_16OperationContextEPNS_10CollectionERKNS_15NamespaceStringESt10unique_ptrINS_14CanonicalQueryESt14default_deleteIS8_EENS_12PlanExecutor11YieldPolicyE+0x8B) [0x562cd5c6b79b] mongod(_ZN5mongo7FindCmd3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERNS_7BSONObjEiRS8_RNS_14BSONObjBuilderE+0xC90) [0x562cd58917a0] mongod(_ZN5mongo7Command3runEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS3_21ReplyBuilderInterfaceE+0x4FF) [0x562cd58689af] mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_RKNS_3rpc16RequestInterfaceEPNS4_21ReplyBuilderInterfaceE+0xF6A) [0x562cd586a0aa] mongod(_ZN5mongo11runCommandsEPNS_16OperationContextERKNS_3rpc16RequestInterfaceEPNS2_21ReplyBuilderInterfaceE+0x240) [0x562cd5e85480] mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xD30) [0x562cd5a89540] mongod(_ZN5mongo23ServiceEntryPointMongod12_sessionLoopERKSt10shared_ptrINS_9transport7SessionEE+0x1FD) [0x562cd568a97d] mongod(+0x88D2AD) [0x562cd568b2ad] mongod(+0x14E10D1) [0x562cd62df0d1] 
libpthread.so.0(+0x8064) [0x7f49ce031064]
libc.so.6(clone+0x6D) [0x7f49cdd6662d]
----- END BACKTRACE -----
2018-07-17T15:57:17.978+0200 I - [thread1] pthread_create failed: Resource temporarily unavailable
2018-07-17T15:57:17.978+0200 I - [thread1] failed to create service entry worker thread for 10.3.9.1:47587
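To see how the thread-creation failures line up with the number of open connections, the mongod log can be summarized with a short script. This is only a sketch that relies on the two message formats visible in the excerpt above ("pthread_create failed" and "(N connections now open)"); the function name `summarize_log` is hypothetical.

```python
import re

# Matches the connection counter that mongod prints on each accepted connection.
CONN_RE = re.compile(r"\((\d+) connections? now open\)")


def summarize_log(text):
    """Count pthread_create failures and find the peak open-connection count.

    Works on raw log text, whether or not entries are one per line.
    """
    counts = [int(n) for n in CONN_RE.findall(text)]
    return {
        "pthread_create_failures": text.count("pthread_create failed"),
        "peak_connections": max(counts) if counts else 0,
    }
```

Run against the contents of /var/log/mongodb/mongod.log, this gives a rough picture of the connection count at which the failures start to appear.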
In syslog we see the following:
Jul 17 15:57:15 mmhad03b kernel: [78725.202597] TCP: TCP: Possible SYN flooding on port 27017. Sending cookies. Check SNMP counters.
Jul 17 15:57:40 mmhad03b systemd[1]: mongod.service: main process exited, code=killed, status=6/ABRT
Jul 17 15:57:40 mmhad03b systemd[1]: Unit mongod.service entered failed state.
At random times, we also get this error in syslog:
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 8192
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 8192
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 12288
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 8192
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 12288
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 8192
Jul 17 16:17:25 mmhad03b numactl[20402]: src/third_party/gperftools-2.5/src/central_freelist.cc:333] tcmalloc: allocation failed 8192
We have raised all the server limits and applied them, but there does not appear to be any improvement.
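Besides the per-user limits listed above, pthread_create can also return "Resource temporarily unavailable" when a system-wide ceiling is reached, such as kernel.threads-max or kernel.pid_max, and possibly vm.max_map_count, since each thread stack needs its own memory mappings. The sketch below compares a thread count against such ceilings. It is an illustration only: the function name `tight_limits`, the 80% margin, and the figure of two mappings per thread are assumptions, not values taken from this report.

```python
def tight_limits(thread_count, ceilings, mappings_per_thread=2, margin=0.8):
    """Return the names of kernel ceilings that thread_count is close to.

    ceilings maps sysctl names to integer values, e.g. as read from
    /proc/sys/kernel/threads-max, /proc/sys/kernel/pid_max and
    /proc/sys/vm/max_map_count.
    mappings_per_thread is a rough assumption (stack plus guard page).
    """
    demand = {
        "kernel.threads-max": thread_count,
        "kernel.pid_max": thread_count,
        "vm.max_map_count": thread_count * mappings_per_thread,
    }
    return sorted(
        name
        for name, need in demand.items()
        if name in ceilings and need >= margin * ceilings[name]
    )
```

With mongod creating one worker thread per connection, as the "service entry worker thread" messages suggest, the thread count is roughly the number of open connections, so the 32627 from the log is a reasonable first input.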
The MongoDB version is 3.4.16 on the shards and also on mongos.
I'm also attaching diagnostic data.