[SERVER-21323] Mongo Router frequently going down with a low limit for open files Created: 06/Nov/15  Updated: 19/Nov/15  Resolved: 17/Nov/15

Status: Closed
Project: Core Server
Component/s: Security
Affects Version/s: 3.0.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Debasish Nayak Assignee: Spencer Jackson
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Security C 11/20/15
Participants:

 Description   

I've a mongo cluster version: 3.0.6 with 3 config nodes, 2 data nodes and 1 router. The router frequently goes down almost 2 times a day with the below error:

2015-11-05T20:22:57.376+0000 I ACCESS   [conn1770]  authenticate db: azuga { authenticate: 1, user: "azugaapps", nonce: "xxx", key: "xxx" }
2015-11-05T20:22:59.135+0000 I SHARDING [LockPinger] cluster ec2-54-184-178-101.us-west-2.compute.amazonaws.com:27019,ec2-54-184-71-89.us-west-2.compute.amazonaws.com:
27019,ec2-54-212-135-86.us-west-2.compute.amazonaws.com:27019 pinged successfully at Thu Nov  5 20:22:58 2015 by distributed lock pinger 'ec2-54-184-178-101.us-west-2.
compute.amazonaws.com:27019,ec2-54-184-71-89.us-west-2.compute.amazonaws.com:27019,ec2-54-212-135-86.us-west-2.compute.amazonaws.com:27019/ip-10-83-19-157:27017:144671
3627:1804289383', sleeping for 30000ms
2015-11-05T20:23:01.074+0000 I NETWORK  [mongosMain] connection accepted from 10.217.139.52:53068 #1771 (434 connections now open)
2015-11-05T20:23:01.074+0000 I ACCESS   [conn1771]  authenticate db: azuga { authenticate: 1, user: "azugaapps", nonce: "xxx", key: "xxx" }
2015-11-05T20:23:01.083+0000 F -        [conn1771] Got signal: 6 (Aborted).
 
 0xa4c139 0xa4ba02 0xa4bdb6 0x7ff06b270d40 0x7ff06b270cc9 0x7ff06b2740d8 0x88ed69 0x660ced 0x6618f7 0x633707 0x65bf66 0x619d31 0x62077e 0x612d9d 0x68b598 0x960f18 0x60
a3cd 0x60b6be 0x60b8c2 0x9284db 0x99db5f 0x8c3a06 0x8c4693 0x8cc62a 0x95e42c 0x8e2fab 0x9703cd 0x95d961 0x5c8fd8 0x9fe60b 0x7ff06c86c182 0x7ff06b33447d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"64C139"},{"b":"400000","o":"64BA02"},{"b":"400000","o":"64BDB6"},{"b":"7FF06B23A000","o":"36D40"},{"b":"7FF06B23A000","o":"36CC9"},{"b
":"7FF06B23A000","o":"3A0D8"},{"b":"400000","o":"48ED69"},{"b":"400000","o":"260CED"},{"b":"400000","o":"2618F7"},{"b":"400000","o":"233707"},{"b":"400000","o":"25BF66
"},{"b":"400000","o":"219D31"},{"b":"400000","o":"22077E"},{"b":"400000","o":"212D9D"},{"b":"400000","o":"28B598"},{"b":"400000","o":"560F18"},{"b":"400000","o":"20A3C
D"},{"b":"400000","o":"20B6BE"},{"b":"400000","o":"20B8C2"},{"b":"400000","o":"5284DB"},{"b":"400000","o":"59DB5F"},{"b":"400000","o":"4C3A06"},{"b":"400000","o":"4C46
93"},{"b":"400000","o":"4CC62A"},{"b":"400000","o":"55E42C"},{"b":"400000","o":"4E2FAB"},{"b":"400000","o":"5703CD"},{"b":"400000","o":"55D961"},{"b":"400000","o":"1C8
FD8"},{"b":"400000","o":"5FE60B"},{"b":"7FF06C864000","o":"8182"},{"b":"7FF06B23A000","o":"FA47D"}],"processInfo":{ "mongodbVersion" : "3.0.6", "gitVersion" : "1ef45a2
3a4c5e3480ac919b28afcba3c615488f2", "uname" : { "sysname" : "Linux", "release" : "3.13.0-48-generic", "version" : "#80-Ubuntu SMP Thu Mar 12 11:16:15 UTC 2015", "machi
ne" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "D31506D475D532F90CB8C8BDBF0D2852FE47CA5A" }, { "b" : "7FFFDB6FD000", "elfType" : 3, "buildI
d" : "341443CC258F9F27709CAA36C8CB321E4BBFFF95" }, { "b" : "7FF06C864000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE4
44731BB0461202EF57F7C39542" }, { "b" : "7FF06C606000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "FF43D0947510134A8A494063A3C1CF3CEB
B27791" }, { "b" : "7FF06C22B000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "B927879B878D90DD9FF4B15B00E7799AA8E0272F" }, { "b" 
: "7FF06C023000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7FF06BE1F000", "path" 
: "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7FF06BB1B000", "path" : "/usr/lib/x86_64-linux-
gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }, { "b" : "7FF06B815000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfT
ype" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7FF06B5FF000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : 
"8D0AA71411580EE6C08809695C3984769F25725B" }, { "b" : "7FF06B23A000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2
B89E576D5AE213C" }, { "b" : "7FF06CA82000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongos(_ZN5mongo15printStackTraceERSo+0x29) [0xa4c139]
 mongos(+0x64BA02) [0xa4ba02]
 mongos(+0x64BDB6) [0xa4bdb6]
 libc.so.6(+0x36D40) [0x7ff06b270d40]
 libc.so.6(gsignal+0x39) [0x7ff06b270cc9]
 libc.so.6(abort+0x148) [0x7ff06b2740d8]
mongos(_ZN5mongo12SecureRandom6createEv+0x1B9) [0x88ed69]
 mongos(_ZN5mongo31SaslSCRAMSHA1ClientConversation10_firstStepEPSs+0x16D) [0x660ced]
 mongos(_ZN5mongo31SaslSCRAMSHA1ClientConversation4stepERKNS_10StringDataEPSs+0x247) [0x6618f7]
 mongos(_ZN5mongo23NativeSaslClientSession4stepERKNS_10StringDataEPSs+0x27) [0x633707]
 mongos(+0x25BF66) [0x65bf66]
 mongos(_ZN5mongo20DBClientWithCommands5_authERKNS_7BSONObjE+0x191) [0x619d31]
 mongos(_ZN5mongo18DBClientConnection5_authERKNS_7BSONObjE+0x16E) [0x62077e]
 mongos(_ZN5mongo20DBClientWithCommands4authERKNS_7BSONObjE+0x1D) [0x612d9d]
 mongos(_ZN5mongo24authenticateInternalUserEPNS_20DBClientWithCommandsE+0xA8) [0x68b598]
 mongos(_ZN5mongo22ShardingConnectionHook8onCreateEPNS_12DBClientBaseE+0x108) [0x960f18]
 mongos(_ZN5mongo16DBConnectionPool8onCreateEPNS_12DBClientBaseE+0x2D) [0x60a3cd]
 mongos(_ZN5mongo16DBConnectionPool13_finishCreateERKSsdPNS_12DBClientBaseE+0x11E) [0x60b6be]
 mongos(_ZN5mongo16DBConnectionPool3getERKNS_16ConnectionStringEd+0xD2) [0x60b8c2]
 mongos(_ZN5mongo20DBClientMultiCommand7sendAllEv+0xAB) [0x9284db]
 mongos(_ZN5mongo14BatchWriteExec12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x70F) [0x99db5f]
 mongos(_ZN5mongo13ClusterWriter10shardWriteERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x2C6) [0x8c3a06]
 mongos(_ZN5mongo13ClusterWriter5writeERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x603) [0x8c4693]
 mongos(_ZN5mongo15ClusterWriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x26A) [0x8cc62a]
 mongos(_ZN5mongo7Command22execCommandClientBasicEPNS_16OperationContextEPS0_RNS_11ClientBasicEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x3EC) [0x95e42c]
 mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x21B) [0x8e2fab]
 mongos(_ZN5mongo8Strategy15clientCommandOpERNS_7RequestE+0x1BD) [0x9703cd]
 mongos(_ZN5mongo7Request7processEi+0x5A1) [0x95d961]
 mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x58) [0x5c8fd8]
 mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0x9fe60b]
 libpthread.so.0(+0x8182) [0x7ff06c86c182]
 libc.so.6(clone+0x6D) [0x7ff06b33447d]
-----  END BACKTRACE  -----

Below is the detail of db.stats() on the major DB where we have 90% of data.

mongos> db.stats();
{
	"raw" : {
		"ec2-52-12-123-167.us-west-2.compute.amazonaws.com:27017" : {
			"db" : "azuga",
			"collections" : 8,
			"objects" : 362467895,
			"avgObjSize" : 765.6894732704534,
			"dataSize" : 277537851600,
			"storageSize" : 286915962896,
			"numExtents" : 184,
			"indexes" : 10,
			"indexSize" : 98088943680,
			"fileSize" : 397024428032,
			"nsSizeMB" : 16,
			"extentFreeList" : {
				"num" : 0,
				"totalSize" : 0
			},
			"dataFileVersion" : {
				"major" : 4,
				"minor" : 22
			},
			"ok" : 1
		},
		"ec2-54-188-151-194.us-west-2.compute.amazonaws.com:27017" : {
			"db" : "azuga",
			"collections" : 3,
			"objects" : 78466712,
			"avgObjSize" : 1008.0000481223171,
			"dataSize" : 79094449472,
			"storageSize" : 82284461520,
			"numExtents" : 61,
			"indexes" : 5,
			"indexSize" : 26999073360,
			"fileSize" : 120134303744,
			"nsSizeMB" : 16,
			"extentFreeList" : {
				"num" : 0,
				"totalSize" : 0
			},
			"dataFileVersion" : {
				"major" : 4,
				"minor" : 22
			},
			"ok" : 1
		}
	},
	"objects" : 440934607,
	"avgObjSize" : 808.2431719200484,
	"dataSize" : 356632301072,
	"storageSize" : 369200424416,
	"numExtents" : 245,
	"indexes" : 15,
	"indexSize" : 125088017040,
	"fileSize" : 517158731776,
	"extentFreeList" : {
		"num" : 0,
		"totalSize" : 0
	},
	"ok" : 1
}



 Comments   
Comment by Spencer Jackson [ 17/Nov/15 ]

That's great! I'll close this ticket, then. It's no problem at all, and I'm glad your cluster is working better now!

Comment by Debasish Nayak [ 17/Nov/15 ]

Hi Spencer,

Looks like increasing the open file count did the trick. The router hasn't gone down since then.

Thanks a lot for the help.

Thanks
Debasish

Comment by Spencer Jackson [ 16/Nov/15 ]

Hi debasishn, just checking in. How is your cluster behaving with the new file handle limits?

Comment by Debasish Nayak [ 11/Nov/15 ]

I've increased the open files limit as per the recommendation to 64000. I'll monitor the service and update the result in couple of days time. Thank you Spencer for the reply.

Comment by Spencer Jackson [ 10/Nov/15 ]

debasishn, it looks like you have an open file limit of 1024 simultaneous files. I'm getting that from the line

open files (-n) 1024

MongoDB will need to open /dev/urandom while performing SCRAM-SHA-1 authentication, which might be putting your open file handles over the edge. Take a look at our recommended ulimit settings here: https://docs.mongodb.org/manual/reference/ulimit/

Does setting this value higher, per that document, fix this problem?

Comment by Debasish Nayak [ 10/Nov/15 ]

Hi Ramon,

Below is what I've without changing anything yet. Its already set to unlimited.

root@ip-10-83-19-157:~# ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 30038
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 30038
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Comment by Ramon Fernandez Marina [ 06/Nov/15 ]

debasishn, this ticket looks like a duplicate of SERVER-21099, and it may be triggered by having a low limit for the number of open files the mongod server can have. Can you please try increasing that limit and let us know whether the issue goes away?

Generated at Thu Feb 08 03:56:59 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.