Type: Question
Resolution: Done
Priority: Major - P3
Affects Version/s: 3.4.24
Component/s: None
Hey Guys
We are seeing some strange behaviour on our MongoDB 3.4 db server. We had a db server outage, which led to us increasing the specifications of the server. While trying to determine the cause of the outage, we noticed something peculiar about the connection count metric before and after the upgrade.
Prior to the upgrade, our db server connections fluctuated between ~12k during off-peak times and ~26k during peak times. During off-peak times, the connections come from our baseline level of application servers running in Docker containers. Each application server is configured with a connection pool size of 50 connections. During peak times, the additional connections come from new containers started in response to autoscaling activities. So it's not that the number of connections per application server instance has increased; the number of application server instances has increased, leading to a higher connection count.
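For context, here is the rough arithmetic behind those numbers (a sketch; the pool size of 50 is our configured value, but the instance counts are implied from it, not measured):

```python
POOL_SIZE = 50  # connections per application server instance (our configured pool size)

def implied_instances(total_connections: int, pool_size: int = POOL_SIZE) -> int:
    """Estimate how many app server instances a given total connection count implies."""
    return total_connections // pool_size

# Pre-upgrade figures
off_peak_instances = implied_instances(12_000)  # baseline container fleet
peak_instances = implied_instances(26_000)      # fleet after autoscaling
print(off_peak_instances, peak_instances)       # 240 520
```

On that basis, the pre-upgrade numbers imply roughly 240 baseline instances scaling up to roughly 520 at peak.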
After the outage and db server upgrade, the connection count now fluctuates between ~24k and ~38k. As far as we can tell, nothing has changed in our traffic patterns, application server configuration, or autoscaling configuration.
Our current theory is that the difference in behaviour is due to our previous database server being grossly underpowered, triggering some unpublicised MongoDB recovery/salvage behaviour. Our old db server had 32 GB of RAM and the new one has 128 GB. Our theory is that the old server suffered constantly from a low-memory condition without us recognising it, and that it closed idle db connections before the configured socket timeout of 30 minutes was reached.
Now that the db server has ample memory, during off-peak times it's not closing connections from the application servers as often as it had to with the previous configuration. This leads to an off-peak connection count of ~24k instead of the previous ~12k.
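A back-of-the-envelope calculation that may support the low-memory theory (a sketch; the ~1 MB-per-connection figure is the commonly cited upper bound for mongod's per-connection service-thread stack, and is an assumption here, not a measurement on our servers):

```python
# Rough memory pressure from connection threads alone on the old server.
# Assumption: each incoming connection costs mongod up to ~1 MB of RAM
# for its service thread's stack.
MB_PER_CONNECTION = 1

def connection_overhead_gb(connections: int, mb_per_conn: int = MB_PER_CONNECTION) -> float:
    """Approximate RAM consumed by connection threads alone, in GB."""
    return connections * mb_per_conn / 1024

OLD_RAM_GB = 32
peak_overhead = connection_overhead_gb(26_000)  # old peak of ~26k connections
print(round(peak_overhead, 1), "GB of", OLD_RAM_GB, "GB")  # 25.4 GB of 32 GB
```

Under that assumption, ~26k connections could by themselves approach the old server's 32 GB of RAM before the working set is even counted, which would be consistent with the server shedding idle connections under memory pressure.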
Does our analysis sound correct? Does such undocumented db server connection handling behaviour exist in the code?
An alternative theory is that with the previous 32 GB RAM configuration the 30-minute socket timeout was being enforced religiously, but with the 128 GB configuration it is now being ignored through some undocumented db server behaviour.
Please have a look and see whether the db server code backs up either of the above theories.
Thanks