Type: Task
Resolution: Done
Priority: Major - P3
Affects Version/s: None
Component/s: None
The essence of the guidance being sought might best be summarized as: "when using the native Mongo driver, what is the recommended best practice for maintaining a durable connection to MongoDB from a long-running Node process?"
At present, the code in the LAP Collections (Node) process establishes a connection to Mongo once at startup and then attempts to maintain that connection indefinitely. It relies on the auto-reconnect feature of the Mongo connection to survive outages, and listens for 'close' and 'reconnect' events on that connection to mark the status of the connection as active or inactive.
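For concreteness, a minimal sketch of that pattern, assuming a 2.x-era native driver where MongoClient.connect yields a Db that emits 'close' and 'reconnect' events (the names collectionsDb and mongoActive are illustrative, not the actual Collections code):

    const MongoClient = require('mongodb').MongoClient;

    let mongoActive = false;   // flag toggled by the 'close'/'reconnect' events
    let collectionsDb = null;  // single Db handle established at startup

    MongoClient.connect('mongodb://localhost:27017/collections', function (err, db) {
      if (err) throw err;      // startup failure is fatal in this sketch
      collectionsDb = db;
      mongoActive = true;

      // The process relies on the driver's auto-reconnect to survive outages;
      // these listeners only track its status.
      db.on('close', function () { mongoActive = false; });
      db.on('reconnect', function () { mongoActive = true; });
    });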
This active/inactive flag is then used to short-circuit any processing that would ultimately invoke Mongo operations: if a request is received while the Mongo connection is marked 'inactive' (based on receipt of the events mentioned above), that processing is terminated immediately, before any Mongo operation is attempted.
As you would expect, error handling is embedded in the individual calls to the Mongo driver (insert, update, upsert, etc.), but to reiterate: in the current approach these calls are never made if the status of the connection is seen as 'inactive'.
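Roughly, the guard plus per-call error handling looks like the sketch below; handleInsert and the 'items' collection are illustrative names (not the actual Collections code), and mongoActive/collectionsDb come from the earlier sketch:

    // Continues the earlier sketch: mongoActive and collectionsDb are set at startup.
    function handleInsert(doc, callback) {
      // Short-circuit: never touch the driver while the connection is marked down.
      if (!mongoActive) {
        return callback(new Error('Mongo connection inactive; request rejected'));
      }

      collectionsDb.collection('items').insertOne(doc, function (err, result) {
        if (err) {
          // Per-call error handling for this individual Mongo operation.
          return callback(err);
        }
        callback(null, result.insertedCount);
      });
    }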
Finally, there are a couple of problems that we have seen with the current approach:
(1) If the connection drops and reconnects more than once after Collections has established it, the 2nd through Nth of these drops do not fire the events that Collections is counting on to mark the connection internally as down and then back up again; in other words, the scheme only works for the first such outage. This appears to happen because code in the driver uses '.once()' to register its internal listeners for these events on the internal 'topology' object, so they are fired at most once (see the sketch after this list).
(2) Any Mongo outage on an active connection that lasts longer than 30 seconds leaves that connection unusable by whichever Collections worker is holding it (auto-reconnect will not happen), and the situation can only be rectified by restarting that worker's Node process. In the case of a replica farm, if we lose all of the members together for more than 30 seconds, then all of the Node workers would need to be restarted once the members come back up.
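For what it's worth, the point in (1) comes down to standard EventEmitter semantics, shown with a plain Node emitter below (a toy illustration, not the driver's actual internals): a '.once()' listener is removed after its first invocation, so anything keyed off it stops reacting after the first drop/reconnect cycle.

    const EventEmitter = require('events');
    const topology = new EventEmitter();  // stand-in for the driver's internal topology

    // Registered with .once(): removed after the first 'close' event.
    topology.once('close', function () { console.log('close handled (first time only)'); });

    // Registered with .on(): fires for every 'close' event.
    topology.on('close', function () { console.log('close handled (every time)'); });

    topology.emit('close');  // both listeners fire
    topology.emit('close');  // only the .on() listener fires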
The second problem can be mitigated to some degree by tweaking the driver's auto-reconnect settings: the current 30-second limit comes from the driver re-attempting the connection every 1000 milliseconds up to a maximum of 30 times, and both of those values are configurable.
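For reference, a sketch of raising those two settings (reconnectInterval and reconnectTries, whose product gives the roughly 30-second window); the values shown are illustrative and the exact option placement varies by driver version:

    const MongoClient = require('mongodb').MongoClient;

    // 300 attempts * 1000 ms allows roughly 5 minutes of reconnect attempts
    // instead of the default 30 * 1000 ms = ~30 seconds.
    MongoClient.connect('mongodb://localhost:27017/collections', {
      autoReconnect: true,
      reconnectTries: 300,     // default 30
      reconnectInterval: 1000  // default 1000 ms between attempts
    }, function (err, db) {
      if (err) throw err;
      // ... same 'close'/'reconnect' handling as in the earlier sketch ...
    });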
So that is a summary of where we are currently; we would greatly appreciate any guidance on how we should change or update our approach. The goal is to be able to maintain durable connections to Mongo from within these long-running Node processes.