Expected: All pool connections reconnect properly after a new primary is elected
Observed: Some connections in the pool queue queries indefinitely
Given a three-machine Mongo 3.2.17 replicaset with at least one sharded collection, and a fourth machine connecting to that replicaset through a local mongos under node-mongodb-native 2.2.33 (and all other versions we tested), we see the following when we lose a primary abruptly (e.g. the primary machine or process crashes). The replicaset elects a new primary just fine, and this is reflected in the mongos logs. However, node-mongodb-native ends up with some connections in its pool hung indefinitely: they queue queries without either completing them or returning errors.
Here is a test script that will demonstrate the problem when run against that configuration:
(also attached as repeatCounts.js)
The surest way to reproduce the issue is to run something like that script and abruptly kill the network on the replicaset primary, e.g. with `sudo ifconfig eth0 down`.
We have been unable to find any mongodb, mongos, or node-mongodb-native configuration option that makes this behave as expected (that is, makes the bad connections to mongos reconnect). We have resorted to detecting this condition in application code by looking for queries that stack up or hang. That takes longer than we would like, which leads to a partial or complete outage until the bad connections are detected.
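For illustration, the application-level detection described above could look roughly like the following. This is a minimal sketch, not our production code: `createQueryTracker`, `track`, `stuckCount`, and `pendingCount` are hypothetical names, none of them are part of node-mongodb-native, and the sketch assumes each query is issued through a function that returns a promise.

```javascript
// Hypothetical in-flight query tracker. Nothing here is a driver API.
// maxAgeMs: how long a query may stay pending before it counts as stuck.
function createQueryTracker(maxAgeMs) {
  const inFlight = new Map(); // query id -> start timestamp in ms
  let nextId = 0;

  return {
    // Run a promise-returning query function and record it until it settles.
    // `now` is injectable so the tracker can be exercised without real delays.
    track(runQuery, now = Date.now()) {
      const id = nextId++;
      inFlight.set(id, now);
      const done = () => inFlight.delete(id);
      // Drop the entry whether the query succeeds or fails.
      return runQuery().then(
        (value) => { done(); return value; },
        (err) => { done(); throw err; }
      );
    },

    // Number of queries that have been pending for longer than maxAgeMs.
    stuckCount(now = Date.now()) {
      let count = 0;
      for (const startedAt of inFlight.values()) {
        if (now - startedAt > maxAgeMs) count++;
      }
      return count;
    },

    // Number of queries that have not settled yet, stuck or not.
    pendingCount() {
      return inFlight.size;
    }
  };
}
```

An application might poll `stuckCount()` on a timer and, when it stays above zero, close and reopen its connection to mongos. The drawback is the one noted above: nothing is detected until at least `maxAgeMs` has elapsed, so queries keep hanging in the meantime.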