Priority: Major - P3
Affects Version/s: 2.5.0
Fix Version/s: None
In certain failure modes, retries to a failed config server can take several seconds and block queries to the secondary and tertiary config servers. When possible, mongos should be smarter about reading from the other config servers while one is unavailable. This especially impacts authenticated clusters: authentication data is not cached in mongos, so each new authenticated connection is slow to respond at first.
1. The first config server goes down and is unresponsive on the network, but does not reject packets.
2. A new authenticated connection is created to mongos.
3. Mongos tries to read from the first config server and, before the read, tries to reconnect. The reconnect eventually fails, but only after the several-second timeout.
4. Mongos successfully reads from the second config server, but the response time is bad.
5. This repeats for every subsequent new connection: each one waits for the full timeout, even though the first config server is already known to be unavailable.
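
One possible approach to the steps above, sketched in Python for illustration only (this is not mongos code, and every name here, including FailedHostTracker, read_from_any, and the 30-second cool-down, is a hypothetical assumption): remember which config servers failed recently and try them last, so that only the first new connection pays the timeout.

```python
import time


class FailedHostTracker:
    """Hypothetical helper: remembers hosts that failed recently."""

    def __init__(self, cooldown_secs=30.0, clock=time.monotonic):
        self._cooldown = cooldown_secs  # assumed value, not from the ticket
        self._clock = clock
        self._failed_at = {}  # host -> time of most recent failure

    def mark_failed(self, host):
        self._failed_at[host] = self._clock()

    def mark_ok(self, host):
        self._failed_at.pop(host, None)

    def is_suspect(self, host):
        failed = self._failed_at.get(host)
        return failed is not None and self._clock() - failed < self._cooldown

    def ordered(self, hosts):
        # Stable sort: healthy hosts keep their order, suspects move to the end.
        return sorted(hosts, key=self.is_suspect)


def read_from_any(hosts, tracker, read_fn):
    """Try each host in turn, preferring hosts that have not failed recently."""
    last_error = None
    for host in tracker.ordered(hosts):
        try:
            result = read_fn(host)
        except OSError as exc:  # includes TimeoutError and ConnectionError
            tracker.mark_failed(host)
            last_error = exc
            continue
        tracker.mark_ok(host)
        return result
    raise last_error or OSError("no config servers given")
```

With a tracker shared across connections, the first read after the outage still waits for the timeout on the failed server, but later reads go straight to the second config server until the cool-down expires and the failed server is tried again.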