Major - P3
sharded cluster, 3 config servers, auth
For MongoDB sharded clusters with authentication enabled, authentication requests on new connections query the first config server if the authentication data is not already cached. If this config server is unresponsive, the request waits for a 30-second timeout before the next config server is contacted. These timeouts can delay new connections, which shows up as slow queries or other slow operations. An internal mongos server parameter, internalSCCAllowFastestAuthConfigReads, was added to enable reading authentication data from the first config server to respond.
In authenticated environments, when the first config server becomes unresponsive (note: this is different from the config server shutting down as connections would then fail immediately) and authentication data is not cached, queries and other operations can be delayed by up to 30 seconds.
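The difference between an unresponsive and a shut-down config server can be illustrated with a small model. This is only a sketch of the behavior described above, not mongos code: the function name and the three state labels are hypothetical, and the only figure taken from the ticket is the 30-second timeout.

```python
AUTH_READ_TIMEOUT_SECS = 30  # timeout quoted in this ticket


def auth_read_delay(config_servers, timeout=AUTH_READ_TIMEOUT_SECS):
    """Return the seconds spent waiting before some config server answers.

    Each entry is one of (hypothetical labels for this sketch):
      "ok"           - answers right away
      "down"         - connection fails immediately
      "unresponsive" - connection hangs until the timeout fires
    Servers are tried strictly in order, as in the default mechanism.
    """
    waited = 0
    for state in config_servers:
        if state == "ok":
            return waited
        if state == "unresponsive":
            waited += timeout
        # "down": the failure is immediate, so no waiting time is added
    raise RuntimeError("no config server answered")


print(auth_read_delay(["unresponsive", "ok", "ok"]))
print(auth_read_delay(["down", "ok", "ok"]))
```

In this model only the hanging first server adds delay; a first server that refuses connections costs nothing, which is why the firewall workaround helps.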
The preferred workaround is to block the first config server using a firewall (e.g. with iptables) so that connections to it fail immediately. In this case, the second config server is contacted without the 30-second delay. If this is not possible, the internal mongos parameter internalSCCAllowFastestAuthConfigReads can be used to work around the issue.
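As a rough illustration of both workarounds, the helpers below only build command lines and do not run anything. Several details are assumptions rather than facts from this ticket: the address 192.0.2.10 is a placeholder, 27019 is assumed to be the config server port, a REJECT rule with a TCP reset is assumed to be what makes connections fail at once (a DROP rule would leave them hanging), and the parameter is assumed to be a boolean that can be set at mongos startup with --setParameter.

```python
def reject_rule(config_server_ip, port=27019):
    """Build an iptables command that rejects outgoing TCP connections
    to the given config server (hypothetical helper; port is assumed)."""
    return [
        "iptables", "-A", "OUTPUT",
        "-d", config_server_ip,
        "-p", "tcp", "--dport", str(port),
        # REJECT with a TCP reset so the client sees an immediate failure
        "-j", "REJECT", "--reject-with", "tcp-reset",
    ]


def fastest_auth_reads_option(enabled=True):
    """Build the mongos startup option for the internal parameter.
    Assumes a boolean value; check the server documentation first."""
    value = "true" if enabled else "false"
    return ["--setParameter",
            "internalSCCAllowFastestAuthConfigReads=" + value]


print(" ".join(reject_rule("192.0.2.10")))
print(" ".join(fastest_auth_reads_option()))
```

The firewall rule would be applied on the mongos hosts and removed once the config server recovers.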
All previous versions are affected by this issue.
The fix is included in the 2.6.2 production release.
For authentication requests (and only for those), a parameter internalSCCAllowFastestAuthConfigReads was added to allow all three config servers to be queried concurrently. To ensure consistent reads of all other metadata, all other requests use the normal mechanism of contacting the first config server, with a 30-second timeout.
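The idea of querying all config servers concurrently and taking the first answer can be sketched as follows. This is a simulation under stated assumptions, not the mongos implementation: `read_auth_data` is a hypothetical stand-in that sleeps instead of talking to a server, and the server names and delays are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def read_auth_data(server, delay):
    """Hypothetical stand-in for an auth-data query; sleeps to simulate latency."""
    time.sleep(delay)
    return server


def fastest_auth_read(delays, timeout=1.0):
    """Query every config server at once and return the first good answer.

    `delays` maps a server name to its simulated response time in seconds.
    Raises concurrent.futures.TimeoutError if nobody answers in `timeout`.
    """
    pool = ThreadPoolExecutor(max_workers=len(delays))
    try:
        futures = [pool.submit(read_auth_data, s, d) for s, d in delays.items()]
        for fut in as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                return fut.result()
        raise RuntimeError("no config server answered")
    finally:
        # Do not wait for the slow servers; their results are discarded.
        pool.shutdown(wait=False)


winner = fastest_auth_read({"cfg1": 0.5, "cfg2": 0.01, "cfg3": 0.05})
print(winner)
```

Here the slow first server no longer holds up the read. This is acceptable for authentication data only; as noted above, other metadata reads keep the ordered mechanism to stay consistent.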
Normal collection operations do not touch the config servers, but other operations do:
- creating database
- creating collection
Possible approaches:
- send reads to all config servers (maybe with a tiny backoff) and respond with the first response (maybe with a threshold) (preferred)
- blacklist the unresponsive config server (a bit ugly + racy)
- is duplicated by
SERVER-13323 listDBs block when first mongo config server is down
SERVER-9916 be smarter about config server retries in non-responsive situations