Core Server / SERVER-9916

be smarter about config server retries in non-responsive situations

    • Type: Improvement
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.5.0
    • Component/s: Sharding
    • Labels: None

      In certain failure modes, retries against a failed config server can take several seconds and block queries to the second and third config servers. Where possible, mongos should be smarter about reading from the other config servers when one is unavailable. This especially affects authenticated clusters: mongos does not cache authentication data, so new authenticated connections are slow to respond at first.
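One possible reading of "be smarter" is for mongos to remember which config server failed recently and try the others first until a cooldown expires. The Python sketch below is a hypothetical illustration of that idea, not mongos code: `ConfigServerSelector`, `read_with_failover`, the host names, and the 30-second cooldown are all assumptions.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class ConfigServerSelector:
    """Hypothetical helper: tries config servers that failed recently last."""

    def __init__(self, hosts: List[str], cooldown_secs: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.hosts = list(hosts)
        self.cooldown_secs = cooldown_secs  # assumed value, not from mongos
        self.clock = clock
        self._last_failure: Dict[str, float] = {}  # host -> time of last failure

    def mark_failed(self, host: str) -> None:
        self._last_failure[host] = self.clock()

    def mark_ok(self, host: str) -> None:
        self._last_failure.pop(host, None)

    def recently_failed(self, host: str) -> bool:
        failed_at = self._last_failure.get(host)
        return failed_at is not None and self.clock() - failed_at < self.cooldown_secs

    def candidates(self) -> List[str]:
        # Evaluate the clock once per host so no host lands in both groups.
        down = {h for h in self.hosts if self.recently_failed(h)}
        # Healthy servers keep their configured order and go first; recently
        # failed ones stay available as a last resort.
        return [h for h in self.hosts if h not in down] + \
               [h for h in self.hosts if h in down]


def read_with_failover(selector: ConfigServerSelector,
                       read_fn: Callable[[str], T]) -> T:
    """Try each candidate in order, recording which hosts fail."""
    last_error: Exception = RuntimeError("no config servers configured")
    for host in selector.candidates():
        try:
            result = read_fn(host)
        except OSError as exc:  # ConnectionError and TimeoutError are OSError subclasses
            selector.mark_failed(host)
            last_error = exc
            continue
        selector.mark_ok(host)
        return result
    raise last_error
```

Keeping a recently failed server at the end of the list, rather than dropping it, means a read can still succeed if that server is the only one left; once the cooldown expires it is probed again in its normal position.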

      Example:
      1. The first config server goes down and stops responding on the network, but does not reject packets, so connection attempts hang instead of failing fast.
      2. A new authenticated connection is created to mongos.
      3. Mongos tries to read from the first config server and, before the read, tries to reconnect to it. The reconnect eventually fails, but only after a timeout of several seconds.
      4. Mongos then reads successfully from the second config server, but the response time is poor.
      5. This repeats for every subsequent new connection: each one waits for the full timeout again, even though the first config server is still unavailable.
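To make the cost in steps 3 to 5 concrete, the toy model below counts the latency each new connection sees, with and without mongos remembering that the first server already failed. The function is hypothetical, and the 5000 ms timeout and 10 ms read time are assumed values, not measurements.

```python
from typing import List


def connection_latencies_ms(num_connections: int, timeout_ms: int,
                            read_ms: int, remember_failures: bool) -> List[int]:
    """Model the latency each new authenticated connection sees while the
    first config server silently drops packets (steps 2-5 above)."""
    latencies = []
    first_server_known_down = False
    for _ in range(num_connections):
        latency = 0
        if not (remember_failures and first_server_known_down):
            # Step 3: the reconnect attempt hangs until the full timeout.
            latency += timeout_ms
            first_server_known_down = True
        # Step 4: the read then succeeds against the second config server.
        latency += read_ms
        latencies.append(latency)
    return latencies


# Assumed numbers: a 5-second timeout and a 10 ms read.
print(connection_latencies_ms(4, 5000, 10, remember_failures=False))  # [5010, 5010, 5010, 5010]
print(connection_latencies_ms(4, 5000, 10, remember_failures=True))   # [5010, 10, 10, 10]
```

Under these assumptions, the current behavior charges every new connection the full timeout, while remembering the failure confines that cost to the first connection.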

            Assignee:
            Unassigned
            Reporter:
            Greg Studer (greg_10gen)
            Votes:
            3
            Watchers:
            8
