  1. Core Server
  2. SERVER-51397

Mongos fails to serve requests for 30 secs when losing comm with one config replica set node

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major - P3
    • None
    • Affects Version/s: 4.4.1
    • Component/s: Sharding
    • Labels: None
    • Sharding
    • ALL

      This can be reproduced easily with the following js test:

      primary_config_server_blackholed_from_mongos_mine.js

      To make it fail consistently, the kConfigReadSelector needs to be set to ReadPreference::Nearest.

    • 32

      Both ShardRegistry and CatalogCache lookups trigger a Shard::exhaustiveFindOnConfig() that has a default 30s timeout.

      Consider the following scenario:

      1. A mongos accesses its ShardRegistry, producing a cache miss on the underlying ReadThroughCache.
      2. A ShardRegistry::lookup targeting the nearest config replica set node is started.
      3. Communication with that specific config replica set node is lost due to a network partition.
      4. The RSM marks the host as failed.

      All subsequent requests that hit the same mongos, require access to the ShardRegistry, and arrive before the current lookup times out will try to join the ongoing ShardRegistry::lookup started in step 2.

      All those requests will fail with `NetworkInterfaceExceededTimeLimit` as soon as the original lookup times out.
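
      The scenario above can be illustrated with a small simulation. This is only a sketch of the general "join the in-flight lookup" pattern, not MongoDB's ReadThroughCache implementation: `ToyReadThroughCache`, `LookupTimeout`, `blackholed_lookup`, and `simulate` are hypothetical names, and the 30s timeout is shortened to 0.5s so the example finishes quickly.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class LookupTimeout(Exception):
    """Hypothetical stand-in for NetworkInterfaceExceededTimeLimit."""


class ToyReadThroughCache:
    """Toy cache: concurrent misses share one in-flight lookup (assumed model)."""

    def __init__(self, lookup_fn):
        self._lookup_fn = lookup_fn
        self._lock = threading.Lock()
        self._in_flight = None  # Future of the ongoing or completed lookup
        self.lookups_started = 0

    def acquire(self):
        with self._lock:
            if self._in_flight is None:
                # Miss with no lookup running: start exactly one.
                self._in_flight = Future()
                self.lookups_started += 1
                threading.Thread(
                    target=self._run, args=(self._in_flight,), daemon=True
                ).start()
            future = self._in_flight
        # Every caller blocks on the same future and shares its outcome.
        return future.result()

    def _run(self, future):
        try:
            value = self._lookup_fn()
        except Exception as exc:
            with self._lock:
                self._in_flight = None  # let a later request retry
            future.set_exception(exc)
        else:
            # The completed future stays in place and acts as the cached value.
            future.set_result(value)


def blackholed_lookup(timeout_s):
    """Models a lookup sent to an unreachable node: it only ends by timing out."""
    time.sleep(timeout_s)
    raise LookupTimeout("lookup exceeded its time limit")


def simulate(num_requests=5, timeout_s=0.5):
    cache = ToyReadThroughCache(lambda: blackholed_lookup(timeout_s))

    def request(delay_s):
        time.sleep(delay_s)  # requests arrive while the lookup is pending
        try:
            cache.acquire()
            return "ok"
        except LookupTimeout:
            return "timeout"

    # Stagger arrivals across the first 40% of the timeout window.
    delays = [i * timeout_s * 0.4 / num_requests for i in range(num_requests)]
    with ThreadPoolExecutor(max_workers=num_requests) as pool:
        outcomes = list(pool.map(request, delays))
    return outcomes, cache.lookups_started


if __name__ == "__main__":
    print(simulate())
```

      In this sketch only one lookup is started, and every request that joined it reports the same timeout once the shared lookup fails. That mirrors the behavior described above under the assumption that joiners cannot be redirected to another node.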

       

      In practice, even if there is more than one config server replica set node and ReadPreference::Nearest is used to fetch data from them, losing communication with one of them can leave the mongos unable to serve any request for up to 30 secs.

      The same reasoning can be applied to the CatalogCache because it also builds on top of the ReadThroughCache and implements the lookups through the same Shard::exhaustiveFindOnConfig().

            Assignee: backlog-server-sharding [DO NOT USE] Backlog - Sharding Team
            Reporter: tommaso.tocci@mongodb.com Tommaso Tocci
            Votes: 0
            Watchers: 4

              Created:
              Updated:
              Resolved: