- Type: Bug
- Resolution: Won't Do
- Priority: Major - P3
- Affects Version/s: 4.4.1
- Component/s: Sharding
Both ShardRegistry and CatalogCache lookups trigger a call to Shard::exhaustiveFindOnConfig(), which has a default timeout of 30 seconds.
Consider the following scenario:
1. A mongos accesses its ShardRegistry, producing a cache miss on the underlying ReadThroughCache.
2. A ShardRegistry::lookup targeting the nearest config replica set node is started.
3. Communication with that specific config replica set node is lost due to a network partition.
4. The RSM marks the host as failed.

All subsequent requests that hit the same mongos, require access to the ShardRegistry, and arrive before the current lookup times out will try to join the ongoing ShardRegistry::lookup started in step 2.
All of those requests will fail with `NetworkInterfaceExceededTimeLimit` as soon as the original lookup times out.
In practice, even if we have more than one config server replica set node and we use ReadPreference::nearest to fetch data from them, losing communication with just one of them can leave the mongos unable to serve any request for up to 30 seconds.
The same reasoning applies to the CatalogCache, because it is also built on top of the ReadThroughCache and implements its lookups through the same Shard::exhaustiveFindOnConfig().
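The joining behaviour described above can be illustrated with a small Python sketch. This is not the server's ReadThroughCache implementation: `ReadThroughCacheSketch`, `LookupTimeout`, `partitioned_lookup` and `run_scenario` are hypothetical names invented for this example, and the 30-second timeout is shortened so that the sketch runs quickly. It only shows why a single timed-out lookup fails every request that joined it.

```python
import threading
import time
from concurrent.futures import Future


class LookupTimeout(Exception):
    """Hypothetical stand-in for NetworkInterfaceExceededTimeLimit."""


class ReadThroughCacheSketch:
    """Toy read-through cache: concurrent misses on a key share one lookup."""

    def __init__(self, lookup_fn):
        self._lookup_fn = lookup_fn
        self._lock = threading.Lock()
        self._values = {}
        self._in_flight = {}  # key -> Future of the ongoing lookup
        self.lookups_started = 0

    def acquire(self, key):
        with self._lock:
            if key in self._values:
                return self._values[key]
            future = self._in_flight.get(key)
            owner = future is None
            if owner:
                # Cache miss with no ongoing lookup: this caller starts one.
                future = Future()
                self._in_flight[key] = future
                self.lookups_started += 1
        if owner:
            try:
                value = self._lookup_fn(key)
            except Exception as exc:
                with self._lock:
                    del self._in_flight[key]
                # Every caller that joined this lookup observes the same error.
                future.set_exception(exc)
            else:
                with self._lock:
                    self._values[key] = value
                    del self._in_flight[key]
                future.set_result(value)
        # Owner and joiners alike wait on the shared future.
        return future.result()


def partitioned_lookup(key, timeout_s):
    """Simulates a lookup against an unreachable node that ends in a timeout."""
    time.sleep(timeout_s)
    raise LookupTimeout(f"lookup for {key!r} exceeded {timeout_s}s")


def run_scenario(num_requests=5, timeout_s=0.5):
    """Returns (lookups started, failed requests) for requests on one key."""
    cache = ReadThroughCacheSketch(lambda key: partitioned_lookup(key, timeout_s))
    errors = []

    def request():
        try:
            cache.acquire("shards")
        except LookupTimeout as exc:
            errors.append(exc)

    threads = [threading.Thread(target=request) for _ in range(num_requests)]
    for t in threads:
        t.start()
        time.sleep(0.01)  # requests arrive while the first lookup is pending
    for t in threads:
        t.join()
    return cache.lookups_started, len(errors)


if __name__ == "__main__":
    print(run_scenario())
```

In this sketch only the first request starts a lookup; the others join it, so none of them gets a chance to retry against a healthy node until the shared lookup has already failed.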
- related to: SERVER-51406 Wait for failed configsvr replica set host discovery (Closed)