Type: Bug
Resolution: Won't Do
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 4.4.1
Component/s: Sharding
Labels:
None

Assigned Teams:

Sharding
Operating System:
ALL
Steps To Reproduce:

This can be reproduced easily with the following js test:

primary_config_server_blackholed_from_mongos_mine.js

To make it fail consistently, the kConfigReadSelector needs to be set to ReadPreference::Nearest.
Linked BF Score:
32
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Both ShardRegistry and CatalogCache lookups trigger a Shard::exhaustiveFindOnConfig() that has a default 30s timeout.

Consider the following scenario:

1. A mongos accesses its ShardRegistry, producing a cache miss on the underlying ReadThroughCache.
2. A ShardRegistry::lookup targeting the nearest config replica set node is started.
3. Communication with that specific config replica set node is lost due to a network partition.
4. The RSM marks the host as failed.
5. All subsequent requests that hit the same mongos, require access to the ShardRegistry, and arrive before the current lookup times out will try to join the ongoing ShardRegistry::lookup started in step 2.
6. All those requests will fail with `NetworkInterfaceExceededTimeLimit` as soon as the original lookup times out.
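
The joining behavior in the scenario above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions, not the server's C++ implementation: `JoiningCache`, `LookupTimeout`, `blackholed_lookup`, and `simulate` are hypothetical names, the 30s timeout is scaled down to 0.3s, and the blackholed config node is modeled as a lookup that sleeps and then raises.

```python
import threading
import time
from concurrent.futures import Future, ThreadPoolExecutor


class LookupTimeout(Exception):
    """Hypothetical stand-in for NetworkInterfaceExceededTimeLimit."""


class JoiningCache:
    """Toy read-through cache: concurrent misses share one in-flight lookup."""

    def __init__(self, lookup_fn):
        self._lookup_fn = lookup_fn
        self._lock = threading.Lock()
        self._in_flight = None  # Future of the ongoing lookup, if any
        self.lookups_started = 0

    def acquire(self):
        with self._lock:
            if self._in_flight is None:
                # Cache miss and no lookup in progress: this caller starts one.
                self._in_flight = Future()
                self.lookups_started += 1
                owner = True
            else:
                # A lookup is already running: join it instead of starting another.
                owner = False
            fut = self._in_flight
        if owner:
            try:
                fut.set_result(self._lookup_fn())
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                with self._lock:
                    self._in_flight = None
        # Every caller, owner or joiner, observes the same result or error.
        return fut.result()


def blackholed_lookup(timeout_s):
    """Models a lookup sent to an unreachable node: blocks, then times out."""
    time.sleep(timeout_s)
    raise LookupTimeout("config node unreachable")


def simulate(num_requests=5, timeout_s=0.3):
    cache = JoiningCache(lambda: blackholed_lookup(timeout_s))

    def request(delay):
        time.sleep(delay)  # requests arrive while the first lookup is pending
        try:
            cache.acquire()
            return "ok"
        except LookupTimeout:
            return "timeout"

    delays = [i * 0.02 for i in range(num_requests)]
    with ThreadPoolExecutor(max_workers=num_requests) as pool:
        results = list(pool.map(request, delays))
    return results, cache.lookups_started


if __name__ == "__main__":
    print(simulate())
```

In this model only one lookup is ever started, and every request that arrives before it times out inherits the same failure, even though other nodes could have answered.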

In practice, even if we have more than one config server replica set node and even if we are using ReadPreference::Nearest to fetch data from them, losing communication with just one of them can leave the mongos unable to serve any request for up to 30 seconds.

The same reasoning can be applied to the CatalogCache because it also builds on top of the ReadThroughCache and implements the lookups through the same Shard::exhaustiveFindOnConfig().

primary_config_server_blackholed_from_mongos_mine.js
1 kB
Oct 06 2020 06:30:56 PM UTC

related to

SERVER-51406 Wait for failed configsvr replica set host discovery

Closed

Assignee: [DO NOT USE] Backlog - Sharding Team
Reporter: Tommaso Tocci
Participants: [DO NOT USE] Backlog - Sharding Team, Kaloian Manassiev, Tommaso Tocci
Votes: 0
Watchers: 4

Created: Oct 06 2020 05:32:54 PM UTC
Updated: Dec 06 2022 02:06:46 AM UTC
Resolved: Oct 22 2020 03:57:10 PM UTC
