[SERVER-51397] Mongos fails to serve requests for 30 secs when losing comm with one config replica set node Created: 06/Oct/20 Updated: 06/Dec/22 Resolved: 22/Oct/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tommaso Tocci | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Sharding
|
||||||||
| Operating System: | ALL | ||||||||
| Steps To Reproduce: | This can be reproduced easily with the following js test: primary_config_server_blackholed_from_mongos_mine.js. To make it fail consistently, kConfigReadSelector needs to be set to ReadPreference::Nearest. |
||||||||
| Participants: | |||||||||
| Linked BF Score: | 32 | ||||||||
| Description |
|
Both ShardRegistry and CatalogCache lookups trigger a Shard::exhaustiveFindOnConfig() call that has a default 30-second timeout. Consider the scenario where the mongos loses communication with one config server replica set node while an in-progress ShardRegistry lookup is targeting that node.
All subsequent requests that hit the same mongos, require access to the ShardRegistry, and arrive before the current lookup times out will try to join that ongoing ShardRegistry::lookup. As soon as the original lookup times out, all of those joined requests fail with `NetworkInterfaceExceededTimeLimit`.
In practice, even with more than one config server replica set node and even when using ReadPreference::nearest to fetch data from them, losing communication with a single node can leave the mongos unable to serve any request for up to 30 seconds. The same reasoning applies to the CatalogCache, because it is also built on top of the ReadThroughCache and implements its lookups through the same Shard::exhaustiveFindOnConfig(). |
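The failure mode above hinges on the read-through pattern: every concurrent caller joins the single in-flight lookup, so one slow lookup fails them all at once. A minimal sketch of that behavior, in Python rather than the server's C++ (`ReadThroughCache`, `exhaustiveFindOnConfig`, and the class/function names below are illustrative, not MongoDB's actual API):

```python
import concurrent.futures
import threading
import time

class ReadThroughCacheSketch:
    """Toy read-through cache: concurrent callers join one in-flight lookup."""

    def __init__(self, lookup_fn):
        self._lookup_fn = lookup_fn
        self._lock = threading.Lock()
        self._in_flight = None  # Future shared by all waiting callers
        self._executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

    def get(self, key):
        with self._lock:
            if self._in_flight is None:
                # First caller starts the lookup; later callers join it.
                self._in_flight = self._executor.submit(self._run_lookup, key)
            fut = self._in_flight
        # All joined callers share the single outcome, success or failure.
        return fut.result()

    def _run_lookup(self, key):
        try:
            return self._lookup_fn(key)
        finally:
            with self._lock:
                self._in_flight = None

def failing_lookup(key):
    # Stand-in for a config-server read targeting an unreachable node:
    # the request hangs until the deadline, then fails for every waiter.
    time.sleep(0.2)  # scaled-down stand-in for the 30-second timeout
    raise TimeoutError("NetworkInterfaceExceededTimeLimit")

cache = ReadThroughCacheSketch(failing_lookup)

# Three requests arrive while the first lookup is still in flight.
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(cache.get, "shard-metadata") for _ in range(3)]
    errors = [type(f.exception()).__name__ for f in futures]

print(errors)  # every joined caller fails with the same timeout error
```

The point of the sketch is that only one lookup ever runs, so its 30-second deadline becomes a shared fate for every request that arrived in the meantime, which is why a single blackholed node can stall the whole mongos.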
| Comments |
| Comment by Kaloian Manassiev [ 22/Oct/20 ] |
|
Due to the reliability of modern data centre networks, this situation is extremely unlikely to happen in practice, and if it does, it will resolve itself after 30 seconds. Because of this, it is not something we will invest time in improving. |