[SERVER-83530] Handle QueryPlanKilled on shard_server_catalog_cache_loader_test unit test Created: 22/Nov/23  Updated: 31/Jan/24  Resolved: 21/Dec/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.3.0-rc0

Type: Bug Priority: Major - P3
Reporter: Jordi Serra Torrens Assignee: David Dominguez Sal
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Fix
Related
related to SERVER-86013 Fix retry for getChunksSince in shard... Closed
Assigned Teams:
Catalog and Routing
Backwards Compatibility: Fully Compatible
Sprint: CAR Team 2023-12-25
Participants:
Linked BF Score: 14
Story Points: 1

 Description   

ShardServerCatalogCacheLoader::getChunksSince can throw StaleConfig under some interleavings between reading the cache and the background thread that persists the materialized cache. In practice, the CatalogCache handles this by retrying, so it causes no harm.
However, this race can cause failures in the shard_server_catalog_cache_loader_test unit test (e.g. here). We can address this by making the test expect this failure and retry. Alternatively, we could make ShardServerCatalogCacheLoader itself retry.

The interleaving that can cause this is:
1. SSCCL discovers the new epoch.
2. Next, it schedules an asynchronous task to update the persisted metadata.
3. Next, it calls `_getLoaderMetadata`, which calls `getIncompletePersistedMetadataSinceVersion`, which calls `getPersistedMetadataSinceVersion`, which finally calls `readShardChunks`. readShardChunks reads from the config.cache.xxxx collection.
4. Concurrently with the read (3), the task scheduled at (2) proceeds to drop the config.cache.xxxx collection (because the epoch has changed).
5. The read started at (3) yields, and on restore it discovers that the collection no longer exists, so it fails with QueryPlanKilled.
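
The retry option could look roughly like the sketch below. This is only an illustration, not the actual patch: QueryPlanKilledError, retryOnQueryPlanKilled and FakeLoader are hypothetical stand-ins and do not use the real server types. The fake loader fails a fixed number of times to imitate the read in (3) losing the race with the drop in (4).

```cpp
#include <stdexcept>

// Hypothetical stand-in for the server's QueryPlanKilled error (not the real mongo type).
struct QueryPlanKilledError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Calls `op` until it succeeds or `maxAttempts` attempts have been made.
// Only the recoverable error is retried; any other exception propagates at once.
template <typename F>
auto retryOnQueryPlanKilled(F&& op, int maxAttempts) -> decltype(op()) {
    for (int attempt = 1;; ++attempt) {
        try {
            return op();
        } catch (const QueryPlanKilledError&) {
            if (attempt >= maxAttempts) {
                throw;  // Out of attempts: surface the error to the caller.
            }
        }
    }
}

// Hypothetical fake loader: the first `failuresLeft` calls fail as if the
// config.cache.xxxx collection had been dropped while the read was yielded.
struct FakeLoader {
    int failuresLeft;
    int calls = 0;

    int getChunksSince() {
        ++calls;
        if (failuresLeft > 0) {
            --failuresLeft;
            throw QueryPlanKilledError("collection dropped during yield");
        }
        return 3;  // Pretend three chunks were read.
    }
};
```

A test would wrap each call, e.g. retryOnQueryPlanKilled([&] { return loader.getChunksSince(); }, 5), so that a transient failure from the race does not fail the test, while a persistent failure still surfaces once the attempts run out.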



 Comments   
Comment by Githook User [ 21/Dec/23 ]

Author:

{'name': 'david-dominguez-sal', 'email': '97509688+david-dominguez-sal@users.noreply.github.com', 'username': 'david-dominguez-sal'}

Message: SERVER-83530: Fix shard_server_catalog_cache_loader_test. Retry on recoverable errors the usages of getChunksSince. (#17657)

GitOrigin-RevId: 6a84385455be57880336ab8c2329825b52c24a72
Branch: master
https://github.com/mongodb/mongo/commit/d1f998f81838f79006f5b8c59e3ba5ac5e6096d2

Generated at Thu Feb 08 06:52:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.