- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: None
- Catalog and Routing
- Fully Compatible
- CAR Team 2023-12-25
ShardServerCatalogCacheLoader::getChunksSince can throw StaleConfig under some interleavings between reading the cache and the background thread that persists the materialized cache. In practice, the CatalogCache handles this by retrying, so it causes no harm.
However, this race can cause failures in the shard_server_catalog_cache_loader_test unit test. We can address this by making the test expect this failure and retry. Alternatively, ShardServerCatalogCacheLoader could retry internally.
The interleaving that can cause this is:
1. The ShardServerCatalogCacheLoader (SSCCL) discovers the new epoch.
2. It schedules an asynchronous task to update the persisted metadata.
3. It then calls `_getLoaderMetadata`, which calls `getIncompletePersistedMetadataSinceVersion`, which calls `getPersistedMetadataSinceVersion`, which finally calls `readShardChunks`. `readShardChunks` reads from the config.cache.xxxx collection.
4. Concurrently with the read in (3), the task scheduled in (2) drops the config.cache.xxxx collection (because the epoch has changed).
5. The read started in (3) yields, and on restore it discovers that the collection no longer exists, so it fails with QueryPlanKilled.
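Both proposed fixes (retrying in the test, or retrying inside the loader) amount to a bounded retry around the failing call. The following is a minimal, self-contained sketch of that idea, not the actual server code: `TransientCacheError`, `retryOnTransientError`, and `FakeLoader` are hypothetical names introduced for illustration, and the fake loader only simulates a read that fails because the collection was dropped underneath it.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the transient failure described above.
// This is NOT the server's real exception type.
struct TransientCacheError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Calls fn() and retries while it throws TransientCacheError, making at most
// maxAttempts calls in total. The final failure is rethrown to the caller.
template <typename Fn>
auto retryOnTransientError(Fn&& fn, int maxAttempts) -> decltype(fn()) {
    assert(maxAttempts > 0);
    for (int attempt = 1;; ++attempt) {
        try {
            return fn();
        } catch (const TransientCacheError&) {
            if (attempt >= maxAttempts) {
                throw;  // Out of attempts: surface the error.
            }
            // Otherwise loop and try again.
        }
    }
}

// Hypothetical loader that fails a fixed number of times before succeeding,
// simulating a read that races with a concurrent drop of the collection.
struct FakeLoader {
    explicit FakeLoader(int failures) : failuresLeft(failures), calls(0) {}

    std::string getChunksSince() {
        ++calls;
        if (failuresLeft > 0) {
            --failuresLeft;
            throw TransientCacheError("collection dropped during read");
        }
        return "chunks";
    }

    int failuresLeft;
    int calls;
};
```

A test would wrap the call as `retryOnTransientError([&] { return loader.getChunksSince(); }, 5)`. Bounding the attempts keeps a genuine, persistent failure visible instead of hiding it behind an endless loop.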
- Related to: SERVER-86013 Fix retry for getChunksSince in shard_server_catalog_cache_loader_test (Closed)