SERVER-90768 introduced a test hook which checks the consistency between the output of listCollections, listIndexes and $listCatalog. For non-passthrough test suites, the hook runs whenever a mongod is shut down or restarted, using the following method:
- Connect to the mongod using a direct connection
- Step down the node
- Run the catalog consistency checker (calls listCollections, listIndexes, ...)
Even for sharded clusters with separate mongos/shardsvr/configsvr, this doesn't require accessing any remote metadata: Using a direct connection to a shardsvr's mongod allows operating on the node's local data, bypassing sharding (routing, shard version protocol, routing/filtering metadata refreshes, etc.).
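The kind of cross-check the hook performs can be sketched as follows. This is a minimal illustration rather than the hook's actual implementation: `CatalogSnapshot` and `find_inconsistencies` are hypothetical names, and the inputs are assumed to be already reduced to namespaces and index names instead of the raw command replies.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CatalogSnapshot:
    """Hypothetical, simplified view of one node's local catalog."""

    # Namespaces ("db.coll") reported by listCollections.
    list_collections: set[str]
    # Namespace -> index names reported by listIndexes.
    list_indexes: dict[str, set[str]]
    # Namespace -> index names reported by $listCatalog.
    list_catalog: dict[str, set[str]]


def find_inconsistencies(snap: CatalogSnapshot) -> list[str]:
    """Return human-readable descriptions of mismatches between the views."""
    problems: list[str] = []
    catalog_ns = set(snap.list_catalog)

    # Collections visible through one command but not the other.
    for ns in sorted(snap.list_collections - catalog_ns):
        problems.append(f"{ns}: in listCollections but not in $listCatalog")
    for ns in sorted(catalog_ns - snap.list_collections):
        problems.append(f"{ns}: in $listCatalog but not in listCollections")

    # For collections present in both, the index sets must agree.
    for ns in sorted(snap.list_collections & catalog_ns):
        from_indexes = snap.list_indexes.get(ns, set())
        from_catalog = snap.list_catalog[ns]
        if from_indexes != from_catalog:
            diff = sorted(from_indexes ^ from_catalog)
            problems.append(f"{ns}: index mismatch {diff}")

    return problems
```

On a node reached through a direct connection, all three inputs come from local data only, which is why the check is expected to succeed without any routing or filtering metadata.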
However, for single-shard clusters with replica set endpoint and config shard (e.g. the sharding_auto_boostrap test suite, all feature flags variant), this assumption doesn't hold. Most commands are forced to go through the sharding code paths, and when the cluster lacks a majority, it becomes unable to obtain routing/filtering metadata because it waits for replication on the ShardServerCatalogCacheLoader.
This wasn't initially found to be problematic because, by the time the majority was lost, tests had generally already loaded any required routing/filtering metadata in memory. However, this behavior is flaky and known to fail in various scenarios:
- Routing (e.g. for listIndexes) looks up the routing information for the associated buckets namespace. Without the separate catalog cache (SERVER-95393), fetching this information on a secondary needs to wait for replication on the ShardServerCatalogCacheLoader.
- With the separate catalog cache, if a request to a sharded collection is routed by a mongos, it will not be routed by the replica set endpoint, so mongods will not learn the routing metadata. When the hook later connects to the replica set endpoint, it will begin by sending the request as UNSHARDED, which will then trigger a refresh and a wait for replication when checking the shard version. (This currently works due to a workaround, which will be removed by SERVER-97511.)
Disable the catalog consistency checker for single-shard clusters with config shard and replica set endpoint. We can re-enable it once the filtering metadata refresh no longer needs to wait for replication, or by refactoring the hook to only run on replica sets with a majority.
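The proposed condition can be pictured as a small predicate. This is only a sketch under assumptions: `ClusterTopology` and `should_run_catalog_checker` are hypothetical names for illustration, not identifiers from the test framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterTopology:
    """Hypothetical description of the cluster a test fixture started."""

    num_shards: int
    has_config_shard: bool
    has_replica_set_endpoint: bool


def should_run_catalog_checker(topology: ClusterTopology) -> bool:
    """Skip the checker only for the topology described in this ticket."""
    affected = (
        topology.num_shards == 1
        and topology.has_config_shard
        and topology.has_replica_set_endpoint
    )
    return not affected
```

All other topologies keep running the hook, so coverage is lost only where a direct connection is still forced through the sharding code paths.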
- is depended on by: SERVER-97511 flushRoutingTableCacheUpdates and flushDatabaseCacheUpdates commands should not refresh the routing information (In Progress)
- is related to: SERVER-98707 Add back test coverage disabled due to replica set endpoint being unavailable for reads when lacking a majority (Backlog)