On StaleConfig errors, a secondary node triggers a refresh on the primary, which sets a "refreshing" flag in the config.cache.collections entry for the namespace being refreshed and then unsets it when the refresh completes. Before reading from its replicated metadata cache collections after triggering the refresh, the secondary will check if the "refreshing" flag is set to true, and if so, wait on a notification that is triggered by an op observer when an update to the cache entry for the refreshing namespace is replicated.
The op observer will only trigger the notification if the replicated update used a $set that both contains the "lastRefreshedCollectionVersion" field and set "refreshing" to false. The collection version is stored on disk without an epoch, so if a refresh discovers a collection's epoch changed, but not the timestamp component of its collection version (which can happen after a collection's shard key is refined), then the replicated update from the primary will omit "lastRefreshedCollectionVersion" and the secondary's op observer will not notify the refreshing thread, causing it to hang.
A possible fix is to change the op observer to only check that the "refreshing" field was set to false.