Consider a shard node that has just started and/or become primary and has no sharding metadata cached.
If many threads running sharded operations (i.e., operations carrying a non-UNSHARDED version) arrive at the same time, every one of them will get a StaleConfigException and enter the refresh code here. Only one of these threads will actually refresh from the config server, but all of them will eventually call this line. That call does nothing if the metadata is already fresh, yet each thread still acquires the collection X-lock, which causes stalls on an already overloaded server.
In addition, all threads will redundantly process the new metadata.
A complete fix would be to serialize collection refreshes on the shard, outside of the synchronization that already happens through the catalog cache.
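As a rough illustration of what serializing refreshes on the shard could look like, here is a minimal sketch. It is not the server's code: `SerializedRefresher`, `refreshIfStale`, and the `fetchLatest` callback are hypothetical names, and the version is reduced to a plain integer for brevity. Each thread passes the version it saw when it was told it was stale; only the first thread through the mutex processes and installs the new metadata, and later threads return early.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical helper, not a real server class: serializes metadata
// installs for one collection on the shard.
class SerializedRefresher {
public:
    // staleVersion: the version the calling thread observed when it was told it was stale.
    // fetchLatest:  assumed callback that returns the latest version known to the catalog cache.
    long refreshIfStale(long staleVersion, const std::function<long()>& fetchLatest) {
        std::lock_guard<std::mutex> lk(_mutex);
        // Another thread already installed something newer than what this
        // caller saw, so there is nothing left to process.
        if (_installedVersion > staleVersion) {
            return _installedVersion;
        }
        _installedVersion = fetchLatest();
        ++_installs;
        return _installedVersion;
    }

    // Number of times new metadata was actually processed and installed.
    int installs() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _installs;
    }

private:
    mutable std::mutex _mutex;
    long _installedVersion = 0;  // 0 stands for "no metadata cached" in this sketch
    int _installs = 0;
};
```

With this shape, threads that all observed the same stale version call `refreshIfStale` with that version, and only the first one invokes `fetchLatest`.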
A quick fix for the MODE_X aspect would be to add a check (under the collection IS lock) just before the X-lock is acquired: re-check whether the version obtained from the CatalogCache differs from what is already installed, and skip acquiring the X-lock if it does not.
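The quick check described above could look roughly like the following sketch. It is an illustration under stated assumptions, not the server's implementation: `CollectionState` and `installIfChanged` are hypothetical, a `std::shared_mutex` stands in for the collection lock (shared mode playing the role of IS, exclusive mode the role of X), and the version is simplified to a plain integer.

```cpp
#include <cassert>
#include <mutex>
#include <shared_mutex>

// Hypothetical stand-in for the per-collection sharding state on the shard.
struct CollectionState {
    std::shared_mutex lock;         // shared ~ IS lock, exclusive ~ X lock (assumption)
    long installedVersion = 0;      // version of the metadata currently installed
    int exclusiveAcquisitions = 0;  // how many times the exclusive lock was taken
};

// Installs refreshedVersion only if it differs from what is installed.
// Returns true if the exclusive lock had to be acquired.
bool installIfChanged(CollectionState& coll, long refreshedVersion) {
    {
        // Cheap pre-check under the shared lock: most threads stop here.
        std::shared_lock<std::shared_mutex> shared(coll.lock);
        if (coll.installedVersion == refreshedVersion) {
            return false;
        }
    }
    std::unique_lock<std::shared_mutex> exclusive(coll.lock);
    ++coll.exclusiveAcquisitions;
    // Re-check under the exclusive lock, since another thread may have
    // installed the same version between the two lock acquisitions.
    if (coll.installedVersion != refreshedVersion) {
        coll.installedVersion = refreshedVersion;
    }
    return true;
}
```

A few threads can still race past the pre-check and take the exclusive lock, so this reduces contention on the X-lock rather than eliminating it.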
- related to SERVER-31595 Generate shardMaps outside MODE_X collection lock