The CatalogCache refresh methods contain logic that simply 'joins' an existing refresh if another thread is already refreshing the same collection. As a result, a refresh requested while an earlier refresh (started at Tn-1) is still in progress will miss the changes that happened at Tn.
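The join behavior can be illustrated with a small, self-contained simulation. This is only a sketch of the general pattern, not the actual CatalogCache code: `JoiningCache`, `slow_read`, `simulate`, and the starting version `(1, 0)` are all hypothetical names and values chosen for illustration. Only the `(0, 0)` version after the drop comes from the report.

```python
import threading


class JoiningCache:
    """Hypothetical cache whose refresh() joins an in-flight refresh."""

    def __init__(self, read_source):
        self._read_source = read_source  # returns the authoritative version
        self._lock = threading.Lock()
        self._in_flight = None           # (done_event, result_box) or None
        self.joined = threading.Event()  # set once some caller has joined

    def refresh(self):
        with self._lock:
            leader = self._in_flight is None
            if leader:
                self._in_flight = (threading.Event(), {})
            else:
                self.joined.set()
            done, box = self._in_flight
        if leader:
            # The leader performs the actual read and publishes the result.
            box["version"] = self._read_source()
            with self._lock:
                self._in_flight = None
            done.set()
        else:
            # A joiner waits for the leader and reuses its (possibly stale) result.
            done.wait()
        return box["version"]


def simulate():
    state = {"version": (1, 0)}          # assumed pre-drop version (hypothetical)
    snapshot_taken = threading.Event()
    release = threading.Event()

    def slow_read():
        seen = state["version"]          # snapshot taken at Tn-1
        snapshot_taken.set()
        release.wait(timeout=5)          # the refresh stays in progress
        return seen

    cache = JoiningCache(slow_read)
    results = {}

    def call(name):
        results[name] = cache.refresh()

    first = threading.Thread(target=call, args=("migration",))
    first.start()
    snapshot_taken.wait(timeout=5)

    state["version"] = (0, 0)            # the collection is dropped at Tn

    second = threading.Thread(target=call, args=("drop",))
    second.start()
    cache.joined.wait(timeout=5)         # the second caller has joined the first
    release.set()
    first.join(timeout=5)
    second.join(timeout=5)
    return results["drop"], state["version"]


if __name__ == "__main__":
    print(simulate())
```

In this sketch the refresh requested after the drop returns the version read before the drop, even though the authoritative state has already changed to `(0, 0)`.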
Here's a concrete example based on the build failure:
1. The moveChunk command is called.
2. At the end of the migration, the source shardA sends a setShardVersion command to the recipient shardB asynchronously.
3. This setShardVersion command triggers a CatalogCache refresh on shardB.
4. Since step 3 happens asynchronously, the test can proceed to dropping the collection.
5. The final step of the drop is to send setShardVersion (0, 0) to the shards.
6. When that setShardVersion arrives at shardB while the refresh from step 3 is still in progress, it simply joins that refresh and gets stale information in which the collection has not been dropped yet.
7. The setShardVersion sent by the drop fails because version (0, 0) does not match the version found by the refresh.
After `test.foo` is dropped, the drop is not reflected in the CatalogCache. After refreshing, the shard logs that it refreshed from the old version, and the test fails because the setShardVersion sent during the drop carries the correct version (version 0|0||000000000000000000000000), but the shard finds the stale, incorrect version.
Attached is the test to reproduce the error as well as the logs.