On step-up, shards clear the filtering metadata and schedule a shard version refresh for both the source collection and the temporary resharding collection. The refresh triggered through onShardVersionMismatch(..., boost::none /* shardVersionReceived */) can fail with an error and never complete. When that happens, a recipient shard can be left waiting to learn that all donor shards are prepared to donate, or a donor shard can be left waiting to learn that all recipient shards have finished cloning.
Shards must retry onShardVersionMismatch() until it succeeds, so that the DonorStateMachines and RecipientStateMachines are guaranteed to make forward progress. Wrapping the call in an AsyncTry is the likely implementation.
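The retry-until-success behavior described above can be sketched in standalone C++. This is only an illustration under stated assumptions: it does not use MongoDB's AsyncTry or Status types, it runs synchronously rather than on an executor, and the names RefreshStatus, retryUntilSuccess, and shouldStop are hypothetical. The attempt callback stands in for the call to onShardVersionMismatch(), and shouldStop stands in for whatever cancellation signal (for example, a later step-down) the real implementation would observe.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical stand-in for a status result; not MongoDB's Status class.
struct RefreshStatus {
    bool ok;
    std::string reason;
};

// Calls `attempt` until it reports success, sleeping between failed attempts
// with capped exponential backoff. Returns the number of attempts made, or -1
// if `shouldStop` returned true before an attempt was started.
int retryUntilSuccess(const std::function<RefreshStatus()>& attempt,
                      const std::function<bool()>& shouldStop,
                      std::chrono::milliseconds initialBackoff,
                      std::chrono::milliseconds maxBackoff) {
    assert(initialBackoff.count() >= 0 && maxBackoff >= initialBackoff);
    std::chrono::milliseconds backoff = initialBackoff;
    for (int attempts = 1;; ++attempts) {
        if (shouldStop()) {
            return -1;  // Caller asked to abandon the retry loop.
        }
        RefreshStatus status = attempt();
        if (status.ok) {
            return attempts;
        }
        // Any failure is retried; wait before the next attempt.
        std::this_thread::sleep_for(backoff);
        backoff = std::min<std::chrono::milliseconds>(backoff * 2, maxBackoff);
    }
}
```

In this sketch every error is retried. An actual fix might instead retry only on specific errors, such as the InterruptedDueToReplStateChange error seen in the log, and stop when the node is no longer primary.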
[js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:21.280+00:00 I SHARDING 22720 [ShardServerCatalogCacheLoader::getChunksSince] "Command failed with a retryable error and will be retried","attr":{"command":{"_flushRoutingTableCacheUpdates":"reshardingDb.system.resharding.f187cc87-e5c0-48b8-8908-2a539ad6d709"},"error":"InterruptedDueToReplStateChange: operation was interrupted"}
[js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:21.280+00:00 I SH_REFR 4619903 [CatalogCache-0] "Error refreshing cached collection","attr":{"namespace":"reshardingDb.system.resharding.f187cc87-e5c0-48b8-8908-2a539ad6d709","durationMillis":336,"error":"InterruptedDueToReplStateChange: operation was interrupted"}
...
[js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:21.287+00:00 W RESHARD 5498101 [TriggerReshardingRecovery] "Error on deferred shardVersion recovery execution","attr":{"error":"InterruptedDueToReplStateChange: operation was interrupted"}
...
[js_test:resharding_secondary_recovers_temp_ns_metadata] d21274| 2021-12-05T15:08:23.137+00:00 I REPL 5123005 [ReshardingRecipientService-2] "Rebuilding PrimaryOnlyService due to stepUp","attr":{"service":"ReshardingRecipientService"}