Currently, when certain errors occur during a migration (or migration recovery), the donor shard clears its filtering metadata so that the migration is recovered the next time a query attempts to use that collection. Some code paths trigger a best-effort recovery, while others do not, and even a best-effort attempt can fail. This behavior is correct, but with the new migration protocol (where the recipient takes the critical section) it may cause long periods during which the recipient holds both the critical section (making the collection unavailable) and the ActiveMigrationRegistry (preventing the recipient shard from donating or receiving chunks of any other collection).
This ticket is to evaluate ensuring that migration recovery is retried until it succeeds.
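
As a rough illustration of what "retried until success" could look like, here is a minimal Python sketch of a retry loop with capped exponential backoff. It is not the server's implementation: `retry_until_success`, `recover`, and the delay parameters are hypothetical names chosen for this example, and the sketch assumes every failure is retriable. A real implementation would presumably also need to stop on events such as stepdown or shutdown.

```python
import time
from typing import Callable


def retry_until_success(
    recover: Callable[[], None],
    initial_delay_s: float = 0.1,
    max_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `recover` until it returns without raising.

    Returns the number of attempts made. `recover` is a hypothetical
    stand-in for the migration recovery routine.
    """
    attempts = 0
    delay = initial_delay_s
    while True:
        attempts += 1
        try:
            recover()
            return attempts
        except Exception:
            # Assumption: every error is retriable. Wait, then double the
            # delay up to the cap so repeated failures do not busy-loop.
            sleep(delay)
            delay = min(delay * 2, max_delay_s)
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting. The point of the sketch is only that the critical section and the ActiveMigrationRegistry would be released as soon as one attempt succeeds, rather than waiting for the next query to trigger recovery.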