There is a detailed description in the last comment of the linked BF.
When we shut down the shard registry, we shut down and join the thread pool it uses. Shutting down the thread pool only changes its state to joinRequired. This stops any new tasks from being scheduled, but the join still has to wait for all ongoing tasks to complete, without interrupting them. In the case shown in the BF, one of the outgoing requests never finished, which caused the join to stall.
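The stall described above can be reproduced in miniature. The following is only an illustrative sketch using Python's standard `ThreadPoolExecutor`, not the server's thread pool: `demo_shutdown_stall`, `lookup`, and `never_replies` are hypothetical names, and `shutdown(wait=False)` is assumed to be a rough analogue of moving the pool to joinRequired.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def demo_shutdown_stall(join_wait_secs: float = 0.2) -> dict:
    # Stands in for a remote response that never arrives (hypothetical).
    never_replies = threading.Event()
    started = threading.Event()

    def lookup():
        started.set()
        never_replies.wait()  # blocks like the stuck outgoing request
        return "done"

    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(lookup)
    started.wait()  # make sure the task is really in flight

    # Rough analogue of switching to joinRequired: no new tasks are accepted,
    # but the running task is not interrupted.
    pool.shutdown(wait=False)
    try:
        pool.submit(lookup)
        rejected = False
    except RuntimeError:
        rejected = True

    # Join on a helper thread so the sketch can observe the stall.
    joiner = threading.Thread(target=pool.shutdown, kwargs={"wait": True})
    joiner.start()
    joiner.join(timeout=join_wait_secs)
    stalled = joiner.is_alive()

    # Unblock the task so the sketch itself can exit.
    never_replies.set()
    joiner.join()
    return {
        "rejected_new_task": rejected,
        "join_stalled": stalled,
        "result": future.result(),
    }


if __name__ == "__main__":
    print(demo_shutdown_stall())
```

In this sketch the join only returns once the blocked task is released by hand, which is the step that never happens in the BF.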
Some options to fix this:
- Shut down the shard registry after we interrupt ongoing operations, rather than before. This way we would know that there are no ongoing operations left to wait for when we call join. I am not sure what other implications moving this shutdown may have, though.
- Interrupt ongoing lookups during shard registry shutdown. This would likely imply keeping cancellation tokens so that operations can be interrupted during shutdown.
- Add a timeout to the network calls done by the shard registry. This solution, though, could also cause shard registry refreshes to fail at times other than shutdown.
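As a rough illustration of the second option, here is a minimal sketch in Python, assuming a hypothetical `CancellationToken` and `Registry` (neither is the real shard registry API). Each lookup waits for either its reply or the token, and shutdown cancels the token before joining, so the join has nothing left to wait on.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


class CancellationToken:
    """Hypothetical token shared by all in-flight lookups."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def cancel(self) -> None:
        self._event.set()

    def wait(self, timeout: float) -> bool:
        # Returns True once the token has been cancelled.
        return self._event.wait(timeout)


class Registry:
    """Hypothetical stand-in for the registry and its thread pool."""

    def __init__(self) -> None:
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._token = CancellationToken()

    def lookup_async(self, reply: threading.Event) -> Future:
        def task() -> str:
            # Wait for the reply, but wake up periodically to check
            # whether shutdown has cancelled outstanding work.
            while not reply.is_set():
                if self._token.wait(0.01):
                    return "cancelled"
            return "ok"

        return self._pool.submit(task)

    def shutdown(self) -> None:
        self._token.cancel()            # interrupt ongoing lookups first
        self._pool.shutdown(wait=True)  # the join can now complete


if __name__ == "__main__":
    registry = Registry()
    stuck = registry.lookup_async(threading.Event())  # reply never arrives
    registry.shutdown()
    print(stuck.result())
```

The polling loop is a simplification chosen to keep the sketch short; a real implementation would more likely have the network layer itself observe the token.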