It's possible that:
- The last shard sends to the coordinator a write indicating that it has finished the resharding operation.
- The write gets lost due to a network error.
- The shard sends the write a second time, which the coordinator applies, finishing the resharding operation.
- The first write from the shard finally gets sent, triggering the invariant that the resharding instance for that UUID still exists.
Max Hirschhorn proposed two solutions:
- Have the donors and recipients use retryable writes so that the shard doesn't attempt the write again if it's already been applied, and
- Have the donors and recipients use a precondition that will make the write a no-op if the write has already been applied.