- A config server calls shardCollection.
- The shard begins shardCollection.
- The shard writes chunks and the metadata entry for the collection.
- The config server steps down, cancelling its shardCollection command.
- A new config server steps up.
- The new config server retries the shardCollection command.
- The new config server sees that the metadata entry for the collection has been written and erroneously assumes that the existence of the entry means the shard has finished its shardCollection command. It therefore returns early and releases the distributed lock, which allows chunk migrations and splits to proceed.
- A subsequent moveChunk operation can then acquire the collection distributed lock and attempt to enter the critical section while the shard's shardCollection is still running, which currently crashes the server.
A config server should not be able to return early while the shard's shardCollection command is still incomplete.
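
The sequence above can be illustrated with a toy model. This is only a sketch to make the race easier to see, not the server's code: the `Shard` and `ConfigServer` classes, their fields, and the `retry_buggy`/`retry_fixed` methods are all hypothetical names, and the "fixed" check is just one possible guard, assuming the shard exposes some explicit completion signal.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    """Hypothetical stand-in for the shard running shardCollection."""
    metadata_written: bool = False
    finished: bool = False

    def write_chunks_and_metadata(self) -> None:
        # The metadata entry exists from here on, but the command is still running.
        self.metadata_written = True


@dataclass
class ConfigServer:
    """Hypothetical stand-in for a config server that holds the dist lock."""
    dist_lock_held: bool = True

    def retry_buggy(self, shard: Shard) -> bool:
        # Buggy check: treats "metadata entry exists" as "shard is done".
        if shard.metadata_written:
            self.dist_lock_held = False  # lock released too early
            return True
        return False

    def retry_fixed(self, shard: Shard) -> bool:
        # Assumed guard: return early only once the shard reports completion.
        if shard.metadata_written and shard.finished:
            self.dist_lock_held = False
            return True
        return False


def lock_released_while_shard_running(retry_name: str) -> bool:
    """Replay the steps and report whether the lock was dropped mid-command."""
    shard = Shard()
    shard.write_chunks_and_metadata()        # shard has written metadata
    new_config_server = ConfigServer()       # old one stepped down, new one stepped up
    getattr(new_config_server, retry_name)(shard)  # new one retries shardCollection
    return (not new_config_server.dist_lock_held) and (not shard.finished)
```

In this model, `lock_released_while_shard_running("retry_buggy")` reports that the lock was dropped while the shard was still working, which is the window in which a moveChunk could get in; the `retry_fixed` variant keeps the lock held.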