createReshardingStateMachine is the function in charge of:
1. Writing the state machine document on disk
2. Creating the related POS instance
Unfortunately, this function is not idempotent. If the opCtx is interrupted during the first execution, after the document has been written (1.) but before the POS instance has been created (2.), then every subsequent execution attempts (1.) again, fails with a DuplicateKey error, and never executes (2.). In this scenario the state machine document is written on disk, but the POS instance for the recipient/donor is never installed or executed, leaving the resharding operation in a "hung" state.
createReshardingStateMachine is called as part of the shard version recovery procedure. The operation context of this procedure is interrupted every time a thread enters the collection critical section, for instance as part of a chunk migration.
One possible solution would be to attempt the creation of the POS instance even when the insertion fails with a DuplicateKey error.
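The failure and the proposed fix can be illustrated with a small self-contained model. This is only a sketch under stated assumptions, not the server code: Shard, insertDocument, checkForInterrupt, getOrCreateInstance, and both create* functions are hypothetical stand-ins, and the DuplicateKey and Interrupted exception types are simplified placeholders for the real error codes.

```cpp
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the real error codes; not the server's classes.
struct DuplicateKey : std::runtime_error {
    DuplicateKey() : std::runtime_error("DuplicateKey") {}
};
struct Interrupted : std::runtime_error {
    Interrupted() : std::runtime_error("Interrupted") {}
};

// Toy model of one shard: persisted documents plus in-memory POS instances.
struct Shard {
    std::set<std::string> documents;    // state machine documents on disk
    std::set<std::string> instances;    // installed POS instances
    bool interruptAfterInsert = false;  // simulates the opCtx being interrupted

    void insertDocument(const std::string& id) {
        if (!documents.insert(id).second) throw DuplicateKey();
    }
    void checkForInterrupt() {
        if (interruptAfterInsert) throw Interrupted();
    }
    void getOrCreateInstance(const std::string& id) {
        instances.insert(id);  // no-op if the instance is already installed
    }
};

// Reported behaviour: DuplicateKey from step (1.) escapes, so step (2.) is skipped.
void createNonIdempotent(Shard& shard, const std::string& id) {
    shard.insertDocument(id);       // (1.)
    shard.checkForInterrupt();
    shard.getOrCreateInstance(id);  // (2.)
}

// Proposed behaviour: tolerate DuplicateKey and still run step (2.).
void createIdempotent(Shard& shard, const std::string& id) {
    try {
        shard.insertDocument(id);   // (1.)
    } catch (const DuplicateKey&) {
        // The document was already written by an earlier, interrupted attempt.
    }
    shard.checkForInterrupt();
    shard.getOrCreateInstance(id);  // (2.)
}

// Interrupts the first attempt between (1.) and (2.), retries once, and
// reports whether the POS instance exists afterwards.
template <typename CreateFn>
bool instanceInstalledAfterRetry(CreateFn create) {
    Shard shard;
    shard.interruptAfterInsert = true;
    try { create(shard, "reshard-1"); } catch (const Interrupted&) {}
    shard.interruptAfterInsert = false;
    try { create(shard, "reshard-1"); } catch (const DuplicateKey&) {}
    return shard.instances.count("reshard-1") == 1;
}
```

In this model, instanceInstalledAfterRetry(createNonIdempotent) returns false, because the retry stops at the DuplicateKey error, while instanceInstalledAfterRetry(createIdempotent) returns true, because the retry proceeds to step (2.).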