Each DDL coordinator calls _insertStateDocument to initially checkpoint the received operation on disk. Because the write uses a write concern with a timeout, the following sequence can occur:
- DDL coordinator starts and calls _insertStateDocument
- The document is locally written but not yet majority committed
- The write concern timeout is hit
- The coordinator retries
- The retry fails with a DuplicateKey error because the document was already inserted by the first attempt
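The sequence above can be reproduced with a minimal, self-contained simulation. This is only a sketch: `FakeCollection`, `run_coordinator`, the exception classes, and the document id are hypothetical stand-ins, not the server's actual code. It assumes the local write is applied before the majority wait times out.

```python
class WriteConcernTimeout(Exception):
    """Local write succeeded, but majority acknowledgement did not arrive in time."""


class DuplicateKey(Exception):
    """A document with the same _id already exists."""


class FakeCollection:
    """Hypothetical stand-in for the coordinator state collection."""

    def __init__(self):
        self.docs = {}

    def insert(self, doc, majority_ack_in_time):
        if doc["_id"] in self.docs:
            raise DuplicateKey(doc["_id"])
        # The local write is applied first ...
        self.docs[doc["_id"]] = doc
        # ... and only then can the wait for majority commit time out.
        if not majority_ack_in_time:
            raise WriteConcernTimeout(doc["_id"])


def run_coordinator(coll, doc):
    """Insert the state document, retrying once after a write concern timeout."""
    events = []
    try:
        coll.insert(doc, majority_ack_in_time=False)
    except WriteConcernTimeout:
        events.append("timeout")
        try:
            # The retry sees the document left behind by the first attempt.
            coll.insert(doc, majority_ack_in_time=True)
        except DuplicateKey:
            events.append("duplicate_key")
    # Assumption: the exception releases the in-memory coordinator instance.
    instance_alive = "duplicate_key" not in events
    return events, instance_alive


coll = FakeCollection()
events, alive = run_coordinator(coll, {"_id": "renameCollection:db.coll"})
print(events, alive, list(coll.docs))
```

In this model the state document stays in the collection while the in-memory instance is gone.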
In some cases, such as renameCollection, the result is that the DDL coordinator document remains on disk but the in-memory instance is released because of the exception. When this happens, the coordinator can only be resumed in one of two ways: the user invokes the operation again, or a new node steps up on the source database's primary shard.
[EDIT] The rename participant can hit the same problem, since it implements the same logic.
- is duplicated by
SERVER-65340 Operations hang when re-using dropped unsharded collections
- is related to
SERVER-66336 ConfigsvrCoordinators initial checkpoint may incur in DuplicateKey error