|
The chunk migration commit procedure can trigger an fassert on the donor shard if the CommitChunkMigration command to the config server fails for any reason and the donor cannot perform a follow-up write to the config server to obtain the latest optime. The donor shard needs the latest optime because, when the commit result is unknown, it clears its chunk metadata to force a total refresh the next time metadata is needed, and that refresh is only assured of acquiring the latest chunk metadata if the donor has the latest optime. If the donor shard did not see the latest chunk metadata, routing guarantees would break, because the donor would allow reads of data that may already have changed on the migration recipient shard.
Rather than fasserting in the commit, a flag should be set on the collection chunk metadata that causes the next refresh to first get the latest optime from the config server. Acquiring the latest optime from the config server requires a write operation, which means there must be a config primary at that time. Refresh, in comparison, is a read on the config server and does not require a primary. If the latest optime cannot be acquired, the flag remains set and the command that needed the collection chunk metadata refresh fails.
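To make the proposal concrete, here is a rough sketch of the flag-driven refresh in Python. It is only an illustration of the control flow, not server code: `FakeConfigServer`, `CollectionChunkMetadata`, `needs_latest_optime`, and every method name are hypothetical stand-ins, and the optime is modeled as a plain integer.

```python
class ConfigUnavailable(Exception):
    """Raised when the (fake) config server cannot accept a write."""


class FakeConfigServer:
    """Hypothetical stand-in for the config server."""

    def __init__(self):
        self.has_primary = True
        self.optime = 0
        self.chunk_version = 1

    def write_and_get_optime(self):
        # A write needs a config primary; a read does not.
        if not self.has_primary:
            raise ConfigUnavailable("no config primary")
        self.optime += 1
        return self.optime

    def read_chunk_metadata(self, after_optime):
        # Reads succeed without a primary in this model.
        return {"version": self.chunk_version, "read_after": after_optime}


class CollectionChunkMetadata:
    """Hypothetical donor-side state for one collection."""

    def __init__(self, config):
        self.config = config
        self.last_optime = 0
        self.needs_latest_optime = False  # the proposed flag
        self.metadata = None

    def on_commit_result_unknown(self):
        # Instead of fasserting: clear the metadata and set the flag.
        self.metadata = None
        self.needs_latest_optime = True

    def refresh(self):
        if self.needs_latest_optime:
            # May raise; the flag then stays set and the caller fails.
            self.last_optime = self.config.write_and_get_optime()
            self.needs_latest_optime = False
        self.metadata = self.config.read_chunk_metadata(self.last_optime)
        return self.metadata
```

In this sketch, a refresh attempted while there is no config primary raises and leaves the flag set, so a later refresh retries the optime acquisition before reading any metadata.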
The flag may need to be persisted in case the server crashes and restarts. I’m presuming the lastOpTime is persisted somewhere to be safe from crashes as well; otherwise, we’d be running blind on refreshes right now?
This would be logically cleaner, as the latest optime would be acquired immediately before the action that needs it, rather than, confusingly, in the chunk commit procedure, far from the reason it is needed.
|