The primary shard for the database running the _shardsvrReshardCollection command re-sends _configsvrReshardCollection to the config server primary after a retryable error. The new invocation of _configsvrReshardCollection joins the existing ReshardingCoordinator instance rather than constructing a new one. In that case, however, setAlwaysInterruptAtStepDownOrUp() is never called on the OperationContext for the _configsvrReshardCollection command. Neither the future that signals the coordinator document has been written nor the future that signals completion of the resharding operation is guaranteed to become ready with an error on stepdown or shutdown. As a result, the _configsvrReshardCollection command can continue running on the config server node after that node has stepped down.
We should call setAlwaysInterruptAtStepDownOrUp() before waiting on these futures. Then, if the config server primary steps down, the wait is interrupted and the primary shard for the database running the _shardsvrReshardCollection command re-sends _configsvrReshardCollection to the new config server primary.
if (auto existingInstance = getExistingInstanceToJoin(opCtx, nss, request().getKey())) {
    // Join the existing resharding operation to prevent generating a new resharding
    // instance if the same command is issued consecutively due to client disconnect
    // etc.
    reshardCollectionJoinedExistingOperation.pauseWhileSet(opCtx);
    existingInstance.get()->getCoordinatorDocWrittenFuture().get(opCtx);
    return existingInstance;
}
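The difference the proposed call makes can be illustrated with a small, self-contained toy model. This is only a sketch, not the server's implementation: ToyOperationContext, ToyFuture, waitInterruptibly, and runJoinScenario are hypothetical names invented for illustration, and the polling loop is an assumption standing in for the real interruptible wait on the OperationContext.

```cpp
#include <atomic>

// Hypothetical stand-in for an operation context (NOT MongoDB's OperationContext).
// It only tracks whether the operation opted in to interruption and whether a
// stepdown has been observed.
class ToyOperationContext {
public:
    void setAlwaysInterruptAtStepDownOrUp() { _interruptOnStepDown.store(true); }
    void notifyStepDown() { _steppedDown.store(true); }

    // The wait is interrupted only if the operation opted in AND a stepdown occurred.
    bool shouldInterrupt() const {
        return _interruptOnStepDown.load() && _steppedDown.load();
    }

private:
    std::atomic<bool> _interruptOnStepDown{false};
    std::atomic<bool> _steppedDown{false};
};

// Hypothetical future that may never become ready, like the coordinator futures
// after a stepdown.
struct ToyFuture {
    std::atomic<bool> ready{false};
};

enum class WaitResult { kReady, kInterrupted, kStillWaiting };

// Polls the future a bounded number of times so the demo always terminates.
// kStillWaiting models a command that would keep running after the stepdown.
WaitResult waitInterruptibly(const ToyOperationContext& opCtx,
                             const ToyFuture& future,
                             int maxPolls) {
    for (int i = 0; i < maxPolls; ++i) {
        if (opCtx.shouldInterrupt()) {
            return WaitResult::kInterrupted;
        }
        if (future.ready.load()) {
            return WaitResult::kReady;
        }
    }
    return WaitResult::kStillWaiting;
}

// Models a command joining an existing operation, followed by a stepdown.
WaitResult runJoinScenario(bool markInterruptible) {
    ToyOperationContext opCtx;
    if (markInterruptible) {
        // The proposed fix: opt in before waiting on the future.
        opCtx.setAlwaysInterruptAtStepDownOrUp();
    }
    ToyFuture coordinatorDocWritten;  // never becomes ready in this scenario
    opCtx.notifyStepDown();
    return waitInterruptibly(opCtx, coordinatorDocWritten, 1000);
}
```

In this model, runJoinScenario(false) keeps waiting after the stepdown, which mirrors the reported behavior, while runJoinScenario(true) stops waiting, which would let the caller retry against the new primary.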
is depended on by: SERVER-57686 We need test coverage that runs resharding in the face of elections (Closed)