[SERVER-60858] _configsvrReshardCollection command which joins existing ReshardingCoordinator may miss being interrupted on stepdown Created: 20/Oct/21  Updated: 29/Oct/23  Resolved: 21/Oct/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0, 5.1.0-rc0
Fix Version/s: 5.2.0, 5.0.4, 5.1.0-rc2

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-57686 We need test coverage that runs resha... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.1, v5.0
Sprint: Sharding 2021-11-01
Participants:
Story Points: 1

 Description   

The primary shard for the database running the _shardsvrReshardCollection command re-sends the _configsvrReshardCollection command to the config server primary following a retryable error. The new invocation of _configsvrReshardCollection joins the existing ReshardingCoordinator instance rather than constructing a new one. However, in this case setAlwaysInterruptAtStepDownOrUp() won't have been called on the OperationContext for the _configsvrReshardCollection command. Neither the future for the coordinator document having been written nor the future for the resharding operation's completion is guaranteed to become ready with an error on stepdown or shutdown. As a result, the _configsvrReshardCollection command can continue running on the config server node after it has stepped down.

We should call setAlwaysInterruptAtStepDownOrUp() before waiting on these futures so that, if the config server primary steps down, the wait is interrupted and the primary shard for the database running the _shardsvrReshardCollection command re-sends the _configsvrReshardCollection command to the new config server primary.

if (auto existingInstance =
        getExistingInstanceToJoin(opCtx, nss, request().getKey())) {
    // Join the existing resharding operation to prevent generating a new resharding
    // instance if the same command is issued consecutively due to client disconnect
    // etc.
    reshardCollectionJoinedExistingOperation.pauseWhileSet(opCtx);
    existingInstance.get()->getCoordinatorDocWrittenFuture().get(opCtx);
    return existingInstance;
}
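
The effect of marking the operation before waiting can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the server code or the committed fix: ToyOperationContext, WaitResult, onStepDown(), waitFor(), and joinExistingOperation() are hypothetical names invented for illustration, and the model assumes that a stepdown kills only operations that have opted in. The real OperationContext is more involved.

```cpp
#include <atomic>
#include <chrono>
#include <future>

// Outcome of waiting on a future in this toy model.
enum class WaitResult { kReady, kInterrupted, kTimedOut };

// Hypothetical stand-in for an operation context. It is NOT MongoDB's
// OperationContext; it models only the opt-in interruption flag.
class ToyOperationContext {
public:
    // Opt in to being killed when the node changes replica set state.
    void setAlwaysInterruptAtStepDownOrUp() {
        _alwaysInterrupt.store(true);
    }

    // Simulates a stepdown. Assumption of this model: only operations
    // that opted in are marked as killed.
    void onStepDown() {
        if (_alwaysInterrupt.load()) {
            _killed.store(true);
        }
    }

    // Polls the future until it is ready, the operation is killed, or the
    // timeout expires. The timeout stands in for "waits indefinitely".
    WaitResult waitFor(const std::shared_future<void>& fut,
                       std::chrono::milliseconds timeout) const {
        const auto deadline = std::chrono::steady_clock::now() + timeout;
        while (std::chrono::steady_clock::now() < deadline) {
            if (_killed.load()) {
                return WaitResult::kInterrupted;
            }
            if (fut.wait_for(std::chrono::milliseconds(1)) ==
                std::future_status::ready) {
                return WaitResult::kReady;
            }
        }
        return WaitResult::kTimedOut;
    }

private:
    std::atomic<bool> _alwaysInterrupt{false};
    std::atomic<bool> _killed{false};
};

// Models the "join an existing operation" path: optionally opt in to
// interruption, then wait on a future that stepdown does not fulfill.
WaitResult joinExistingOperation(ToyOperationContext& opCtx,
                                 const std::shared_future<void>& docWritten,
                                 bool markInterruptible) {
    if (markInterruptible) {
        opCtx.setAlwaysInterruptAtStepDownOrUp();
    }
    opCtx.onStepDown();  // the stepdown happens while the command is joined
    return opCtx.waitFor(docWritten, std::chrono::milliseconds(50));
}
```

In this model, joining without opting in runs until the timeout even though the stepdown has already happened, which corresponds to the command lingering on the stepped-down node. Opting in first makes the same wait return as interrupted, which is what allows the caller to retry against the new primary.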



 Comments   
Comment by Githook User [ 21/Oct/21 ]

Author: Max Hirschhorn <max.hirschhorn@mongodb.com> (username: visemet)

Message: SERVER-60858 Interrupt wait on joined ReshardingCoordinator on stepdown.

(cherry picked from commit 79cbc3d7168a4bebf185381bf57ac4769ddb3f42)
Branch: v5.0
https://github.com/mongodb/mongo/commit/e04ed6b3ab9893c6fabe377b565fc177c4715d87

Comment by Githook User [ 21/Oct/21 ]

Author: Max Hirschhorn <max.hirschhorn@mongodb.com> (username: visemet)

Message: SERVER-60858 Interrupt wait on joined ReshardingCoordinator on stepdown.

(cherry picked from commit 79cbc3d7168a4bebf185381bf57ac4769ddb3f42)
Branch: v5.1
https://github.com/mongodb/mongo/commit/c4bfc8514d6879e78fd4298e2702de37e5f933e3

Comment by Githook User [ 21/Oct/21 ]

Author: Max Hirschhorn <max.hirschhorn@mongodb.com> (username: visemet)

Message: SERVER-60858 Interrupt wait on joined ReshardingCoordinator on stepdown.
Branch: master
https://github.com/mongodb/mongo/commit/79cbc3d7168a4bebf185381bf57ac4769ddb3f42

Generated at Thu Feb 08 05:50:55 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.