[SERVER-61985] resharding_coordinator_recovers_abort_decision.js may report resharding operation as succeeding due to primary shard retrying _configsvrReshardCollection and running a second resharding operation
Created: 10/Dec/21  Updated: 29/Oct/23  Resolved: 18/Jul/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.0.11, 6.0.2, 6.1.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Abdul Qadeer
Resolution: Fixed
Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
Related
related to SERVER-73916 Improve ReshardingTest fixture error ... Closed
is related to SERVER-52730 Restrict there to be at most one resh... Closed
is related to SERVER-61607 Accept DuplicateKey as a possible err... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v6.0, v5.0
Sprint: Sharding 2022-06-27, Sharding 2022-07-11
Participants:
Linked BF Score: 163
Story Points: 3

 Description   

The ReshardingTest fixture configures the reshardingPauseCoordinatorBeforeCompletion failpoint with {times: 1}, which means the failpoint is automatically disabled once a ReshardingCoordinator reaches it and therefore won't actually pause the ReshardingCoordinator. This is problematic for cases where the reshardCollection command is expected to error (i.e. tests which use expectedErrorCode !== ErrorCodes.OK) because the primary shard can retry _configsvrReshardCollection, by which point the earlier aborted resharding operation will have been forgotten. This can lead to an entire second resharding operation running and, because it runs entirely after duringReshardingFn has finished executing, it won't abort like the first resharding operation did.
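
The difference between a {times: 1} failpoint and one that stays enabled can be sketched with a small stand-alone model. This is only an illustration under stated assumptions: the FailPoint class and pollsSpentPaused helper below are hypothetical names invented for this sketch, they do not reproduce the server's failpoint implementation, and the poll limit stands in for "until the test disables the failpoint".

```javascript
// Hypothetical, simplified model of a failpoint. NOT the server's implementation.
class FailPoint {
    constructor() {
        this.mode = "off";
        this.remaining = 0;
    }

    // Accepts "alwaysOn", "off", or {times: N}, mirroring the modes named in the ticket.
    configure(mode) {
        if (typeof mode === "object" && mode !== null) {
            this.mode = "times";
            this.remaining = mode.times;
        } else {
            this.mode = mode;
        }
    }

    // Returns true if the failpoint is active for this check. In "times" mode each
    // check consumes one use, and the failpoint turns itself off after the last one.
    shouldFail() {
        if (this.mode === "alwaysOn") {
            return true;
        }
        if (this.mode === "times" && this.remaining > 0) {
            this.remaining -= 1;
            if (this.remaining === 0) {
                this.mode = "off";
            }
            return true;
        }
        return false;
    }
}

// Counts how many polling iterations a caller stays "paused" on the failpoint.
// maxPolls is an assumed cap so the alwaysOn case terminates in this sketch.
function pollsSpentPaused(failPoint, maxPolls) {
    let polls = 0;
    while (polls < maxPolls && failPoint.shouldFail()) {
        polls += 1;
    }
    return polls;
}

const once = new FailPoint();
once.configure({times: 1});
const pausedOnce = pollsSpentPaused(once, 100);

const always = new FailPoint();
always.configure("alwaysOn");
const pausedAlways = pollsSpentPaused(always, 100);

console.log({pausedOnce, pausedAlways});
```

In this model the {times: 1} failpoint is consumed by the first check, so the caller falls straight through instead of waiting, whereas the alwaysOn failpoint holds the caller until something else turns it off.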

We should revert the changes made to the ReshardingTest fixture in 38c6aff as part of SERVER-52730 so that the ReshardingCoordinator remains paused. This will require a different way to keep the second reshardCollection command run by the resharding_prohibited_commands.js test from getting stuck, which can likely be done by passing data into the reshardingPauseCoordinatorBeforeCompletion failpoint so that it only pauses the ReshardingCoordinator for a particular source namespace.
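
A minimal sketch of the namespace-scoped pause proposed above, assuming the failpoint's data document carries a field named sourceNamespace. That field name, the makePauseCommand and shouldPause helpers, and the example namespaces are all hypothetical; only the configureFailPoint command shape (name, mode, data) and the failpoint name come from the server and this ticket.

```javascript
// Hypothetical helper: builds the command document a test fixture could send to
// enable the failpoint for a single source namespace.
function makePauseCommand(sourceNamespace) {
    return {
        configureFailPoint: "reshardingPauseCoordinatorBeforeCompletion",
        mode: "alwaysOn",
        data: {sourceNamespace: sourceNamespace},  // assumed field name
    };
}

// Hypothetical model of the check a coordinator would make: pause only when the
// failpoint is enabled and its data either names this coordinator's namespace
// or names no namespace at all.
function shouldPause(failPointCommand, coordinatorNamespace) {
    if (failPointCommand.mode !== "alwaysOn") {
        return false;
    }
    const wanted = failPointCommand.data ? failPointCommand.data.sourceNamespace : undefined;
    return wanted === undefined || wanted === coordinatorNamespace;
}

// Example namespaces are made up for illustration.
const pauseCmd = makePauseCommand("reshardingDb.coll");
const pausesTarget = shouldPause(pauseCmd, "reshardingDb.coll");
const pausesOther = shouldPause(pauseCmd, "reshardingDb.other");

console.log({pausesTarget, pausesOther});
```

With this kind of filter the fixture's own resharding operation stays paused for as long as the failpoint is enabled, while a second reshardCollection on a different namespace is not caught by the same failpoint.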

We should also revert the test changes to resharding_nonblocking_coordinator_rebuild.js from SERVER-61607, because I hadn't realized until now that the behavior of the reshardingPauseCoordinatorBeforeCompletion failpoint was the culprit.



 Comments   
Comment by Max Hirschhorn [ 01/Sep/22 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-61985 SERVER-67193 Make reshardingPauseCoordinatorBeforeCompletion failpoint pause conditionally

(cherry picked from commit 0d5fd57f9e55915550dd7d13340e2944c169c6e2)
Branch: v6.0
https://github.com/mongodb/mongo/commit/32f1d5772af14cecc6df7d06eed6e850aa39d0ee

Comment by Max Hirschhorn [ 27/Jul/22 ]

Thank you matthew.russotto@mongodb.com, I filed BF-25959 to track the Evergreen failure you observed. abdul.qadeer@mongodb.com, let's hold off on the 6.0 backport until we better understand why the _configsvrReshardCollection command is being issued twice despite the changes to the reshardingPauseCoordinatorBeforeCompletion failpoint behavior.

Comment by Matthew Russotto [ 27/Jul/22 ]

This appears to have re-introduced the bug from SERVER-61607.

https://evergreen.mongodb.com/task/mongodb_mongo_v5.0_enterprise_rhel_80_64_bit_dynamic_required_sharding_op_query_3_enterprise_rhel_80_64_bit_dynamic_required_patch_977e631b79b0409905729d4bb89137c2bb0f865f_62e00c7d850e615c232dd5b4_22_07_26_15_47_13

(note: for some reason I cannot re-open this issue. Not sure if it's a permissions problem or because a backport has been released or what).

Comment by Githook User [ 21/Jul/22 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-61985 SERVER-67193 Make reshardingPauseCoordinatorBeforeCompletion failpoint pause conditionally
Branch: v5.0
https://github.com/mongodb/mongo/commit/36eb019a6ecbe610f2f0a8205e4fb097c957d24f

Comment by Githook User [ 18/Jul/22 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-61985 SERVER-67193 Make reshardingPauseCoordinatorBeforeCompletion failpoint pause conditionally
Branch: master
https://github.com/mongodb/mongo/commit/0d5fd57f9e55915550dd7d13340e2944c169c6e2

Comment by Githook User [ 01/Jul/22 ]

Author:

{'name': 'auto-revert-processor', 'email': 'dev-prod-dag@mongodb.com'}

Message: Revert "SERVER-61985 Make reshardingPauseCoordinatorBeforeCompletion failpoint pause conditionally"

This reverts commit 88b5b28f901211cb63099b98e3c576826d82e68d.
Branch: master
https://github.com/mongodb/mongo/commit/ca8d1708e9fb26734691a177d79bfdc78febc7c5

Comment by Githook User [ 30/Jun/22 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-61985 Make reshardingPauseCoordinatorBeforeCompletion failpoint pause conditionally
Branch: master
https://github.com/mongodb/mongo/commit/88b5b28f901211cb63099b98e3c576826d82e68d
