[SERVER-66046] Resharding coordinator won't automatically abort the resharding operation when a recipient shard errors during its applying phase Created: 28/Apr/22  Updated: 29/Oct/23  Resolved: 08/Jun/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.3.0, 5.0.0, 6.0.0-rc3
Fix Version/s: 5.0.10, 6.0.0-rc10, 6.1.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Nandini Bhartiya
Resolution: Fixed
Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
is related to SERVER-63855: Make dbCheck work with resharding (Backlog)
is related to SERVER-66011: Enable internal_transactions_reshardi... (Closed)
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0, v5.0
Sprint: Sharding NYC 2022-05-30, Sharding NYC 2022-06-13
Participants:
Story Points: 3

 Description   

While the recipient shards are in RecipientStateEnum::kApplying, they continuously fetch oplog entries for writes on the donor shards and apply them. If there's an operation-fatal error while applying an oplog entry, the recipient shard will transition to RecipientStateEnum::kError and inform the coordinator shard.

[j0:s0:prim] | 2022-04-27T09:17:49.060+00:00 I  RESHARD  4956500 [ReshardingRecipientService-1] "Resharding operation recipient state machine failed","attr":{"namespace":"test0_fsmdb0.fsmcoll0","reshardingUUID":{"uuid":{"$uuid":"08d271ae-91a9-4f52-9b2f-7de7eb4a0a33"}},"error":"OplogOperationUnsupported: Command not supported during resharding: { oplogEntry: { op: \"c\", ns: \"test0_fsmdb0.fsmcoll0\", ui: UUID(\"07b07822-7c51-410d-85e2-c5d5d4060998\"), o: { dbCheck: \"test0_fsmdb0.fsmcoll0\", type: \"batch\", md5: \"d381a905564387e42a68127855fecdf6\", minKey: MinKey, maxKey: MaxKey, readTimestamp: Timestamp(1651051063, 135), applyOps: null }, ts: Timestamp(1651051063, 156), t: 1, v: 2, wall: new Date(1651051063854), _id: { clusterTime: Timestamp(1651051063, 156), ts: Timestamp(1651051063, 156) } } }"}
[j0:s0:prim] | 2022-04-27T09:17:49.061+00:00 I  RESHARD  5279506 [ReshardingRecipientService-1] "Transitioned resharding recipient state","attr":{"newState":"error","oldState":"applying","namespace":"test0_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"07b07822-7c51-410d-85e2-c5d5d4060998"}},"reshardingUUID":{"uuid":{"$uuid":"08d271ae-91a9-4f52-9b2f-7de7eb4a0a33"}}}
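The transition shown in the two log lines above can be pictured with a minimal sketch. This is not the server's implementation: RecipientSketch, OplogEntry, and isSupportedDuringResharding are hypothetical names, and treating dbCheck as the only unsupported command is an assumption taken from this one log excerpt.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, reduced stand-in for the recipient states named in this ticket.
enum class RecipientState { kApplying, kError };

// Hypothetical oplog entry carrying only the fields this sketch needs.
struct OplogEntry {
    std::string op;           // "i", "u", "d", "c", ...
    std::string commandName;  // set when op == "c", otherwise empty
};

// Assumption for illustration: dbCheck is the only command rejected here.
bool isSupportedDuringResharding(const OplogEntry& entry) {
    return !(entry.op == "c" && entry.commandName == "dbCheck");
}

class RecipientSketch {
public:
    // Applies entries in order. The first unsupported entry is operation-fatal:
    // the recipient moves to kError and records the reason it would report.
    void applyBatch(const std::vector<OplogEntry>& batch) {
        for (const OplogEntry& entry : batch) {
            if (_state != RecipientState::kApplying) {
                return;
            }
            if (!isSupportedDuringResharding(entry)) {
                _state = RecipientState::kError;
                _abortReason = "OplogOperationUnsupported: " + entry.commandName;
                return;
            }
            ++_applied;
        }
    }

    RecipientState state() const { return _state; }
    std::size_t applied() const { return _applied; }

    // Only meaningful once the recipient has failed.
    const std::string& abortReason() const {
        assert(_state == RecipientState::kError);
        return _abortReason;
    }

private:
    RecipientState _state = RecipientState::kApplying;
    std::size_t _applied = 0;
    std::string _abortReason;
};
```

In this sketch, entries after the failing one are never applied, which mirrors the point of the ticket: once a recipient is in the error state it makes no further progress toward being caught up.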

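While the recipient shards are in RecipientStateEnum::kApplying, the coordinator shard monitors for an opportune moment to commit the resharding operation based on how caught up the recipient shards are with the writes on the donor shards. After a recipient shard has errored, the coordinator shard doesn't realize that the recipient shards will never reach an opportune moment to commit, because the resharding operation must abort. It keeps polling for the remaining operation time instead, and the reported value never changes (remainingTimeMillis is 5163 in both samples below, taken about 12 seconds apart).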

[j0:c:prim] | 2022-04-27T09:17:49.089+00:00 I  RESHARD  5391602 [ReshardingCoordinatorService-2] "Resharding operation waiting for an okay to enter critical section"
[j0:c:prim] | 2022-04-27T09:17:49.089+00:00 I  RESHARD  5392001 [ReshardingCoordinatorService-2] "Querying recipient shards for the remaining operation time","attr":{"namespace":"test0_fsmdb0.fsmcoll0"}
[j0:c:prim] | 2022-04-27T09:17:49.090+00:00 I  RESHARD  5392002 [ReshardingCoordinatorService-2] "Finished querying recipient shards for the remaining operation time","attr":{"namespace":"test0_fsmdb0.fsmcoll0","remainingTimeMillis":5163}
...
[j0:c:prim] | 2022-04-27T09:18:01.750+00:00 I  RESHARD  5392001 [ReshardingCoordinatorService-2] "Querying recipient shards for the remaining operation time","attr":{"namespace":"test0_fsmdb0.fsmcoll0"}
[j0:c:prim] | 2022-04-27T09:18:01.751+00:00 I  RESHARD  5392002 [ReshardingCoordinatorService-2] "Finished querying recipient shards for the remaining operation time","attr":{"namespace":"test0_fsmdb0.fsmcoll0","remainingTimeMillis":5163}
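A minimal sketch of why the polling above never terminates, under stated assumptions: the names RecipientReport, decideByRemainingTimeOnly, and decideWithErrorCheck are hypothetical, and the 5000 ms threshold in the usage note is an illustrative value rather than something stated in this ticket. The second function is only one way to picture the missing check; it is not the committed fix.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, reduced stand-ins for the states and decisions discussed here.
enum class RecipientState { kApplying, kError };
enum class Decision { kKeepWaiting, kEnterCriticalSection, kAbort };

// What one recipient reports back in this sketch.
struct RecipientReport {
    RecipientState state;
    long remainingMillis;
};

// Largest remaining time across recipients (assumed aggregation rule).
long maxRemainingMillis(const std::vector<RecipientReport>& reports) {
    assert(!reports.empty());
    long worst = reports.front().remainingMillis;
    for (const RecipientReport& report : reports) {
        if (report.remainingMillis > worst) {
            worst = report.remainingMillis;
        }
    }
    return worst;
}

// Behaviour as described in this ticket: only the remaining time is consulted,
// so an errored recipient with a frozen estimate keeps the loop waiting.
Decision decideByRemainingTimeOnly(const std::vector<RecipientReport>& reports,
                                   long thresholdMillis) {
    return maxRemainingMillis(reports) <= thresholdMillis
        ? Decision::kEnterCriticalSection
        : Decision::kKeepWaiting;
}

// Illustrative alternative: any recipient in kError short-circuits to an abort.
Decision decideWithErrorCheck(const std::vector<RecipientReport>& reports,
                              long thresholdMillis) {
    for (const RecipientReport& report : reports) {
        if (report.state == RecipientState::kError) {
            return Decision::kAbort;
        }
    }
    return decideByRemainingTimeOnly(reports, thresholdMillis);
}
```

With a single report of {kError, 5163} and an assumed threshold of 5000, the first function returns kKeepWaiting on every poll, while the second returns kAbort immediately.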

As a workaround, an operator can manually issue the abortReshardCollection command to cancel the resharding operation.



 Comments   
Comment by Githook User [ 09/Jun/22 ]

Author: nandinibhartiyaMDB <nandini.bhartiya@mongodb.com>

Message: SERVER-66046: Abort resharding on recipient errors
Branch: v5.0
https://github.com/mongodb/mongo/commit/3d7c06f37ee34b36197d669a5f25e283b164c0eb

Comment by Githook User [ 09/Jun/22 ]

Author: nandinibhartiyaMDB <nandini.bhartiya@mongodb.com>

Message: SERVER-66046: Abort resharding on recipient errors

(cherry picked from commit f016b1053908e031dbcec48ffb0a30fa63ba7e3d)
Branch: v6.0
https://github.com/mongodb/mongo/commit/52f58747e86fedf2ca095f16a64b41cc6c565034

Comment by Githook User [ 08/Jun/22 ]

Author: nandinibhartiyaMDB <nandini.bhartiya@mongodb.com>

Message: SERVER-66046: Abort resharding on recipient errors
Branch: master
https://github.com/mongodb/mongo/commit/f016b1053908e031dbcec48ffb0a30fa63ba7e3d

Comment by Max Hirschhorn [ 09/May/22 ]

We think a possible solution would be to wait on whenAny(_canEnterCritical.getFuture(), _reshardingCoordinatorObserver->awaitAllRecipientsInStrictConsistency()) once the commit monitor has been started, so that the coordinator fails early when a recipient shard will never reach strict consistency.
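The "first of two signals wins" idea behind that proposal can be sketched in a single-threaded, callback-based form. This is not the server's future library: OneShot, Outcome, and this whenAny are hypothetical stand-ins, and the sketch assumes that the strict-consistency signal resolves with an error once a recipient reports a failure.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Result carried by a signal: success, or failure with a reason.
struct Outcome {
    bool ok;
    std::string reason;
};

// Hypothetical one-shot signal. The first resolve() wins; later calls are ignored.
class OneShot {
public:
    void resolve(const Outcome& outcome) {
        if (_ready) {
            return;
        }
        _ready = true;
        _outcome = outcome;
        if (_callback) {
            _callback(_outcome);
        }
    }

    // Registers a single continuation; runs it at once if already resolved.
    void then(std::function<void(const Outcome&)> callback) {
        if (_ready) {
            callback(_outcome);
        } else {
            _callback = std::move(callback);
        }
    }

    bool ready() const { return _ready; }

    const Outcome& outcome() const {
        assert(_ready);
        return _outcome;
    }

private:
    bool _ready = false;
    Outcome _outcome{true, ""};
    std::function<void(const Outcome&)> _callback;
};

// Resolves `out` with whichever of `a` or `b` resolves first.
// `out` is captured by reference, so it must outlive both inputs.
void whenAny(OneShot& a, OneShot& b, OneShot& out) {
    a.then([&out](const Outcome& o) { out.resolve(o); });
    b.then([&out](const Outcome& o) { out.resolve(o); });
}
```

If the signal standing in for awaitAllRecipientsInStrictConsistency() resolves with an error before the one standing in for _canEnterCritical, the combined signal carries that error, and a coordinator waiting on it could abort instead of polling indefinitely.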

Generated at Thu Feb 08 06:04:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.