  1. Core Server
  2. SERVER-66046

Resharding coordinator won't automatically abort the resharding operation when a recipient shard errors during its applying phase

    • Fully Compatible
    • ALL
    • v6.0, v5.0
    • Sharding NYC 2022-05-30, Sharding NYC 2022-06-13
    • 3

      While the recipient shards are in RecipientStateEnum::kApplying, they continuously fetch oplog entries for writes on the donor shards and apply them. If there's an operation-fatal error while applying an oplog entry, the recipient shard will transition to RecipientStateEnum::kError and inform the coordinator shard.

      [j0:s0:prim] | 2022-04-27T09:17:49.060+00:00 I  RESHARD  4956500 [ReshardingRecipientService-1] "Resharding operation recipient state machine failed","attr":{"namespace":"test0_fsmdb0.fsmcoll0","reshardingUUID":{"uuid":{"$uuid":"08d271ae-91a9-4f52-9b2f-7de7eb4a0a33"}},"error":"OplogOperationUnsupported: Command not supported during resharding: { oplogEntry: { op: \"c\", ns: \"test0_fsmdb0.fsmcoll0\", ui: UUID(\"07b07822-7c51-410d-85e2-c5d5d4060998\"), o: { dbCheck: \"test0_fsmdb0.fsmcoll0\", type: \"batch\", md5: \"d381a905564387e42a68127855fecdf6\", minKey: MinKey, maxKey: MaxKey, readTimestamp: Timestamp(1651051063, 135), applyOps: null }, ts: Timestamp(1651051063, 156), t: 1, v: 2, wall: new Date(1651051063854), _id: { clusterTime: Timestamp(1651051063, 156), ts: Timestamp(1651051063, 156) } } }"}
      [j0:s0:prim] | 2022-04-27T09:17:49.061+00:00 I  RESHARD  5279506 [ReshardingRecipientService-1] "Transitioned resharding recipient state","attr":{"newState":"error","oldState":"applying","namespace":"test0_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"07b07822-7c51-410d-85e2-c5d5d4060998"}},"reshardingUUID":{"uuid":{"$uuid":"08d271ae-91a9-4f52-9b2f-7de7eb4a0a33"}}}
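A minimal Python sketch of how the recipient's transition into the error state could be picked out of log output like the lines above. The line layout is assumed to match the test-fixture format shown in this ticket, the log ID 5279506 is taken from the excerpt, and the function name `recipient_error_transition` is hypothetical.

```python
import json
import re

# Log ID copied from the "Transitioned resharding recipient state" line above.
RECIPIENT_TRANSITION_LOG_ID = 5279506

# Assumed layout: ... <id> [<context>] "<message>","attr":{...}
_LINE = re.compile(
    r'\s(?P<id>\d+)\s+\[[^\]]+\]\s+"[^"]*","attr":(?P<attr>\{.*\})\s*$'
)

SAMPLE_LINE = (
    '[j0:s0:prim] | 2022-04-27T09:17:49.061+00:00 I  RESHARD  5279506 '
    '[ReshardingRecipientService-1] "Transitioned resharding recipient state",'
    '"attr":{"newState":"error","oldState":"applying",'
    '"namespace":"test0_fsmdb0.fsmcoll0",'
    '"collectionUUID":{"uuid":{"$uuid":"07b07822-7c51-410d-85e2-c5d5d4060998"}},'
    '"reshardingUUID":{"uuid":{"$uuid":"08d271ae-91a9-4f52-9b2f-7de7eb4a0a33"}}}'
)


def recipient_error_transition(line):
    """Return details if the line records a recipient entering "error", else None."""
    match = _LINE.search(line)
    if match is None or int(match.group("id")) != RECIPIENT_TRANSITION_LOG_ID:
        return None
    attr = json.loads(match.group("attr"))
    if attr.get("newState") != "error":
        return None
    return {
        "namespace": attr["namespace"],
        "oldState": attr.get("oldState"),
        "reshardingUUID": attr["reshardingUUID"]["uuid"]["$uuid"],
    }


if __name__ == "__main__":
    print(recipient_error_transition(SAMPLE_LINE))
```

Matching on the log ID rather than the message text is a design choice of this sketch; it assumes the ID stays stable for this message.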
      

      While the recipient shards are in RecipientStateEnum::kApplying, the coordinator shard monitors for an opportune moment to commit the resharding operation, based on how caught up the recipient shards are with the writes on the donor shards. The coordinator shard won't realize that the recipient shards will never reach an opportune time to commit, because a recipient has already failed and the resharding operation must abort. Instead it keeps polling: in the log below, remainingTimeMillis is still 5163 more than 12 seconds after the first poll.

      [j0:c:prim] | 2022-04-27T09:17:49.089+00:00 I  RESHARD  5391602 [ReshardingCoordinatorService-2] "Resharding operation waiting for an okay to enter critical section"
      [j0:c:prim] | 2022-04-27T09:17:49.089+00:00 I  RESHARD  5392001 [ReshardingCoordinatorService-2] "Querying recipient shards for the remaining operation time","attr":{"namespace":"test0_fsmdb0.fsmcoll0"}
      [j0:c:prim] | 2022-04-27T09:17:49.090+00:00 I  RESHARD  5392002 [ReshardingCoordinatorService-2] "Finished querying recipient shards for the remaining operation time","attr":{"namespace":"test0_fsmdb0.fsmcoll0","remainingTimeMillis":5163}
      ...
      [j0:c:prim] | 2022-04-27T09:18:01.750+00:00 I  RESHARD  5392001 [ReshardingCoordinatorService-2] "Querying recipient shards for the remaining operation time","attr":{"namespace":"test0_fsmdb0.fsmcoll0"}
      [j0:c:prim] | 2022-04-27T09:18:01.751+00:00 I  RESHARD  5392002 [ReshardingCoordinatorService-2] "Finished querying recipient shards for the remaining operation time","attr":{"namespace":"test0_fsmdb0.fsmcoll0","remainingTimeMillis":5163}
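The unchanged estimate in the coordinator log above can be checked mechanically. The following Python sketch extracts the remainingTimeMillis samples (log ID 5392002, taken from the excerpt) and flags an estimate that has not decreased over a window. The function names and the 10-second window are assumptions made for illustration, and a flat estimate is only a hint that something is wrong, not proof that a recipient has failed.

```python
import json
import re
from datetime import datetime, timedelta

# Log ID copied from the "Finished querying recipient shards ..." lines above.
ESTIMATE_LOG_ID = 5392002

# Assumed layout: [fixture] | <timestamp> <severity> <component> <id> [<context>] "<msg>","attr":{...}
_LINE = re.compile(
    r'\|\s+(?P<ts>\S+)\s+\w+\s+\w+\s+(?P<id>\d+)\s+\[[^\]]+\]\s+"[^"]*",'
    r'"attr":(?P<attr>\{.*\})\s*$'
)

SAMPLE_LINES = [
    '[j0:c:prim] | 2022-04-27T09:17:49.090+00:00 I  RESHARD  5392002 '
    '[ReshardingCoordinatorService-2] "Finished querying recipient shards for '
    'the remaining operation time","attr":{"namespace":"test0_fsmdb0.fsmcoll0",'
    '"remainingTimeMillis":5163}',
    '[j0:c:prim] | 2022-04-27T09:18:01.751+00:00 I  RESHARD  5392002 '
    '[ReshardingCoordinatorService-2] "Finished querying recipient shards for '
    'the remaining operation time","attr":{"namespace":"test0_fsmdb0.fsmcoll0",'
    '"remainingTimeMillis":5163}',
]


def remaining_time_samples(lines):
    """Collect (timestamp, remainingTimeMillis) pairs from coordinator log lines."""
    samples = []
    for line in lines:
        match = _LINE.search(line)
        if match and int(match.group("id")) == ESTIMATE_LOG_ID:
            attr = json.loads(match.group("attr"))
            timestamp = datetime.fromisoformat(match.group("ts"))
            samples.append((timestamp, attr["remainingTimeMillis"]))
    return samples


def looks_stalled(samples, min_window=timedelta(seconds=10)):
    """True if the estimate has not decreased between the first and last sample."""
    if len(samples) < 2:
        return False
    (first_ts, first_ms), (last_ts, last_ms) = samples[0], samples[-1]
    return last_ts - first_ts >= min_window and last_ms >= first_ms


if __name__ == "__main__":
    print(looks_stalled(remaining_time_samples(SAMPLE_LINES)))
```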
      

      As a workaround, an operator can manually issue the abortReshardCollection command to cancel the resharding operation.
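The manual abort could be scripted roughly as follows. This is a sketch, assuming a PyMongo-style client connected to a mongos; the helper name `abort_resharding` and the connection string in the usage comment are hypothetical. The command itself is run against the admin database with the collection's full namespace as its value.

```python
def abort_resharding(client, namespace):
    """Send abortReshardCollection for `namespace` ("<database>.<collection>").

    `client` is expected to behave like pymongo.MongoClient, that is,
    `client.admin.command(...)` accepts a command document.
    """
    database, _, collection = namespace.partition(".")
    if not database or not collection:
        raise ValueError("namespace must look like '<database>.<collection>'")
    return client.admin.command({"abortReshardCollection": namespace})


# Usage (requires PyMongo and a reachable mongos; the address is a placeholder):
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://localhost:27017")
#   abort_resharding(client, "test0_fsmdb0.fsmcoll0")
```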

            Assignee:
            nandini.bhartiya@mongodb.com Nandini Bhartiya
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn
            Votes:
            0
            Watchers:
            3
