SERVER-92857 (Core Server)

Resharding Coordinator's abort hangs if it encounters an unrecoverable error while establishing participants

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: 6.0.0
    • Component/s: None
    • Cluster Scalability
    • ALL

      A quick way to reproduce this behavior is to simulate a WriteConcernFailed error from any of the participants while executing the _flushRoutingTableCacheUpdatesWithWriteConcern command. Another way is to return a fake error status here

      Will add further details after investigating the cause of the WriteConcernError.
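One possible way to simulate the error, sketched under assumptions: the generic `failCommand` failpoint can make a node fail a named command with a chosen error code. The sketch below only builds the `configureFailPoint` document; the helper names are hypothetical, the node must be started with test commands enabled, and it has not been verified that this reproduces the hang described in this ticket.

```python
WRITE_CONCERN_FAILED = 64  # numeric code of the WriteConcernFailed error


def build_fail_flush_command(times=1):
    """Build a configureFailPoint document for the failCommand failpoint.

    Hypothetical helper: makes the target node fail the flush command
    `times` times with WriteConcernFailed.
    """
    return {
        "configureFailPoint": "failCommand",
        "mode": {"times": times},
        "data": {
            "failCommands": ["_flushRoutingTableCacheUpdatesWithWriteConcern"],
            "errorCode": WRITE_CONCERN_FAILED,
            # The command arrives from another cluster node, so internal
            # clients must be included for the failpoint to trigger.
            "failInternalCommands": True,
        },
    }


def enable_on_shard(shard_client, times=1):
    """Hypothetical helper: `shard_client` is assumed to be a pymongo
    MongoClient connected directly to a participant shard's primary."""
    return shard_client.admin.command(build_fail_flush_command(times))
```

The failpoint would be enabled on a participant shard before starting the resharding operation, so that the coordinator's call to that shard returns the simulated error.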

      As seen in HELP-62301, if the ReshardingCoordinator receives an unexpected error (such as WriteConcernFailed) while it is establishing the resharding participants (i.e., while it is executing the _flushRoutingTableCacheUpdatesWithWriteConcern command) and it is in a state greater than "preparing-to-donate", it tries to abort itself and the participants. The abort of the participants then hangs indefinitely: the coordinator waits for every participant to complete its state transition to Done, but participants that never established their state machines do not undergo any state transitions.

      max.hirschhorn@mongodb.com explained this behavior here

      (From Max: "we're getting stuck in the way resharding coordinator is written because it doesn't expect that error. the resharding coordinator expects that all shards are established as participants before it would run _shardsvrAbortReshardCollection on them but a non-recoverable error during establishing participants means that the resharding coordinator didn't establish all the participants yet tries to wait for an acknowledgment from all of them. I think the general approach taken by the resharding coordinator can be revisited under the project to rewrite the resharding coordinator such that it doesn't rely on using the shard version protocol to prompt shards to make progress and instead have explicit, idempotent commands for each phase")
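The hang described above can be illustrated with a toy model. This is not MongoDB code and not a proposed fix: the function and parameter names are hypothetical, and a timeout stands in for "hangs indefinitely". Each shard has a future representing its transition to Done, and only shards that were established as participants ever resolve it.

```python
import asyncio


async def abort_and_wait(all_shards, established, wait_on, timeout=0.1):
    """Toy model of the coordinator's abort path.

    Hypothetical names throughout: `established` are the shards that created
    a participant state machine, `wait_on` are the shards whose Done
    acknowledgment the coordinator waits for.
    """
    loop = asyncio.get_running_loop()
    done = {shard: loop.create_future() for shard in all_shards}

    # Only established participants can transition to Done and acknowledge.
    for shard in established:
        done[shard].set_result("done")

    try:
        await asyncio.wait_for(
            asyncio.gather(*(done[shard] for shard in wait_on)), timeout
        )
        return "aborted"
    except asyncio.TimeoutError:
        # In the real system there is no timeout, so the wait never returns.
        return "hung"


shards = ["shard0", "shard1", "shard2"]
established = ["shard0"]  # establishment failed before reaching the others

# Waiting on every shard never completes.
print(asyncio.run(abort_and_wait(shards, established, wait_on=shards)))
# Contrast case: waiting only on established shards completes.
print(asyncio.run(abort_and_wait(shards, established, wait_on=established)))
```

The first call models the behavior reported here; the second is only a contrast case showing that the wait completes when every awaited shard actually has a state machine.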

      (Note: this was investigated on v6.0; we still need to verify whether later versions show the same behavior.)

            Assignee: Unassigned
            Reporter: Nandini Bhartiya (nandini.bhartiya@mongodb.com)
            Votes: 0
            Watchers: 10
