  1. Core Server
  2. SERVER-104494

Resharding can hang if recipient steps down during ShardsvrReshardRecipientClone

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • ALL
    • 3
    • TBD

      During a reshardCollection operation, if a recipient shard steps down while processing the ShardsvrReshardRecipientCloneCommand, before it has persisted the recipient state document and waited for the majority write to complete, the command can hang indefinitely.

      The hang occurs because the command waits on a future, _transitionedToCreateCollection, that is not fulfilled with an error during the recipient service's mandatory cleanup on stepdown. The command's operation context provides a cancellation token, which is intended to allow the wait to be interrupted when the shard steps down. However, the method used to hook up this cancellation, setAlwaysInterruptAtStepDownOrUp_UNSAFE, is unsafe because it does not properly synchronize with the RSTL: if the command misses the state transition, it may not be interrupted as expected during stepdown. As a result, the resharding operation does not make forward progress.
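
      The waiting pattern described above can be illustrated with a toy model. This is only a sketch in Python, not server code: ToyRecipient, cleanup_on_stepdown, run_clone_command and the fail_pending flag are hypothetical names, and a short timeout stands in for "waits forever". It assumes that the only thing able to wake the waiter is the future being completed, which is the situation when the interrupt is missed.

```python
from concurrent.futures import Future, TimeoutError as FutureTimeout


class ToyRecipient:
    """Hypothetical stand-in for a recipient state machine (not MongoDB code)."""

    def __init__(self):
        # Plays the role of _transitionedToCreateCollection.
        self.transitioned = Future()

    def cleanup_on_stepdown(self, fail_pending):
        # Assumption: cleanup unblocks waiters only if it fails the future.
        if fail_pending and not self.transitioned.done():
            self.transitioned.set_exception(
                RuntimeError("interrupted due to stepdown"))


def run_clone_command(recipient, timeout_s=0.2):
    """Wait on the future; the timeout is a stand-in for an unbounded wait."""
    try:
        recipient.transitioned.result(timeout=timeout_s)
        return "ok"
    except FutureTimeout:
        return "hung"
    except RuntimeError as exc:
        return f"failed: {exc}"


# Cleanup that leaves the future untouched: the waiter is never woken up.
leaky = ToyRecipient()
leaky.cleanup_on_stepdown(fail_pending=False)
hung_result = run_clone_command(leaky)

# Cleanup that fails the future: the waiter returns promptly with an error.
safe = ToyRecipient()
safe.cleanup_on_stepdown(fail_pending=True)
failed_result = run_clone_command(safe)
```

      In this model the first call reports a hang and the second reports an error that a caller could retry on, which suggests that failing the future during cleanup would unblock the command regardless of whether the interrupt is delivered.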

      I think one way to resume progress would be to fail over the config server primary, which would make it retry the command.

            Assignee:
            Unassigned
            Reporter:
            abdul.qadeer@mongodb.com Abdul Qadeer
            Votes:
            0
            Watchers:
            2
