Core Server / SERVER-104258

Resharding Can Hang If Recipient Fails During ShardsvrReshardRecipientClone

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 8.1.0-rc4
    • Component/s: None
    • Cluster Scalability
    • ClusterScalability Apr28-May09

      If the featureFlagReshardingCloneNoRefresh is enabled, the coordinator will use ShardsvrReshardRecipientClone to tell the recipients to begin cloning.

      When the coordinator sends this command, it blocks until the command returns. The command itself calls fulfillAllDonorsPreparedToDonate, which fulfills a promise and then waits for a future to complete. That future does use a cancellation token, but the token's source is the opCtx running the command, which has no knowledge of resharding or its progress. If the recipient hits an error before fulfilling _transitionedToCreateCollection, the command therefore hangs indefinitely, and the coordinator hangs with it.
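      The hang described above can be modelled with a small, self-contained sketch. This is a toy Python model, not MongoDB code: RecipientModel, clone_command, simulate, and the max_polls bound are all hypothetical names introduced for illustration, and the bound exists only so the demo terminates (the wait described in this ticket has no such limit).

```python
import threading
from concurrent.futures import Future, TimeoutError as FutureTimeout


class RecipientModel:
    """Toy stand-in for the recipient state machine (hypothetical)."""

    def __init__(self):
        # Plays the role of _transitionedToCreateCollection.
        self.transitioned = Future()

    def run(self, fail_before_transition):
        if fail_before_transition:
            # Non-retryable error: the state machine stops here and the
            # future is never fulfilled or failed.
            return
        self.transitioned.set_result(None)


def clone_command(recipient, op_cancelled, poll_seconds=0.05, max_polls=10):
    """Wait on the future, interruptible only by the operation's own token.

    max_polls is an assumption that keeps the demo finite.
    """
    for _ in range(max_polls):
        if op_cancelled.is_set():
            return "cancelled"
        try:
            recipient.transitioned.result(timeout=poll_seconds)
            return "ok"
        except FutureTimeout:
            continue
    return "still waiting"


def simulate(fail_before_transition):
    recipient = RecipientModel()
    # Models the opCtx-sourced token: nothing in resharding ever cancels it.
    op_cancelled = threading.Event()
    worker = threading.Thread(target=recipient.run, args=(fail_before_transition,))
    worker.start()
    worker.join()
    return clone_command(recipient, op_cancelled)
```

      In this model, simulate(False) returns once the future is fulfilled, while simulate(True) exhausts the demo bound because neither the future nor the token ever changes state.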

      Any non-retryable error in the recipient within this region can trigger the hang. Until SERVER-102452 is done, an obvious way for this to happen is for the wait for majority to fail here.

      Resolving this is also not as simple as using the abortSource as the cancellation token. The abortSource is only actually cancelled when the coordinator updates itself to aborting or calls ShardsvrAbortReshardCollectionCommand, which it won't do because it is itself blocked on ShardsvrReshardRecipientClone.
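      The circular wait in that paragraph can be sketched the same way. Again this is a toy Python model under stated assumptions, not the server's implementation: simulate_abort_source, the three events, and demo_bound_seconds are hypothetical, and the time bound only exists so the demo ends.

```python
import threading
import time


def simulate_abort_source(recipient_fails, demo_bound_seconds=0.3):
    transitioned = threading.Event()      # recipient reached the awaited state
    abort_source = threading.Event()      # stand-in for the abortSource token
    command_returned = threading.Event()

    def recipient():
        if not recipient_fails:
            transitioned.set()
        # On failure the recipient stops; it does not cancel abort_source.

    def clone_command():
        # The command now waits for the transition OR an abort.
        deadline = time.monotonic() + demo_bound_seconds
        while time.monotonic() < deadline:
            if transitioned.is_set() or abort_source.is_set():
                command_returned.set()
                return
            time.sleep(0.01)

    threading.Thread(target=recipient).start()
    command = threading.Thread(target=clone_command)
    command.start()

    # Coordinator: blocked on the outstanding command. Its abort path,
    # the only place abort_source would be cancelled, comes after this.
    command.join()
    coordinator_unblocked = command_returned.is_set()

    return {
        "command_returned": coordinator_unblocked,
        "abort_cancelled": abort_source.is_set(),
    }
```

      With recipient_fails=True, the command needs the abort to return and the coordinator needs the command to return before it can abort, so neither flag is ever set in this model.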

            Assignee: Unassigned
            Reporter: Brett Nawrocki (brett.nawrocki@mongodb.com)
            Votes: 0
            Watchers: 7
