  1. Core Server
  2. SERVER-47025

moveChunk after refine shard key can hang indefinitely due to missing shard key index

    • Sharding EMEA
    • Fully Compatible
    • ALL
    • v4.4
    • Sharding 2020-04-06, Sharding 2020-04-20, Sharding 2020-05-04, Sharding 2020-05-18, Sharding 2020-07-13, Sharding 2020-06-01, Sharding 2020-06-15, Sharding 2020-06-29, Sharding 2020-07-27, Sharding 2020-08-24

      When the resumable range deleter is disabled, the recipient of a chunk starts by removing potentially orphaned documents. After that, it clones necessary indexes from the donor.

      However, the range deleter relies on the shard key index to perform deletions, so the deletion step can run before the recipient is guaranteed to have that index.

      This can lead to the following scenario:
      1. A moveChunk begins
      2. The shard key is refined
      3. The moveChunk fails on the recipient for some reason, causing the entire moveChunk to fail
      4. The moveChunk is restarted, now with a refined shard key
      5. The recipient of the moveChunk attempts to delete the incoming range using the range deleter with the refined shard key
      6. The range deleter loops infinitely because it is unable to find a shard key index.

      There may be less convoluted scenarios that cause this as well, but I'm having trouble thinking of one.

      Repro attached.

            backlog-server-sharding-emea [DO NOT USE] Backlog - Sharding EMEA
            matthew.saltz@mongodb.com Matthew Saltz (Inactive)