Core Server / SERVER-47025

moveChunk after refine shard key can hang indefinitely due to missing shard key index

Details

    • Bug
    • Status: Backlog
    • Major - P3
    • Resolution: Unresolved
    • None
    • None
    • Sharding
    • Sharding EMEA
    • ALL
    • v4.4
    • Sharding 2020-04-06, Sharding 2020-04-20, Sharding 2020-05-04, Sharding 2020-05-18, Sharding 2020-07-13, Sharding 2020-06-01, Sharding 2020-06-15, Sharding 2020-06-29, Sharding 2020-07-27, Sharding 2020-08-24

    Description

      When the resumable range deleter is disabled, the recipient of a chunk first removes any potentially orphaned documents in the incoming range. Only after that does it clone the necessary indexes from the donor.

      However, the range deleter relies on the shard key index to perform deletions, so the deletion can run before the recipient has an index on the current shard key.
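
To make the index requirement concrete, here is a minimal Python sketch of a prefix check. It is an illustration only, not the server's implementation: the function name `find_shard_key_index` and the field names `x` and `y` are hypothetical. It assumes an index counts as a shard key index when its leading fields match the shard key, and that refining a shard key appends suffix fields to the existing key.

```python
from typing import Optional, Sequence, Tuple

# A key pattern is an ordered tuple of (field, direction) pairs.
KeyPattern = Tuple[Tuple[str, int], ...]


def find_shard_key_index(
    indexes: Sequence[KeyPattern], shard_key: KeyPattern
) -> Optional[KeyPattern]:
    """Return the first index whose leading fields equal the shard key, else None."""
    n = len(shard_key)
    for key_pattern in indexes:
        if key_pattern[:n] == shard_key:
            return key_pattern
    return None


# Hypothetical recipient that only has indexes for the original shard key {x: 1}.
recipient_indexes = [(("_id", 1),), (("x", 1),)]
old_key = (("x", 1),)
refined_key = (("x", 1), ("y", 1))  # refined key adds the suffix field y

print(find_shard_key_index(recipient_indexes, old_key))
print(find_shard_key_index(recipient_indexes, refined_key))
```

Under these assumptions the lookup succeeds for the original key and finds nothing for the refined key, which is the state the range deleter encounters on the recipient.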

      This can lead to the following scenario:
      1. A moveChunk begins
      2. The shard key is refined
      3. The moveChunk fails on the recipient for some reason, causing the entire moveChunk to fail
      4. The moveChunk is restarted, now with a refined shard key
      5. The recipient of the moveChunk attempts to delete the incoming range using the range deleter with the refined shard key
      6. The range deleter loops indefinitely because it cannot find a shard key index
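
The steps above can be sketched as a toy Python model. Everything here is hypothetical and simplified: `Shard`, `delete_range`, and `receive_chunk` are illustrative names, indexes are modelled as tuples of field names, and the retry loop is capped with `max_attempts` so that the example terminates (the behaviour described in this ticket has no such cap).

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

KeyPattern = Tuple[str, ...]  # simplified: field names only, no directions


@dataclass
class Shard:
    name: str
    indexes: Set[KeyPattern] = field(default_factory=set)


def has_shard_key_index(shard: Shard, shard_key: KeyPattern) -> bool:
    """True if some index on the shard has the shard key as a prefix."""
    n = len(shard_key)
    return any(ix[:n] == shard_key for ix in shard.indexes)


def delete_range(shard: Shard, shard_key: KeyPattern, max_attempts: int) -> int:
    """Toy range deleter: keeps retrying until a shard key index is found.

    Nothing creates the index between attempts, so without the cap this
    loop would never exit when the index is missing.
    """
    for attempt in range(1, max_attempts + 1):
        if has_shard_key_index(shard, shard_key):
            return attempt
    raise TimeoutError(f"no index for shard key {shard_key} on {shard.name}")


def receive_chunk(
    donor: Shard, recipient: Shard, shard_key: KeyPattern, max_attempts: int = 5
) -> List[str]:
    """Recipient side of a migration: delete the range first, clone indexes second."""
    log = []
    delete_range(recipient, shard_key, max_attempts)
    log.append("deleted range")
    recipient.indexes |= donor.indexes  # index cloning happens only after deletion
    log.append("cloned indexes")
    return log


# Assumed state when the moveChunk is restarted after the shard key was refined.
donor = Shard("donor", {("_id",), ("x",), ("x", "y")})
recipient = Shard("recipient", {("_id",), ("x",)})

try:
    receive_chunk(donor, recipient, ("x", "y"))
except TimeoutError as exc:
    print("migration stuck:", exc)
```

In this model the recipient never reaches the index cloning step, because the deletion that precedes it depends on an index that only the cloning step would create.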

      There may be less convoluted scenarios that could cause this as well, but I'm having trouble thinking of one.

      Repro attached.

            People

              backlog-server-sharding-emea Backlog - Sharding EMEA
              matthew.saltz@mongodb.com Matthew Saltz (Inactive)
              Votes: 0
              Watchers: 11